Page tar archive format

The archive should contain only files (on the top level), no directories. The following files are mandatory:

  • metainfo.json - a file with metadata
  • index.html - HTML markup of the page itself
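The layout rules above can be checked mechanically. The following is a minimal sketch using Python's standard tarfile module; the function name check_layout and the wording of the messages are illustrative assumptions, not part of the format. It also assumes member names are stored without a leading "./".

```python
import io
import tarfile

# The two files the format requires in every page tar.
MANDATORY = {"metainfo.json", "index.html"}


def check_layout(tar_bytes: bytes) -> list[str]:
    """Return a list of layout problems (empty if the archive looks fine)."""
    problems = []
    names = set()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                # Directories, links, devices, etc. are not allowed.
                problems.append(f"not a regular file: {member.name}")
            elif "/" in member.name:
                # A path separator means the file is not on the top level.
                problems.append(f"not on the top level: {member.name}")
            else:
                names.add(member.name)
    for missing in sorted(MANDATORY - names):
        problems.append(f"missing mandatory file: {missing}")
    return problems
```

The function takes the archive as bytes so it can be used on an in-memory copy; for a file on disk, pass the result of reading it in binary mode.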

All other files are resources used by the page (linked by it directly or indirectly). metainfo.json describes the contents of the local page copy. It must contain a JSON object where each key is the name of a file in the tar archive that holds the contents of a resource. The value for each key is a JSON object describing the resource's metadata. It should have the following 3 fields:

  • url - full URL of the resource
  • type - the type of the resource as returned by Puppeteer's resourceType() method
  • headers - header map. A header map is a JSON object where keys are header names and each value is an array of header values (so that multiple headers with the same name can be expressed).

An entry for index.html must be present in metainfo.json. Therefore, the URL of the page from which the tar copy was made can be obtained from the url field of the index.html entry in metainfo.json.
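To make the structure concrete, here is a hypothetical metainfo.json together with a small check of the rules described above, written as a Python sketch. The URLs, file names and header values are made up for illustration; the helper names validate_metainfo and page_url are assumptions as well. The type values "document" and "stylesheet" are among those returned by Puppeteer's resourceType().

```python
import json

# Hypothetical metainfo.json for a page with a single stylesheet.
# All URLs, file names and header values are invented for this example.
EXAMPLE_METAINFO = """
{
  "index.html": {
    "url": "https://example.com/",
    "type": "document",
    "headers": {"content-type": ["text/html; charset=utf-8"]}
  },
  "style.css": {
    "url": "https://example.com/static/style.css",
    "type": "stylesheet",
    "headers": {
      "content-type": ["text/css"],
      "set-cookie": ["a=1", "b=2"]
    }
  }
}
"""

REQUIRED_FIELDS = ("url", "type", "headers")


def validate_metainfo(metainfo) -> list[str]:
    """Return a list of problems found in a parsed metainfo.json."""
    if not isinstance(metainfo, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    if "index.html" not in metainfo:
        problems.append("no entry for index.html")
    for name, entry in metainfo.items():
        if not isinstance(entry, dict):
            problems.append(f"{name}: entry is not a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{name}: missing field {field!r}")
        headers = entry.get("headers", {})
        # Each header name must map to an array, even for a single value.
        if not isinstance(headers, dict) or not all(
            isinstance(values, list) for values in headers.values()
        ):
            problems.append(f"{name}: headers must map header names to arrays")
    return problems


def page_url(metainfo: dict) -> str:
    """URL of the page the copy was made from (url of the index.html entry)."""
    return metainfo["index.html"]["url"]
```

With the example above, page_url(json.loads(EXAMPLE_METAINFO)) yields "https://example.com/", and the repeated set-cookie header shows why header values are arrays.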

An example of the page tar can be found here: sample.tar.
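When inspecting an archive such as the sample, it can help to cross-check the keys of metainfo.json against the files actually present. The sketch below does this with the standard library only; the function name unmatched_entries is an assumption. Treating files that have no metainfo.json key as worth reporting is also an assumption, since the format description only states that every key names a file.

```python
import io
import json
import tarfile


def unmatched_entries(tar_bytes: bytes) -> tuple[list[str], list[str]]:
    """Return (keys without a matching file, files without a matching key)."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        files = {m.name for m in tar.getmembers() if m.isfile()}
        # extractfile raises KeyError if metainfo.json is absent.
        raw = tar.extractfile("metainfo.json")
        if raw is None:
            raise ValueError("metainfo.json is not a regular file")
        metainfo = json.load(raw)
    keys = set(metainfo)
    # metainfo.json describes the other files, so it is excluded here.
    files_without_key = files - keys - {"metainfo.json"}
    return sorted(keys - files), sorted(files_without_key)
```

Both returned lists are empty for an archive in which every key names an existing file and every resource file is described.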

An example of a program that creates page tar copies can be found here.

To test page tar copies, this script can be used - it opens the copied page in Google Chrome.