Page tar archive format
The archive should contain only files (on the top level), no directories. The following files are mandatory:
metainfo.json - a file with metadata
index.html - HTML markup of the page itself
All other files are resources used by the page (linked by it directly or indirectly).
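The layout rules above can be checked mechanically. The following is a minimal sketch in Python that uses only the standard library `tarfile` module. The function name `check_layout` and the wording of the messages are hypothetical and not part of any existing tool. It also assumes member names are stored without a leading `./` prefix.

```python
import tarfile

# The two files every page tar must contain, per the rules above.
MANDATORY = {"metainfo.json", "index.html"}


def check_layout(tar: tarfile.TarFile) -> list[str]:
    """Return a list of layout problems; an empty list means the layout is valid."""
    problems = []
    names = set()
    for member in tar.getmembers():
        if not member.isfile():
            # Directories, symlinks, etc. are not allowed.
            problems.append(f"not a regular file: {member.name}")
        elif "/" in member.name:
            # A path separator means the file is nested in a directory.
            problems.append(f"not on the top level: {member.name}")
        else:
            names.add(member.name)
    for missing in sorted(MANDATORY - names):
        problems.append(f"missing mandatory file: {missing}")
    return problems
```

A caller would open the archive with `tarfile.open(path)` and pass the resulting object to `check_layout`.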
metainfo.json describes the contents of the local page copy. It must contain
a JSON object in which each key is the name of a file in the tar archive that holds
the contents of a resource. The value for each key is a JSON object
describing the resource metadata. It should have the following 3 fields:
url - full URL of the resource
type - the type of the resource as returned by Puppeteer's resourceType() method
headers - header map. A header map is a JSON object where keys are header names and each value is an array of header values (which allows multiple headers with the same name to be expressed).
An entry for index.html must be present in metainfo.json. Therefore, the URL of the page from which the tar copy was made can be obtained as the url field of the entry for index.html in metainfo.json.
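To make the structure concrete, here is a minimal sketch in Python of a metainfo.json document and of how it might be validated. The file name `style.css`, the URLs, the header values, and the function names `check_metainfo` and `page_url` are all hypothetical illustrations; `document` and `stylesheet` are two of the values Puppeteer's resourceType() can return.

```python
import json

REQUIRED_FIELDS = ("url", "type", "headers")

# Hypothetical metainfo.json contents for a page with a single stylesheet.
EXAMPLE = json.loads("""
{
  "index.html": {
    "url": "https://example.com/",
    "type": "document",
    "headers": {"content-type": ["text/html; charset=utf-8"]}
  },
  "style.css": {
    "url": "https://example.com/style.css",
    "type": "stylesheet",
    "headers": {"content-type": ["text/css"], "set-cookie": ["a=1", "b=2"]}
  }
}
""")


def check_metainfo(metainfo: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks valid."""
    problems = []
    if "index.html" not in metainfo:
        problems.append("no entry for index.html")
    for file_name, meta in metainfo.items():
        if not isinstance(meta, dict):
            problems.append(f"{file_name}: metadata is not an object")
            continue
        for field in REQUIRED_FIELDS:
            if field not in meta:
                problems.append(f"{file_name}: missing field {field}")
        headers = meta.get("headers", {})
        if not isinstance(headers, dict):
            problems.append(f"{file_name}: headers is not an object")
            continue
        # Each header value must be an array, so repeated headers can be expressed.
        for header, values in headers.items():
            if not isinstance(values, list):
                problems.append(f"{file_name}: header {header} is not an array")
    return problems


def page_url(metainfo: dict) -> str:
    """URL of the copied page: the url field of the index.html entry."""
    return metainfo["index.html"]["url"]
```

Note how the `set-cookie` entry uses a two-element array to represent two headers with the same name.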
An example of the page tar can be found here: sample.tar.
An example of a program that creates page tar copies can be found here.
To test page tar copies this script can be used - it opens the copied page in Google Chrome.