diff --git a/README.md b/README.md index 5387df9..580ab0c 100644 --- a/README.md +++ b/README.md @@ -6,9 +6,11 @@ This repository contains technical specifications used by the [Webrecorder] proj * [Web Archive Collection Zipped (WACZ)]: a packaging standard for web archives on the web * [WACZ Signing and Verification]: the mechanics for signing and verifying WACZ files for proof of authenticity * [Crawl Index JSON (CDXJ)]: an extensible format for WARC index files +* [MultiWACZ]: logically aggregate WACZ files for integrated browsing and distribution [Webrecorder]: https://webrecorder.net [Web Archive Collection Zipped (WACZ)]: https://specs.webrecorder.net/wacz/latest/ [Use Cases for Decentralized Web Archives]: https://specs.webrecorder.net/use-cases/latest/ [WACZ Signing and Verification]: https://specs.webrecorder.net/wacz-auth/latest/ [Crawl Index JSON (CDXJ)]: https://specs.webrecorder.net/cdxj/latest/ +[MultiWACZ]: https://specs.webrecorder.net/multi-wacz/latest/ diff --git a/index.html b/index.html index 3d96322..df962f0 100644 --- a/index.html +++ b/index.html @@ -57,8 +57,10 @@ * [Use Cases for Decentralized Web Archives](use-cases/latest/): a summary of requirements and potential threat models for distributed web archives * [Web Archive Collection Zipped (WACZ)](wacz/latest/): a packaging standard for web archives on the web * [WACZ Signing and Verification](wacz-auth/latest/): the mechanics for signing and verifying WACZ files for proof of authenticity +* [MultiWACZ](multi-wacz/latest/): logically aggregate WACZ files for integrated browsing and distribution * [Crawl Index JSON (CDXJ)](cdxj/latest/): an extensible format for WARC index files +
diff --git a/multi-wacz/0.1.0/index.html b/multi-wacz/0.1.0/index.html new file mode 100644 index 0000000..35a5dd3 --- /dev/null +++ b/multi-wacz/0.1.0/index.html @@ -0,0 +1,226 @@ + + + + + MultiWACZ + + + + + + +
+
+ +
+ MultiWACZ provides a standard way of collecting [[WACZ]] files into a + conceptual unit. The aggregation is represented as a [[JSON]] manifest. +
+ +
+
+ +
+
+# Introduction
+
+MultiWACZ provides a standard [[JSON]] representation for
+*aggregations* of [[WACZ]] files. An aggregation of WACZ files might seem
+redundant at first, since WACZ is already an aggregation of [[WARC]] files,
+which represent a collection of archived web content. However, there are
+situations where aggregating WACZ files can be helpful, such as:
+
+1. A website is archived multiple times on a schedule, which results in a number
+of distinct WACZ files that need to be viewed together as a single archive.
+2. The size of a website combined with storage constraints requires that an
+archive be split or chunked into multiple WACZ files, which then need to be
+viewed together as a whole.
+3. A set of separately collected WACZ files needs to be grouped together as
+part of a thematic collection, while each file is still viewed separately.
+
+A MultiWACZ is a JSON object that lets you group WACZ files in these different
+ways so that replay tools can present them accordingly. The MultiWACZ may be
+represented as a simple file that aggregates remote WACZ files, or it may be
+packaged up as a ZIP file.
+
+
+
+# Manifest
+
+Similar to WACZ, MultiWACZ has a JSON *manifest* which groups together the
+individual WACZ files as a [[FRICTIONLESS-DATA-PACKAGE]]. The `path` of each
+resource in the `resources` list MUST be a fully qualified [[URL]] or a POSIX
+file path relative to the manifest file.
+
+The manifest SHOULD contain metadata as defined in [[WACZ]] which describes the
+aggregation as a whole, such as `title`, `description`, and `created`. Viewing
+applications can use this metadata to control how the aggregation will
+appear. Additional metadata properties MAY be included as long as they do not
+override existing property names in [[WACZ]] or [[FRICTIONLESS-DATA-PACKAGE]].
+
+## Minimal Example of a MultiWACZ Manifest
+
+This is an example of a minimal MultiWACZ manifest that aggregates two WACZ
+files that are available on the Web.
+
+
+{
+  "profile": "multi-wacz",
+  "title": "My WACZ Aggregation",
+  "description": "This web archive contains example data for the MultiWACZ specification",
+  "created": "2023-01-25T12:00:00.48Z",
+  "resources": [
+     {
+       "name": "Archive 1",
+       "path": "https://example.com/archive/archive1.wacz"
+     },
+     {
+       "name": "Archive 2",
+       "path": "https://example.com/archive/archive2.wacz"
+     }
+  ]
+}
+
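The resource path rule can be illustrated with a minimal Python sketch. It assumes a hypothetical helper, `resolve_resources`, and a hypothetical manifest location; neither is defined by the specification. It also assumes that a path starting with `/` does not count as "relative to the manifest file", which is one reading of the rule rather than something the text states.

```python
from urllib.parse import urljoin, urlsplit


def resolve_resources(manifest: dict, manifest_location: str) -> list[str]:
    """Return an absolute location for every resource in a MultiWACZ manifest.

    `manifest_location` is the URL the manifest itself was loaded from
    (a hypothetical parameter, not part of the specification).
    """
    resolved = []
    for resource in manifest.get("resources", []):
        path = resource["path"]
        if urlsplit(path).scheme:
            # Fully qualified URL: use it unchanged.
            resolved.append(path)
        elif path.startswith("/"):
            # Assumption: absolute POSIX paths are not "relative to the manifest".
            raise ValueError(f"resource path must be relative: {path!r}")
        else:
            # Relative POSIX path: resolve against the manifest's own location.
            resolved.append(urljoin(manifest_location, path))
    return resolved
```

A viewer could call `resolve_resources(manifest, "https://example.org/coll/multi-wacz.json")` before fetching any WACZ file, so that remote and relative resources are handled uniformly.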
+ +
+ +
+
+# Display
+
+By default, MultiWACZ objects are assumed to be `joined`, meaning that the WACZ
+contents are viewed as a single archive. This is useful in the use cases
+described above, when multiple snapshots of a single website are taken over
+time or a large archive is split up to ease storage.
+
+However, it is sometimes desirable for the aggregated WACZ files to be viewed
+individually, as is often the case in thematic collections that bring together
+archives of related websites. The manifest's `display` property can be set to
+`separate`, which instructs the viewing application to present the aggregated
+WACZ resources individually.
+
+## Example of Separate Display
+
+
+{
+  "profile": "multi-wacz",
+  "display": "separate",
+  "title": "My Thematic WACZ Aggregation",
+  "description": "This web archive contains websites related to the history of the Web",
+  "created": "2023-01-25T12:00:00.48Z",
+  "resources": [
+     {
+       "name": "The Birth of the Web (CERN)",
+       "path": "https://example.com/archive/archive1.wacz"
+     },
+     {
+       "name": "A short history of the Web (CERN)",
+       "path": "https://example.com/archive/archive2.wacz"
+     },
+     {
+       "name": "www-talk@w3.org Mail Archives",
+       "path": "https://example.com/archive/archive3.wacz"
+     }
+  ]
+}
+
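As a rough illustration of how a viewing application might act on the `display` property, the Python sketch below groups resource names into views. The function names `display_mode` and `plan_views` are hypothetical, and rejecting values other than `joined` and `separate` is an assumption; the specification only describes those two behaviours.

```python
def display_mode(manifest: dict) -> str:
    """Return the display mode of a MultiWACZ manifest, defaulting to 'joined'."""
    mode = manifest.get("display", "joined")
    if mode not in ("joined", "separate"):
        # Assumption: unknown values are treated as an error.
        raise ValueError(f"unsupported display value: {mode!r}")
    return mode


def plan_views(manifest: dict) -> list[list[str]]:
    """Group resource names into the views a replay tool would present."""
    names = [resource["name"] for resource in manifest["resources"]]
    if display_mode(manifest) == "separate":
        # One view per WACZ file.
        return [[name] for name in names]
    # A single view that merges every WACZ file.
    return [names]
```

With the separate display example, `plan_views` yields three single-item views; with the minimal manifest, which has no `display` property, it yields one view containing both archives.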
+ +
+ +
+
+# ZIP Packaging
+
+In cases where it is desirable to package up the MultiWACZ and associated WACZ
+files into a single file, they can be combined into a [[ZIP]] file. As with
+[[WACZ]] files, the `.wacz` file extension SHOULD be used, and the resulting
+files should be made available on the Web using the `application/wacz` media
+type.
+
+

+Include information here about compression. Should it be similar to WACZ? +

+ +
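One possible packaging step is sketched below in Python. It rests on two assumptions that the specification does not settle: the manifest is stored as `datapackage.json` (borrowed from WACZ), and entries are stored without compression, since the compression question is still open. The function name `package_multiwacz` is hypothetical, and the sketch expects every resource `path` to be relative to `base_dir`.

```python
import json
import zipfile
from pathlib import Path


def package_multiwacz(manifest: dict, base_dir: Path, out_path: Path) -> None:
    """Write a manifest and the WACZ files it references into one ZIP file."""
    # Assumption: store entries uncompressed; WACZ files are ZIP files already.
    with zipfile.ZipFile(out_path, "w", compression=zipfile.ZIP_STORED) as zf:
        # Assumption: the manifest file name follows the WACZ convention.
        zf.writestr("datapackage.json", json.dumps(manifest, indent=2))
        for resource in manifest["resources"]:
            rel_path = resource["path"]
            # Keep the same relative path inside the ZIP so the manifest stays valid.
            zf.write(base_dir / rel_path, arcname=rel_path)
```

Keeping the archive paths identical to the manifest paths means a reader can resolve each resource relative to the manifest without any rewriting.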
+ + + + diff --git a/multi-wacz/latest/index.html b/multi-wacz/latest/index.html new file mode 100644 index 0000000..59ec579 --- /dev/null +++ b/multi-wacz/latest/index.html @@ -0,0 +1,5 @@ + + +Redirecting to https://specs.webrecorder.net/wacz-agg/0.1.0/ + +