MultiWACZ #135

Draft
wants to merge 1 commit into
base: main
2 changes: 2 additions & 0 deletions README.md
@@ -6,9 +6,11 @@ This repository contains technical specifications used by the [Webrecorder] proj
* [Web Archive Collection Zipped (WACZ)]: a packaging standard for web archives on the web
* [WACZ Signing and Verification]: the mechanics for signing and verifying WACZ files for proof of authenticity
* [Crawl Index JSON (CDXJ)]: an extensible format for WARC index files
* [MultiWACZ]: logically aggregate WACZ files for integrated browsing and distribution

[Webrecorder]: https://webrecorder.net
[Web Archive Collection Zipped (WACZ)]: https://specs.webrecorder.net/wacz/latest/
[Use Cases for Decentralized Web Archives]: https://specs.webrecorder.net/use-cases/latest/
[WACZ Signing and Verification]: https://specs.webrecorder.net/wacz-auth/latest/
[Crawl Index JSON (CDXJ)]: https://specs.webrecorder.net/cdxj/latest/
[MultiWACZ]: https://specs.webrecorder.net/multi-wacz/latest/
2 changes: 2 additions & 0 deletions index.html
@@ -57,8 +57,10 @@
* [Use Cases for Decentralized Web Archives](use-cases/latest/): a summary of requirements and potential threat models for distributed web archives
* [Web Archive Collection Zipped (WACZ)](wacz/latest/): a packaging standard for web archives on the web
* [WACZ Signing and Verification](wacz-auth/latest/): the mechanics for signing and verifying WACZ files for proof of authenticity
* [MultiWACZ](multi-wacz/latest/): logically aggregate WACZ files for integrated browsing and distribution
* [Crawl Index JSON (CDXJ)](cdxj/latest/): an extensible format for WARC index files


</section>
<section data-format="markdown">

226 changes: 226 additions & 0 deletions multi-wacz/0.1.0/index.html
@@ -0,0 +1,226 @@
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>MultiWACZ</title>
<script src="../../assets/js/respec-webrecorder.js" class="remove" defer ></script>
<script class="remove">
var respecConfig = {
specStatus: "DRAFT",
publishDate: "2023-01-25",
license: "cc-by",
thisVersion: "https://specs.webrecorder.net/multi-wacz/0.1.0/",
latestVersion: "https://specs.webrecorder.net/multi-wacz/latest/",
shortName: "multi-wacz",
group: "WACZ",
lint: {
// turn off w3c-specific linting
"privsec-section": false,
"no-http-props": false,
"no-headingless-sections": false
},
includePermalinks: true,
authors: [],
editors: [
{
name: "Ilya Kreymer",
url: "https://github.com/ikreymer",
company: "Webrecorder",
companyURL: "https://webrecorder.net/"
},
{
name: "Ed Summers",
url: "https://www.linkedin.com/in/esummers/",
company: "Stanford University",
companyURL: "https://stanford.edu"
}
],
group: {
name: "WACZ Editors",
url: "https://webrecorder.net"
},
otherLinks: [
{
key: "Additional Documents",
data: [
{
value: "Use Cases for Decentralized Web Archives",
href: "https://specs.webrecorder.net/use-cases/latest/",
}
]
},
{
key: "Repository",
data: [
{
value: "Github",
href: "https://github.com/webrecorder/specs"
},
{
value: "Issues",
href: "https://github.com/webrecorder/specs/issues"
},
{
value: "Commits",
href: "https://github.com/webrecorder/specs/commits"
}
]
}
],
maxTocLevel: 3,
logos: [
{
src: "../../assets/images/webrecorder.svg",
alt: "Webrecorder Logo",
height: 100
}
],
localBiblio: {
"PYWB-CDXJ": {
title: "pywb Indexing: CDXJ Format",
publisher: "Webrecorder",
href: "https://pywb.readthedocs.io/en/latest/manual/indexing.html#cdxj-index",
}
},
};
</script>
</head>

<body>

<section id="sotd" class="introductory">
</section>

<section id="abstract">
MultiWACZ provides a standard way of collecting [[WACZ]] files into a
conceptual unit. The aggregation is represented as a [[JSON]] manifest.
</section>

<section id="conformance">
</section>

<section data-format="markdown">

# Introduction

MultiWACZ provides a standard [[JSON]] representation for
*aggregations* of [[WACZ]] files. An aggregation of WACZ files might seem
redundant at first, since WACZ is already an aggregation of [[WARC]] files,
which represent a collection of archived web content. However, there are
situations where aggregating WACZ files can be helpful, such as:

1. A website is archived multiple times on a schedule, which results in a number
of distinct WACZ files that need to be viewed together as a single archive.
2. The size of a website combined with storage constraints requires that an
archive be split or chunked into multiple WACZ files, which then need to be
viewed together as a whole.
3. A set of separately collected WACZ files needs to be grouped into a thematic
collection whose members are still viewed individually.

A MultiWACZ is a JSON object that lets you group WACZ files in these different
ways so that replay tools can present them accordingly. The MultiWACZ may be
represented as a standalone file that aggregates remote WACZ files, or it may be
packaged together with its WACZ files as a ZIP file.
</section>
<section data-format="markdown">

# Manifest

Similar to WACZ, MultiWACZ has a JSON *manifest* which groups together the
individual WACZ files as a [[FRICTIONLESS-DATA-PACKAGE]]. The `path` of each
resource in the `resources` list MUST be a fully qualified [[URL]] or a POSIX
file path relative to the manifest file.

The manifest SHOULD contain metadata as defined in [[WACZ]] which describes the
aggregation as a whole, such as `title`, `description`, and `created`. This
metadata is useful for viewing applications to control how the aggregation will
appear. Additional metadata properties MAY be included as long as they do not
override existing property names in [[WACZ]] or [[FRICTIONLESS-DATA-PACKAGE]].

## Minimal Example of a MultiWACZ Manifest

This is an example of a minimal MultiWACZ manifest that aggregates two WACZ
files that are available on the Web.

<pre class="example">
{
  "profile": "multi-wacz",
  "title": "My WACZ Aggregation",
  "description": "This web archive contains example data for the MultiWACZ specification",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
    {
      "name": "Archive 1",
      "path": "https://example.com/archive/archive1.wacz"
    },
    {
      "name": "Archive 2",
      "path": "https://example.com/archive/archive2.wacz"
    }
  ]
}
</pre>
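
The following non-normative sketch shows one way a viewing application might turn the `resources` list into absolute locations. It assumes the manifest itself was fetched from a known URL; the function name `resolveResources`, the manifest URL, and the relative path `chunks/archive2.wacz` are illustrative and are not defined by this specification.

```javascript
// Sketch only: resolveResources and its error messages are hypothetical,
// not part of the MultiWACZ draft.
function resolveResources(manifest, manifestUrl) {
  if (!Array.isArray(manifest.resources)) {
    throw new TypeError("manifest.resources must be an array");
  }
  return manifest.resources.map((resource) => {
    if (typeof resource.path !== "string" || resource.path === "") {
      throw new TypeError("each resource needs a non-empty string path");
    }
    // new URL(path, base) ignores the base when `path` is already a fully
    // qualified URL, and otherwise resolves it relative to the manifest.
    return {
      name: resource.name,
      url: new URL(resource.path, manifestUrl).href,
    };
  });
}

// Example manifest mixing a fully qualified URL and a relative path.
const exampleManifest = {
  profile: "multi-wacz",
  resources: [
    { name: "Archive 1", path: "https://example.com/archive/archive1.wacz" },
    { name: "Archive 2", path: "chunks/archive2.wacz" },
  ],
};

// Assumed location of the manifest file itself.
const resolved = resolveResources(
  exampleManifest,
  "https://example.org/collection/multi-wacz.json"
);
console.log(resolved.map((r) => r.url));
```

Resolving against the manifest location keeps a manifest portable: the same file works whether its WACZ files sit beside it or are hosted elsewhere.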

</section>

<section data-format="markdown">

# Display

By default, a MultiWACZ is assumed to be `joined`: the contents of its WACZ
files are viewed as a single archive. This suits the use cases described above
in which multiple snapshots of a single website are taken over time, or in which
an archive is split into several WACZ files to ease storage.

However, it is sometimes desirable for the aggregated WACZ files to be viewed
individually, as is often the case in thematic collections that bring together
archives of related websites. The manifest's `display` property can be set to
`separate`, which instructs the viewing application to present the aggregated
WACZ resources individually.

## Example of Separate Display

<pre class="example">
{
  "profile": "multi-wacz",
  "display": "separate",
  "title": "My Thematic WACZ Aggregation",
  "description": "This web archive contains websites related to the history of the Web",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
    {
      "name": "The Birth of the Web (CERN)",
      "path": "https://example.com/archive/archive1.wacz"
    },
    {
      "name": "A short history of the Web (CERN)",
      "path": "https://example.com/archive/archive2.wacz"
    },
    {
      "name": "[email protected] Mail Archives",
      "path": "https://example.com/archive/archive3.wacz"
    }
  ]
}
</pre>
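
As a non-normative illustration, a viewing application might interpret `display` roughly as follows. The helper names `displayMode` and `planViews` are hypothetical. The sketch also assumes that `joined` may be given explicitly and that any other value is rejected; this draft does not say how an unrecognised value should be handled.

```javascript
// Sketch only: displayMode and planViews are hypothetical helper names.
// Assumption: "joined" may be stated explicitly, and unknown values are
// rejected; the draft does not specify either behaviour.
function displayMode(manifest) {
  const mode = manifest.display === undefined ? "joined" : manifest.display;
  if (mode !== "joined" && mode !== "separate") {
    throw new RangeError(`unsupported display value: ${mode}`);
  }
  return mode;
}

// Returns the list of views a replay tool would offer: one combined view
// for a joined aggregation, or one view per WACZ for a separate one.
function planViews(manifest) {
  if (displayMode(manifest) === "separate") {
    return manifest.resources.map((r) => ({ title: r.name, paths: [r.path] }));
  }
  return [
    { title: manifest.title, paths: manifest.resources.map((r) => r.path) },
  ];
}

const thematic = {
  profile: "multi-wacz",
  display: "separate",
  title: "My Thematic WACZ Aggregation",
  resources: [
    { name: "First archive", path: "https://example.com/archive/archive1.wacz" },
    { name: "Second archive", path: "https://example.com/archive/archive2.wacz" },
  ],
};

console.log(planViews(thematic));
```

Treating a missing `display` property as `joined` mirrors the default described above, so manifests that omit the property keep working unchanged.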

</section>

<section data-format="markdown">

# ZIP Packaging

When it is desirable to distribute the MultiWACZ manifest and its associated
WACZ files as a single file, they can be combined into a [[ZIP]] file. As with
[[WACZ]] files, the `.wacz` file extension SHOULD be used, and the ZIP file
should be made available on the Web using the `application/wacz` media type.

<p class="ednote">
Include information here about compression. Should it be similar to WACZ?
</p>

</section>

</body>

</html>
5 changes: 5 additions & 0 deletions multi-wacz/latest/index.html
@@ -0,0 +1,5 @@
<!DOCTYPE html>
<meta charset="utf-8">
<title>Redirecting to https://specs.webrecorder.net/multi-wacz/0.1.0/</title>
<meta http-equiv="refresh" content="0; URL=https://specs.webrecorder.net/multi-wacz/0.1.0/">
<link rel="canonical" href="https://specs.webrecorder.net/multi-wacz/0.1.0/">