WACZ Aggregation / Multi WACZ Specification #112

edsu · 2022-03-07T10:03:20Z

Details about how to aggregate multiple WACZ files into a single WACZ need to be added to the specification. This hinges on resources in the datapackage.json using a url for a WACZ rather than a path. See the Resource Information section in the Data Package specification for details:

{
   "resources": [
      {"hash": "...", "url": "https://example.com/filename_1.wacz", "bytes": "..."},
      {"hash": "...", "url": "https://example.com/filename_2.wacz", "bytes": "..."}
   ]
   ...
}

There should also be a Data Package profile so that clients can easily distinguish between collections and regular WACZ files. Perhaps WACZ-Aggregation?

The specification should document that WACZ users MAY want to use the datapackage.json as a place to record additional metadata about crawls. See the browsertrix-cloud API for examples.

ikreymer · 2022-03-08T23:43:22Z

I wonder if this should be a separate specification, to avoid confusion. It is no longer a WACZ, which is a specific format, but a collection of WACZ files. It would probably be just a .json file (the datapackage.json) to start, which is a different specification, though it of course depends on WACZ.

The collection spec is sort of independent from the file format spec imo.

edsu · 2022-03-09T02:49:44Z

I don't think it should be separate. Unless I am misunderstanding (which is very possible) an understanding of a WACZ Aggregation without an understanding of WACZ would be pretty much useless. Why make it more difficult to manage as two separate specifications?

ikreymer · 2022-03-09T03:19:10Z

> I don't think it should be separate. Unless I am misunderstanding (which is very possible) an understanding of a WACZ Collection without an understanding of WACZ would be pretty much useless. Why make it more difficult to manage as two separate specifications?

Well, it's really a completely different format, an aggregate format of wacz (and possibly other types) to form a collection.
The only overlap is that I'm currently thinking of this as a frictionless data package too (though maybe that doesn't make sense given that it uses a url instead of an internal path).

Maybe it should really be called a 'web archive collection', which consists of collection-level data.

For example, maybe a collection could have a mix of wacz and regular warc files, which could both be listed in the resources section. As conceived now, it's just a single json file, though that could also change.

ikreymer · 2022-03-09T03:20:17Z

Another aspect that the 'WACZ aggregate' could have is a page-to-wacz mapping, along with resources, for example:

{
  "resources": [{
    "name": "example-com-crawl",
    "url": "https://store.example.org/example-com-crawl.wacz",
    "hash": ...,
    "bytes": ...
  }, {
    "name": "another-site-com-crawl",
    "url": "https://store.example.org/another-site-com-crawl.wacz",
    "hash": ...,
    "bytes": ...
  }],

  "pages": {
    "https://example.com/": {"filename": "example-com-crawl"},
    "https://another-site.example.com": {"filename": "another-site-com-crawl"}
  }
}

This would help route the page to the correct wacz file. Otherwise, a client would need to search through all of them.
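A minimal sketch of the routing described above, assuming "pages" is a JSON object keyed by page URL and that each "filename" value matches the "name" of a resource. The function name and the sample manifest are illustrative only and are not part of any spec.

```python
import json

# Hypothetical aggregation manifest; "pages" is assumed to be an object keyed by page URL.
MANIFEST = json.loads("""
{
  "resources": [
    {"name": "example-com-crawl",
     "url": "https://store.example.org/example-com-crawl.wacz"},
    {"name": "another-site-com-crawl",
     "url": "https://store.example.org/another-site-com-crawl.wacz"}
  ],
  "pages": {
    "https://example.com/": {"filename": "example-com-crawl"},
    "https://another-site.example.com": {"filename": "another-site-com-crawl"}
  }
}
""")


def wacz_url_for_page(manifest, page_url):
    """Return the URL of the WACZ that holds page_url, or None if it is unmapped."""
    entry = manifest.get("pages", {}).get(page_url)
    if entry is None:
        # No mapping: the caller would have to fall back to searching every WACZ.
        return None
    by_name = {res["name"]: res["url"] for res in manifest.get("resources", [])}
    return by_name.get(entry["filename"])
```

With a mapping like this a viewer only needs to open one WACZ per page lookup; pages missing from the mapping still need the slower search.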

edsu · 2022-03-09T03:30:35Z

I like WACZ Aggregation better than WACZ Collection, and have just updated some of the text above.

edsu · 2022-03-09T03:37:05Z

I guess my point is that if someone is building a WACZ client, as we want people to do, they will want to be clear about what their viewer needs to do. I think the easiest way to communicate that is with a single specification about what WACZ support means. If implementors need to digest multiple specifications I think we will risk losing them.

DiegoPino · 2022-03-09T13:23:06Z

Hi. Maybe something like a IIIF collection manifest (so just a json or json-ld) might make sense here?

-- Diego Pino Navarro Digital Repositories Developer Metropolitan New York Library Council (METRO)

ikreymer · 2022-03-15T03:59:01Z

> Hi. Maybe something like a IIIF collection manifest (so just a json or json-ld) might make sense here?

Hm, that's an interesting idea.. yeah, it's basically a collection manifest for multiple WACZs (and possibly WARCs), maybe that's the best way to look at it..

edsu · 2022-03-16T10:42:00Z

That's a useful comparison @DiegoPino!

It might be helpful to think about the WACZ specification as the equivalent of the Image API and this Aggregated WACZ view as the equivalent of the Presentation API. I could even imagine expressing the Aggregated WACZ as a IIIF Manifest since the IIIF Presentation API is oriented around the abstract idea of Canvases that can include images, video, audio. Why not a web archive too? But I fear that this might be a bit of a leaky abstraction, because the sequencing of WACZs doesn't make a lot of sense in this context. Also, the WACZ itself contains a lot of presentation metadata.

While the separation between Image and Presentation APIs in IIIF allows it to be more general I also think it was because they were developed separately in time. As someone who had to implement support for them at one point I can say that understanding and tracking them as separate specifications sometimes proved challenging. But I suspect other people may feel differently about that.

I think what you've identified here is that there's a bit of slippage between the WACZ media type (the ZIP file on the web) and WACZ as an API, which we see developing in tools like Browsertrix Cloud and which relates to other work like WASAPI. I also think this crops up in the Unzipped WACZ use case...

edsu · 2022-03-16T11:14:08Z

I noticed an inconsistency in the Frictionless DataPackage specification around URLs in Resources.

The Data Resource specification says:

> path MUST be a string – or an array of strings (see “Data in Multiple Files”). Each string MUST be a “url-or-path” as defined in the next section.

whereas another Data Package specification distinguishes between url and path. I think we should use the former (url-or-path) since it maps to implementations we have already, and the url usage seems to be from an earlier version of the DataPackage specification.
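As a rough illustration of the url-or-path approach, a reader could prefer "path" and fall back to the older "url" key, then decide whether the value is remote. This is only a sketch: the helper name is made up, and treating only http and https as remote is an assumption here.

```python
from urllib.parse import urlparse


def resource_location(resource):
    """Return (location, is_remote) for a resource entry.

    Prefers "path" (url-or-path) and falls back to the older "url" key.
    """
    location = resource.get("path") or resource.get("url")
    if not isinstance(location, str):
        raise ValueError("resource has neither a string 'path' nor 'url'")
    # Assumption: only http(s) values count as remote; anything else is a path.
    is_remote = urlparse(location).scheme in ("http", "https")
    return location, is_remote
```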

This commit adds some minimal information about aggregations. It is likely we will need to revisit this section as implementations start to use it. I think it will also relate to the TBD section in the spec on "unzipped" WACZ #96. Closes #112

ikreymer · 2022-03-17T20:02:05Z

Yes, that may be a good comparison. Thinking about it more, I think aggregation should really be a separate spec, but in this repo for now, because:

- The aggregation spec is still in development, and should be versioned separately.
- Tools can work with single WACZ files without needing to support aggregation, or vice versa.
- Different properties: a WACZ file is designed to be fixed, while aggregations will be updated to include more WACZ files.

I think it'll be really confusing to combine the more experimental aggregation format into the core format specification, which is already in use. I think aggregations should be a separate file, similar to use-cases, and probably have their own version for now - I imagine it'll need a bit more iteration before it's put into use.

ikreymer · 2022-03-17T20:12:08Z

Not to expand the immediate scope, but maybe this really needs its own repo, wacz-aggregation-spec.
I imagine there'll be a host of use cases:

- aggregation for wacz from a single crawl
- aggregation for wacz across multiple crawls / collections of crawls
- nested aggregation, eg. a collection aggregation that points to a list of crawl aggregations
- immutable vs mutable, so can put an aggregation on IPFS and reference from other aggregations

We certainly don't have to figure this out now, just have a plan for how to tackle it in the future!

edsu · 2022-03-18T10:28:24Z

If aggregation isn't core to using WACZ then I agree it should be a separate specification, or perhaps just be in the description of how a specific tool works, and not a standard at all.

The purpose of positioning WACZ as a standard is to help others who are creating tools that use WACZ (crawlers, viewers, indexers, aggregators, etc). The specification should document the minimal requirements that allow developers to do that work, and also provide flexibility for them to add things that they need.

If it is to be a standard it's important that the WACZ specification not turn into a detailed description of how one specialized implementation works. In order to encourage adoption it's also important not to make understanding WACZ too difficult with multiple interlocking specifications. Having one, concise description that covers the core use case of viewing a WACZ would help address these concerns.

I just wanted to go on record with that recommendation. I personally don't think a changing WACZ specification is a concern at this stage since it isn't a standard yet, and we actually want it to change!

edsu · 2022-03-31T19:50:37Z

After some conversation, @ikreymer decided that Multi-WACZ (or WACZ Aggregations) is best served by a separate specification. This issue can stay open until that new repo and specification exist.

DiegoPino · 2022-03-31T19:54:18Z

I agree with you both, @edsu and @ikreymer. You also want to keep WACZ close to the original Frictionless Data Package spec (for so many reasons).

ikreymer · 2022-09-09T01:51:29Z

Wanted to come back to this issue, now that we have all the specs in this issue, and list a new use case: multiple WACZ files grouped together, to be viewed as part of a static site via web-replay-gen, but not necessarily merged together.

I think we really need a JSON schema / format that covers several use cases of grouping and declaring WACZ files and collections.

We now have at least the following use cases:

1. Multiple WACZ files that need to be 'merged' for replay, eg. example.com-part-1.wacz and example.com-part-2.wacz. The above suggestions with a Frictionless Data Package apply to this use case, where the JSON manifest is loaded by replayweb.page. The files should behave the same as if they were one file, but are split for various reasons (eg. each file may be too big, or they were simply the output of multiple parallel crawlers). The two files have the same user metadata as well, eg. some pages of example.com are in one, and some are in the other.
2. Multiple WACZ files that are each self-contained, and can be loaded in replayweb.page individually, and may be presented with separate metadata, and with a distinct viewer page for each one, similar to the examples in https://github.com/webrecorder/example-webarchive. Ideally, the schema for such files can be used to auto-generate a static site using web-replay-gen, where each WACZ represents a distinct 'object'. An existing example (created using a custom schema format) is: https://sup.webrecorder.net/
3. Combination of 1 and 2, where some WACZ files are part of the same object, and some are distinct, eg.

- big-site.example.com crawl
    - big-site.example-1.wacz
    - big-site.example-2.wacz
    
 - another-site.example.com crawl

The schema should probably support collection level metadata (title, description, etc...) as well as an optional list of pages?

Also, it may be useful to be able to declare a list of WACZ files via some path prefix, such as s3://bucket/some/path/, eg.

- big-site.example.com crawl
    - s3://bucket/big-site/crawl/

which might tell the tool reading the file to get all the WACZ files in the path prefix. (Will probably want to support different URL schemes for this, including http, s3, local, ipfs, etc...)
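One way a tool might handle prefixes is a small registry that maps a URL scheme to a listing function. The sketch below only implements local directories; the registry, its keys, and the function names are hypothetical, and listers for s3, http, or ipfs would need their own client code.

```python
from pathlib import Path
from urllib.parse import urlparse


def list_local(prefix):
    """List the .wacz files directly under a local directory prefix."""
    return sorted(str(p) for p in Path(prefix).glob("*.wacz"))


# Hypothetical registry: URL scheme -> function listing WACZ files under a prefix.
LISTERS = {
    "": list_local,
    "file": lambda prefix: list_local(urlparse(prefix).path),
}


def expand_prefix(prefix):
    """Expand a path prefix into a list of WACZ locations."""
    scheme = urlparse(prefix).scheme
    lister = LISTERS.get(scheme)
    if lister is None:
        # s3, http, ipfs, ... are not covered by this sketch.
        raise NotImplementedError(f"no lister registered for scheme {scheme!r}")
    return lister(prefix)
```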

It may make sense to split this issue into multiple ones as well, but wanted to jot this down here for now :)

Adds a new specification MultiWACZ for describing aggregations of WACZ files. Closes #112

matteocargnelutti · 2023-04-27T17:29:48Z

Hello 👋 !

@ikreymer brought this draft spec to my attention, because it touches on some of the threads I am currently pulling, compiling large collections using a single WACZ file.

I think this draft is excellent, @edsu!

I have a few questions/suggestions, in no particular order.

Properties of `resources` entries

Example currently in the draft:

{
  "profile": "multi-wacz",
  "title": "My WACZ Aggregation",
  "description": "This web archive contains example data for the MultiWACZ specification",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
     {
       "name": "Archive 1",
       "path": "https://example.com/archive/archive1.wacz"
     },
     {
       "name": "Archive 2",
       "path": "https://example.com/archive/archive2.wacz"
     }
  ]
}

`hash` and `bytes` properties:

I think the hash and bytes properties should be present, in order for the reader to be able to verify that the WACZ files it's reading are indeed the ones that were originally referenced when creating this multi-wacz manifest.

This might be especially important when pulling remote files, which might have changed in the meantime.

These fields may be optional, but I think the spec needs to make recommendations as to how the reader should behave based on the absence / presence of these fields, and what to do when a check fails.

For example, the spec could encourage showing a confirmation prompt or warning when a hash / byte length check fails.

Checking for remote hashes can sometimes be challenging, which the spec could account for in its recommendations.
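To make the idea concrete, here is a rough sketch of how a reader might check a WACZ it has already downloaded. It assumes the hash uses the "algorithm:hexdigest" form and that bytes is an integer; the function name and the wording of the problems are made up, and what to do on failure (warn, prompt, or bail) is left to the caller.

```python
import hashlib
import os


def verify_resource(local_path, resource):
    """Check a downloaded WACZ against the optional "bytes" and "hash" fields.

    Returns a list of problems; an empty list means every declared check passed.
    """
    problems = []

    expected_bytes = resource.get("bytes")
    if expected_bytes is not None and os.path.getsize(local_path) != expected_bytes:
        problems.append("size mismatch")

    expected_hash = resource.get("hash")
    if expected_hash is not None:
        # Assumption: hashes look like "sha256:<hex digest>".
        algorithm, _, expected_digest = expected_hash.partition(":")
        digest = hashlib.new(algorithm)
        with open(local_path, "rb") as fh:
            # Read in 1 MiB chunks so large WACZ files are not loaded into memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected_digest.lower():
            problems.append("hash mismatch")

    return problems
```

Fields that are absent are simply skipped, which matches the idea that both properties may be optional.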

`name` vs `title`:

The name properties listed here should maybe be title?
I think we might want to keep name for the filename portion of path.

Two reasons for that:

- The Data Package Spec mentions that name must only contain lowercase characters, which seems unsuitable for file "titles".
- I think the closer this datapackage.json format is to the one we use for single WACZs, the better.

Packaged vs remote, or hybrids?

Maybe the spec should clarify whether a given list of resources can feature both local and remote files.

If the idea is that "if 1 resource in the list is local, all the resources must be local" (and vice-versa), I think this needs to be spelled out in the spec.
Should the reader bail on the whole list if there is a discrepancy, or ignore entries that are outliers?
Is the first entry of resources the one determining what is expected for the rest of the list?

I am not sure there's a clear usecase for "hybrids" in that context?
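If the spec does settle on an "all local or all remote" rule, a reader could detect hybrids up front rather than letting the first entry decide. A small sketch, assuming every entry has a string "path" and that only http(s) values count as remote; the function name and return labels are invented for illustration.

```python
from urllib.parse import urlparse


def classify_resources(resources):
    """Return "remote", "local", "hybrid", or "empty" for a resources list."""
    kinds = set()
    for res in resources:
        scheme = urlparse(res["path"]).scheme
        kinds.add("remote" if scheme in ("http", "https") else "local")
    if not kinds:
        return "empty"
    return kinds.pop() if len(kinds) == 1 else "hybrid"
```

A reader could then reject "hybrid" lists outright or skip the outliers, depending on what the spec ends up recommending.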

Adjusted example:

{
  "profile": "multi-wacz",
  "title": "My WACZ Aggregation",
  "description": "This web archive contains example data for the MultiWACZ specification",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
     {
       "title": "Archive 1",
       "name": "archive1.wacz",
       "path": "https://example.com/archive/archive1.wacz",
       "hash": "sha256:8a7fc0d302700bed02294404a627ddbbf0e35487565b1c6181c729dff8d2fff6",
       "bytes": 11469796
     },
     {
       "title": "Archive 2",
       "name": "archive2.wacz",
       "path": "https://example.com/archive/archive2.wacz",
       "hash": "sha256:0e7101316ba5d4b66f86a371ee615fbd20f9d3f32d32563ed2c829db062f7714",
       "bytes": 234664545
     }
  ]
}

ZIP Packaging

"Include information here about compression. Should it be similar to WACZ?"

Yes, I personally think so.
My understanding is that this construction only works if all the WACZ files involved are using STORE compression?
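For what it's worth, a packaging tool could check this with Python's zipfile module. The sketch below assumes the concern is about .wacz members stored inside an outer ZIP; it reports any such member that is not using STORE. The function name and the ".wacz" suffix filter are assumptions.

```python
import zipfile


def compressed_members(zip_path, suffix=".wacz"):
    """Return names of members ending in suffix that are not stored uncompressed."""
    with zipfile.ZipFile(zip_path) as zf:
        return [
            info.filename
            for info in zf.infolist()
            if info.filename.endswith(suffix)
            and info.compress_type != zipfile.ZIP_STORED
        ]
```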

Prevent recursion?

Do we want a multi-wacz to be able to include another multi-wacz as a resource, potentially creating chains of multi-waczs?

If not, the spec should probably specify that all the WACZ files referenced in resources of a multi-wacz must not have the multi-wacz profile.
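A reader could enforce such a rule by peeking at each referenced file's datapackage.json before loading it. This is a sketch for local files only, assuming the manifest sits at the root of the ZIP and that the profile string is "multi-wacz" as in the draft example; the function name is hypothetical.

```python
import json
import zipfile


def is_multi_wacz(wacz_path):
    """Return True if the archive's datapackage.json declares the multi-wacz profile."""
    with zipfile.ZipFile(wacz_path) as zf:
        datapackage = json.loads(zf.read("datapackage.json"))
    return datapackage.get("profile") == "multi-wacz"
```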

Signing?

It could be interesting to extend the WACZ Signing spec to multi-wacz. There may be value in signing a compilation of WACZs, even if they are externally sourced.

MultiWacz-level `pages.jsonl`?

I can think of a few use cases for a MultiWACZ-level pages.jsonl, allowing curation of pages coming from multiple WACZs. This could be particularly relevant to large collections that may benefit from curated entry points.

The presence of this file would make it take precedence over all the other pages.jsonl.

Happy to dive into any of these points in more detail if there's interest.

Cheers.

edsu added the documentation Improvements or additions to documentation label Mar 7, 2022

edsu self-assigned this Mar 7, 2022

edsu changed the title ~~Multi-WACZ details~~ WACZ Collections Mar 7, 2022

edsu changed the title ~~WACZ Collections~~ WACZ Aggregations Mar 9, 2022

edsu mentioned this issue Mar 16, 2022

Aggregations #117

Closed

ikreymer mentioned this issue May 20, 2022

Support replay of crawls with multiple WACZ files webrecorder/browsertrix#231

Closed

ikreymer changed the title ~~WACZ Aggregations~~ WACZ Aggregation / Multi WACZ Specification Sep 9, 2022

ikreymer mentioned this issue Oct 31, 2022

New Spec: Nested WACZ files? #129

Open

4 tasks

edsu added 6 commits that referenced this issue Jan 25, 2023 (b97dc99, 43e7b26, 121a831, 06dc82b, 4515776, 86865f6)

MultiWACZ

Adds a new specification MultiWACZ for describing aggregations of WACZ files. Closes #112

edsu linked a pull request Feb 9, 2023 that will close this issue

MultiWACZ #135

Draft

Shrinks99 added this to Webrecorder Projects Jul 18, 2024

github-project-automation bot moved this to Triage in Webrecorder Projects Jul 18, 2024

Shrinks99 moved this from Triage to Todo in Webrecorder Projects Jul 18, 2024

Shrinks99 moved this from Todo to Triage in Webrecorder Projects Jul 18, 2024

wvengen mentioned this issue Oct 22, 2024

Support fetching live resources in downloader middleware q-m/scrapy-webarchive#9

Open
