WACZ Aggregation / Multi WACZ Specification #112

Open
edsu opened this issue Mar 7, 2022 · 17 comments · May be fixed by #135

@edsu
Collaborator

edsu commented Mar 7, 2022

Details about how to aggregate multiple WACZ files into a single WACZ need to be added to the specification. This hinges on resources in the datapackage.json using a url for a WACZ rather than a path. See the Resource Information section in the Data Package specification for details:

{
   "resources": [
      {"hash": "...", "url": "https://example.com/filename_1.wacz", "bytes": "..."},
      {"hash": "...", "url": "https://example.com/filename_2.wacz", "bytes": "..."}
   ]
   ...
}

There should also be a Data Package profile so that clients can easily distinguish between collections and regular WACZ files. Perhaps WACZ-Aggregation?
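As a rough illustration, a client could branch on the profile before deciding how to load a file. This is a minimal sketch, not part of any spec: the profile value `wacz-aggregation` is only the name floated above, and `is_aggregation` is a hypothetical helper.

```python
import json

# Hypothetical profile name, taken from the suggestion above; not finalized.
AGGREGATION_PROFILE = "wacz-aggregation"


def is_aggregation(datapackage_text: str) -> bool:
    """Return True if a datapackage.json declares the aggregation profile."""
    package = json.loads(datapackage_text)
    # Compare case-insensitively, since the exact spelling is undecided.
    return str(package.get("profile", "")).lower() == AGGREGATION_PROFILE
```

A viewer could call this once after fetching the JSON and fall back to regular single-WACZ handling when it returns False.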

The specification should document that WACZ users MAY want to use the datapackage.json as a place to record additional metadata about crawls. See the browsertrix-cloud API for examples.

@edsu edsu added the documentation Improvements or additions to documentation label Mar 7, 2022
@edsu edsu self-assigned this Mar 7, 2022
@edsu edsu changed the title Multi-WACZ details WACZ Collections Mar 7, 2022
@ikreymer
Member

ikreymer commented Mar 8, 2022

I wonder if this should be a separate specification, to avoid confusion. It is no longer a WACZ, which is a specific format, but a collection of WACZs. It would probably be just a .json file (the datapackage.json) to start, which is a different specification, though of course it depends on WACZ.

The collection spec is sort of independent from the file format spec imo.

@edsu
Collaborator Author

edsu commented Mar 9, 2022

I don't think it should be separate. Unless I am misunderstanding (which is very possible) an understanding of a WACZ Aggregation without an understanding of WACZ would be pretty much useless. Why make it more difficult to manage as two separate specifications?

@ikreymer
Member

ikreymer commented Mar 9, 2022

> I don't think it should be separate. Unless I am misunderstanding (which is very possible) an understanding of a WACZ Collection without an understanding of WACZ would be pretty much useless. Why make it more difficult to manage as two separate specifications?

Well, it's really a completely different format: an aggregate format of WACZ (and possibly other types) to form a collection.
The only overlap is that we are currently thinking of this as a Frictionless Data Package too (though maybe that doesn't make sense, given that it uses a url instead of an internal path).

Maybe it should really be called a 'web archive collection', which consists of collection-level data.

For example, maybe a collection could have a mix of WACZ and regular WARC files, which could both be listed in the resources section. As conceived now, it's just a single JSON file, though that could also change.

@ikreymer
Member

ikreymer commented Mar 9, 2022

Another aspect that the 'WACZ aggregate' could have is a page-to-wacz mapping, along with resources, for example:

{
  "resources": [
    {
      "name": "example-com-crawl",
      "url": "https://store.example.org/example-com-crawl.wacz",
      "hash": "...",
      "bytes": "..."
    },
    {
      "name": "another-site-com-crawl",
      "url": "https://store.example.org/another-site-com-crawl.wacz",
      "hash": "...",
      "bytes": "..."
    }
  ],
  "pages": {
    "https://example.com/": {"filename": "example-com-crawl"},
    "https://another-site.example.com": {"filename": "another-site-com-crawl"}
  }
}

This would help route the page to the correct WACZ file; otherwise, a client would need to search through all of them.
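A minimal sketch of that routing, assuming `pages` is a JSON object keyed by page URL and that each `filename` refers to a resource's `name`. The helper name `wacz_url_for_page` and the sample manifest are hypothetical.

```python
from typing import Optional

# Hypothetical aggregate manifest following the shape proposed above;
# "pages" is treated as an object keyed by page URL.
aggregate = {
    "resources": [
        {"name": "example-com-crawl",
         "url": "https://store.example.org/example-com-crawl.wacz"},
        {"name": "another-site-com-crawl",
         "url": "https://store.example.org/another-site-com-crawl.wacz"},
    ],
    "pages": {
        "https://example.com/": {"filename": "example-com-crawl"},
        "https://another-site.example.com": {"filename": "another-site-com-crawl"},
    },
}


def wacz_url_for_page(manifest: dict, page_url: str) -> Optional[str]:
    """Look up which WACZ holds a page, or None if the page is not listed."""
    entry = manifest.get("pages", {}).get(page_url)
    if entry is None:
        return None
    # Index resources by name so the page entry can be resolved to a URL.
    by_name = {r["name"]: r["url"] for r in manifest.get("resources", [])}
    return by_name.get(entry["filename"])
```

A page missing from the mapping returns None, at which point a client would have to fall back to searching each WACZ.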

@edsu edsu changed the title WACZ Collections WACZ Aggregations Mar 9, 2022
@edsu
Collaborator Author

edsu commented Mar 9, 2022

I like WACZ Aggregation better than WACZ Collection, and have just updated some of the text above.

@edsu
Collaborator Author

edsu commented Mar 9, 2022

I guess my point is that if someone is building a WACZ client, as we want people to do, they will want to be clear about what their viewer needs to do. I think the easiest way to communicate that is with a single specification about what WACZ support means. If implementors need to digest multiple specifications I think we will risk losing them.

@DiegoPino

DiegoPino commented Mar 9, 2022 via email

@ikreymer
Member

> Hi. Maybe something like a IIIF collection manifest (so just a JSON or JSON-LD) might make sense here?

Hm, that's an interesting idea. Yes, it's basically a collection manifest for multiple WACZs (and possibly WARCs); maybe that's the best way to look at it.

@edsu
Collaborator Author

edsu commented Mar 16, 2022

That's a useful comparison @DiegoPino!

It might be helpful to think about the WACZ specification as the equivalent of the Image API and this Aggregated WACZ view as the equivalent of the Presentation API. I could even imagine expressing the Aggregated WACZ as a IIIF Manifest, since the IIIF Presentation API is oriented around the abstract idea of Canvases that can include images, video, audio. Why not a web archive too? But I fear that this might be a bit of a leaky abstraction, because the sequencing of WACZs doesn't make a lot of sense in this context. Also, the WACZ itself already contains a lot of presentation metadata.

While the separation between Image and Presentation APIs in IIIF allows it to be more general I also think it was because they were developed separately in time. As someone who had to implement support for them at one point I can say that understanding and tracking them as separate specifications sometimes proved challenging. But I suspect other people may feel differently about that.

I think what you've identified here is that there's a bit of slippage between the WACZ media type (the ZIP file on the web) and WACZ as an API, which we see developing in tools like Browsertrix Cloud and which relates to other work like WASAPI. I also think this crops up in the Unzipped WACZ use case...

@edsu
Collaborator Author

edsu commented Mar 16, 2022

I noticed an inconsistency in the Frictionless DataPackage specification around URLs in Resources.

The Data Resource specification says:

path MUST be a string – or an array of strings (see “Data in Multiple
Files”). Each string MUST be a “url-or-path” as defined in the next section.

whereas another Data Package specification distinguishes between url and path. I think we should use the former (url-or-path) since it maps to implementations we have already, and the url usage seems to be from an earlier version of the DataPackage specification.
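For illustration, here is a minimal sketch of telling the two apart under the url-or-path rule. It assumes one reading of the Data Resource specification (URLs are fully qualified http or https; paths are relative POSIX paths with no leading `/` and no `..` segments). `classify_url_or_path` is a hypothetical helper, not an existing API.

```python
from urllib.parse import urlparse


def classify_url_or_path(value: str) -> str:
    """Classify a Data Resource `path` string as "url" or "path".

    Raises ValueError for values that fit neither form.
    """
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    if parsed.scheme:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    # Relative POSIX paths only: no empty value, no absolute paths,
    # and no parent-directory segments.
    if not value or value.startswith("/") or ".." in value.split("/"):
        raise ValueError(f"unsafe path: {value!r}")
    return "path"
```

With this rule a single `path` property covers both a WACZ's internal files and the remote WACZ files listed by an aggregation.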

edsu added a commit that referenced this issue Mar 16, 2022
This commit adds some minimal information about aggregations. It is
likely we will need to revisit this section as implementations start to
use it. I think it will also relate to the TBD section in the spec on
"unzipped" WACZ #96.

Closes #112
@edsu edsu mentioned this issue Mar 16, 2022
@ikreymer
Member

Yes, that may be a good comparison. Thinking about it more, I think aggregation should really be a separate spec, but in this repo for now, because:

  • The aggregation spec is still in development, and should be versioned separately
  • Tools can work with single WACZ files without needing to support aggregation, or vice versa.
  • Different properties: WACZ file is designed to be fixed, while aggregations will be updated to include more WACZ files

I think it'll be really confusing to combine the more experimental aggregation format into the core format specification, which is already in use. I think aggregations should be a separate file, similar to use-cases, and probably have their own version for now. I imagine it'll need a bit more iteration before it's put into use.

@ikreymer
Member

Not to expand the immediate scope, but maybe this really needs its own repo, wacz-aggregation-spec.
I imagine there'll be a host of use cases:

  • aggregation for wacz from a single crawl
  • aggregation for wacz across multiple crawls / collections of crawls
  • nested aggregation, eg. a collection aggregation that points to a list of crawl aggregations
  • immutable vs mutable, so can put an aggregation on IPFS and reference from other aggregations

We certainly don't have to figure this out now, just have a plan for how to tackle it in the future!

@edsu
Collaborator Author

edsu commented Mar 18, 2022

If aggregation isn't core to using WACZ then I agree it should be a separate specification, or perhaps just be in the description of how a specific tool works, and not a standard at all.

The purpose of positioning WACZ as a standard is to help others who are creating tools that use WACZ (crawlers, viewers, indexers, aggregators, etc). The specification should document the minimal requirements that allow developers to do that work, and also provide flexibility for them to add things that they need.

If it is to be a standard it's important that the WACZ specification not turn into a detailed description of how one specialized implementation works. In order to encourage adoption it's also important not to make understanding WACZ too difficult with multiple interlocking specifications. Having one, concise description that covers the core use case of viewing a WACZ would help address these concerns.

I just wanted to go on record with that recommendation. I personally don't think a changing WACZ specification is a concern at this stage since it isn't a standard yet, and we actually want it to change!

@edsu
Collaborator Author

edsu commented Mar 31, 2022

After some conversation @ikreymer decided that the Multi-WACZ or WACZ Aggregations is best served by a separate specification. This issue can stay open until that new repo & specification exist.

@DiegoPino

I agree with you both @edsu and @ikreymer, also you want to maintain WACZ (for so many reasons) still close to the original Frictionless Data package spec.

@ikreymer ikreymer changed the title WACZ Aggregations WACZ Aggregation / Multi WACZ Specification Sep 9, 2022
@ikreymer
Member

ikreymer commented Sep 9, 2022

Wanted to come back to this issue, now that we have all the specs in this issue, and list a new use case: multiple WACZ files grouped together for web-replay-gen, to be viewed as part of a static site but not necessarily merged together.

I think we really need a JSON schema / format that covers several use cases of grouping and declaring WACZ files and collections.

We now have at least the following use cases:

  1. Multiple WACZ files that need to be 'merged' for replay, eg. example.com-part-1.wacz and example.com-part-2.wacz. The above suggestions with a Frictionless Data Package apply to this use case, where the JSON manifest is loaded by replayweb.page. The files should behave the same as if they were one file, but are split for various reasons (eg. each file may be too big, or were simply the output of multiple parallel crawlers). The two files have the same user metadata as well, eg. some pages of example.com are in one, and some are in the other.

  2. Multiple WACZ files that are each self-contained, and can be loaded in replayweb.page individually, and may be presented with separate metadata, and with a distinct viewer page for each one, similar to the examples in https://github.com/webrecorder/example-webarchive. Ideally, the schema for such files can be used to auto-generate a static site, for example using web-replay-gen, where each WACZ represents a distinct 'object'. An existing example (created using a custom schema format) is: https://sup.webrecorder.net/

  3. Combination of 1 and 2, where some WACZ files are part of same object, and some are distinct, eg.

- big-site.example.com crawl
    - big-site.example-1.wacz
    - big-site.example-2.wacz

- another-site.example.com crawl

The schema should probably support collection level metadata (title, description, etc...) as well as an optional list of pages?

Also, it may be useful to be able to declare a list of WACZ files via some path prefix, such as s3://bucket/some/path/, eg.

- big-site.example.com crawl
    - s3://bucket/big-site/crawl/

which might tell the tool reading the file to get all the WACZ files in the path prefix. (Will probably want to support different URL schemes for this, including http, s3, local, ipfs, etc...)
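A minimal sketch of how a tool might dispatch on the prefix's URL scheme. Only local directories are handled here; the other schemes are left unimplemented because each needs its own listing mechanism. The function name `expand_prefix` and the `*.wacz` filename filter are assumptions for illustration.

```python
from pathlib import Path
from typing import List
from urllib.parse import urlparse


def expand_prefix(prefix: str) -> List[str]:
    """Expand a path prefix into a sorted list of WACZ file locations.

    Handles local directories (bare paths or file:// URLs) only; s3, http,
    ipfs and other schemes would each need their own listing code.
    """
    parsed = urlparse(prefix)
    if parsed.scheme in ("", "file"):
        root = Path(parsed.path if parsed.scheme == "file" else prefix)
        # Non-recursive: only WACZ files directly under the prefix.
        return sorted(str(p) for p in root.glob("*.wacz"))
    raise NotImplementedError(f"listing not implemented for {parsed.scheme!r}")
```

One open design question this surfaces is whether expansion should be recursive, and whether the result is resolved once (when the manifest is written) or on every load.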

It may make sense to split this issue into multiple ones as well, but wanted to jot this down here for now :)

edsu added a commit that referenced this issue Jan 25, 2023
Adds a new specification MultiWACZ for describing aggregations of WACZ
files.

Closes #112
@edsu edsu linked a pull request Feb 9, 2023 that will close this issue
@matteocargnelutti
Contributor

matteocargnelutti commented Apr 27, 2023

Hello 👋 !

@ikreymer brought this draft spec to my attention, because it touches on some of the threads I am currently pulling, compiling large collections using a single WACZ file.

I think this draft is excellent, @edsu!

I have a few questions/suggestions, in no particular order.


Properties of resources entries

Example currently in the draft:

{
  "profile": "multi-wacz",
  "title": "My WACZ Aggregation",
  "description": "This web archive contains example data for the MultiWACZ specification",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
     {
       "name": "Archive 1",
       "path": "https://example.com/archive/archive1.wacz"
     },
     {
       "name": "Archive 2",
       "path": "https://example.com/archive/archive2.wacz"
     }
  ]
}

hash and bytes properties:

I think the hash and bytes properties should be present, in order for the reader to be able to verify that the WACZ files it's reading are indeed the ones that were originally referenced when creating this multi-wacz manifest.

This might be especially important when pulling remote files, which might have changed in the meantime.

These fields may be optional, but I think the spec needs to make recommendations as to how the reader should behave based on the absence / presence of these fields, and what to do when a check fails.

For example, the spec could encourage showing a confirmation prompt or warning when a hash / byte length check fails.

Checking for remote hashes can sometimes be challenging, which the spec could account for in its recommendations.
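To make the suggestion concrete, here is a minimal sketch of the check a reader could run once it has the bytes of a referenced WACZ. It assumes the `algorithm:hexdigest` hash format used in the adjusted example further down, treats both fields as optional, and returns problems instead of raising so the caller can decide whether to warn or prompt. `verify_resource` and `SUPPORTED_ALGORITHMS` are hypothetical names.

```python
import hashlib
from typing import List

# Assumption: only these digest algorithms are accepted in this sketch.
SUPPORTED_ALGORITHMS = ("sha256", "sha512")


def verify_resource(resource: dict, data: bytes) -> List[str]:
    """Compare fetched WACZ bytes with a resource entry's `bytes` and `hash`.

    Returns a list of human-readable problems; an empty list means every
    check that could be performed passed. Absent fields are skipped.
    """
    problems = []
    expected_size = resource.get("bytes")
    if expected_size is not None and expected_size != len(data):
        problems.append(f"size mismatch: expected {expected_size}, got {len(data)}")
    expected_hash = resource.get("hash")
    if expected_hash is not None:
        algorithm, _, digest = expected_hash.partition(":")
        if algorithm not in SUPPORTED_ALGORITHMS:
            problems.append(f"unsupported hash algorithm: {algorithm!r}")
        elif hashlib.new(algorithm, data).hexdigest() != digest.lower():
            problems.append("hash mismatch")
    return problems
```

This version needs the whole file in memory, which is exactly where remote files get awkward: for large remote WACZs a reader would more likely hash while streaming, or check only `bytes` against the reported content length.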

name vs title:

The name properties listed here should maybe be title?
I think we might want to keep name for the filename portion of path.

Two reasons for that:

  • The Data Package Spec mentions that name must only contain lowercase characters, which seems unsuitable for file "titles".
  • I think the closer this datapackage.json format is to the one we use for single WACZs, the better.

Packaged vs remote, or hybrids?

Maybe the spec should clarify whether a given list of resources can feature both local and remote files.

If the idea is that "if 1 resource in the list is local, all the resources must be local" (and vice-versa), I think this needs to be spelled out in the spec.
Should the reader bail on the whole list if there is a discrepancy, or ignore entries that are outliers?
Is the first entry of resources the one determining what is expected for the rest of the list?

I am not sure there's a clear use case for "hybrids" in that context?
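Whichever rule the spec settles on, a reader would first need to detect the mix. A minimal sketch, assuming that any `path` with an http or https scheme counts as remote and everything else as local; `location_kinds` and `is_hybrid` are hypothetical helpers.

```python
from typing import Set
from urllib.parse import urlparse


def location_kinds(resources: list) -> Set[str]:
    """Return which kinds of location ("remote", "local") the resources use."""
    kinds = set()
    for resource in resources:
        scheme = urlparse(resource["path"]).scheme
        kinds.add("remote" if scheme in ("http", "https") else "local")
    return kinds


def is_hybrid(resources: list) -> bool:
    """True when a resources list mixes local and remote entries."""
    return len(location_kinds(resources)) > 1
```

Looking at the whole list, instead of letting the first entry set the expectation, keeps the result independent of entry order.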

Adjusted example:

{
  "profile": "multi-wacz",
  "title": "My WACZ Aggregation",
  "description": "This web archive contains example data for the MultiWACZ specification",
  "created": "2023-01-25T12:00:00.48Z",
  "resources": [
     {
       "title": "Archive 1",
       "name": "archive1.wacz",
       "path": "https://example.com/archive/archive1.wacz",
       "hash": "sha256:8a7fc0d302700bed02294404a627ddbbf0e35487565b1c6181c729dff8d2fff6",
       "bytes": 11469796
     },
     {
       "title": "Archive 2",
       "name": "archive2.wacz",
       "path": "https://example.com/archive/archive2.wacz",
       "hash": "sha256:0e7101316ba5d4b66f86a371ee615fbd20f9d3f32d32563ed2c829db062f7714",
       "bytes": 234664545
     }
  ]
}

ZIP Packaging

"Include information here about compression. Should it be similar to WACZ?"

Yes, I personally think so.
My understanding is that this construction only works if all the WACZ files involved are using STORE compression?
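If that turns out to be a requirement, it is cheap to check. A minimal sketch using Python's standard `zipfile` module, which inspects each entry of a ZIP (here, a packaged multi-WACZ) and reports any entry that is not stored uncompressed. `compressed_entries` is a hypothetical helper name.

```python
import zipfile
from typing import List


def compressed_entries(zip_file) -> List[str]:
    """List ZIP entries that are not stored uncompressed (STORE).

    `zip_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return [
            info.filename
            for info in archive.infolist()
            if info.compress_type != zipfile.ZIP_STORED
        ]
```

An empty result means every entry uses STORE, so byte ranges inside the outer ZIP map directly onto the inner files.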


Prevent recursion?

Do we want a multi-wacz to be able to include another multi-wacz as a resource, potentially creating chains of multi-waczs?

If not, the spec should probably specify that all the WACZ files referenced in resources of a multi-wacz must not have the multi-wacz profile.
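A reader could enforce such a rule with a check like the sketch below. It assumes each referenced WACZ is a ZIP with a top-level `datapackage.json`, and that the profile value is the `multi-wacz` string from the draft example; `declares_multi_wacz` is a hypothetical helper.

```python
import json
import zipfile


def declares_multi_wacz(wacz_file) -> bool:
    """Return True if the archive's datapackage.json has the multi-wacz profile.

    `wacz_file` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as archive:
        package = json.loads(archive.read("datapackage.json"))
    return package.get("profile") == "multi-wacz"
```

A reader that forbids nesting would reject any resource for which this returns True; one that allows it would instead need a depth limit or a visited set to avoid cycles.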


Signing?

It could be interesting to extend the WACZ Signing spec to multi-wacz. There may be value in signing a compilation of WACZs, even if they are externally sourced.


MultiWacz-level pages.jsonl?

I can think of a few use cases for a MultiWACZ-level pages.jsonl, which would allow curating pages coming from multiple WACZs. This could be particularly relevant to large collections that may benefit from curated entry points.

The presence of this file would make it take precedence over all the other pages.jsonl.
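The precedence rule could look like the following minimal sketch. It assumes pages.jsonl is one JSON object per line, that page records carry a `url` field, and that any line without one (such as a header record) can be skipped. `parse_pages` and `entry_pages` are hypothetical names.

```python
import json
from typing import Dict, List, Optional


def parse_pages(jsonl_text: str) -> List[dict]:
    """Parse pages.jsonl text, keeping only records that have a `url`."""
    pages = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if "url" in record:
            pages.append(record)
    return pages


def entry_pages(top_level: Optional[str], per_wacz: Dict[str, str]) -> List[dict]:
    """Pick the page list a reader should present.

    A MultiWACZ-level pages.jsonl, when present, takes precedence;
    otherwise the page lists of the individual WACZs are concatenated.
    """
    if top_level is not None:
        return parse_pages(top_level)
    merged: List[dict] = []
    for text in per_wacz.values():
        merged.extend(parse_pages(text))
    return merged
```

With this rule the per-WACZ lists are ignored whenever a curated list exists, so a curated entry would probably also need to say which WACZ each page lives in.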


Happy to dive into any of these points in more detail if there's interest.

Cheers.
