WARC-Payload-Digest should only be written for HTTP records #93

Closed
JustAnotherArchivist opened this issue Sep 6, 2019 · 3 comments

@JustAnotherArchivist
Contributor

The WARC/1.1 specification states that:

The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload. (Section 5.9)

While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a warcinfo or revisit. I'm writing resource records with a number of content types which don't have a payload in the HTTP sense, including application/x-python, application/octet-stream, and text/plain. Of course, in principle, one could also write request and response records for something other than HTTP (e.g. DNS queries), which may or may not have a "well-defined payload".

I think that warcio should only write the payload digest for records with an HTTP Content-Type header.
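To make the proposal concrete, here is a minimal sketch of the check I have in mind. It is not warcio code: `has_http_content_type` is a hypothetical helper, and it assumes that HTTP records are identified by a WARC `Content-Type` of `application/http` (with an optional `msgtype` parameter), as the WARC spec describes for request and response records.

```python
def has_http_content_type(warc_content_type):
    """Return True if the record-level Content-Type is application/http.

    Hypothetical helper, not part of warcio. Parameters such as
    "; msgtype=response" are ignored for the comparison.
    """
    if not warc_content_type:
        return False
    # Keep only the media type, dropping any parameters after ';'.
    media_type = warc_content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/http"
```

Under this rule, a resource record written as application/octet-stream or application/x-python would get no WARC-Payload-Digest header.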

@ikreymer
Member

The field is useful to have because it allows revisit records: you could have a revisit of a resource or metadata record, for example.

The revisit record also has the payload digest, which matches that of the original.
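As a rough illustration of why the digest matters for revisits, the sketch below compares a new payload against the digest stored on an earlier record. It assumes the commonly seen `sha1:` label followed by a Base32-encoded SHA-1 value; `payload_digest` and `is_revisit_candidate` are hypothetical names, not warcio functions.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Labelled digest of a payload, assuming SHA-1 with Base32 encoding."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def is_revisit_candidate(original_digest: str, new_payload: bytes) -> bool:
    """True if the new payload matches the original record's payload digest."""
    return payload_digest(new_payload) == original_digest
```

Without a payload digest on the original resource or metadata record, there is nothing for the revisit record to match against.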

The example resource record actually includes the WARC-Payload-Digest:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf

I don't think any changes are needed here, and I'm not really sure what "well-defined payload" is supposed to mean (when is a payload not well defined?).

@wumpus
Collaborator

wumpus commented Oct 11, 2019

My testing code has a list of record types that have a well-defined payload, in the WARC sense. I don't think it means an http payload.

  • optional: warcinfo, response, resource, request, revisit, conversion
  • prohibited for: metadata, continuation

One of the points of this testing is to flush out disagreements about what the standard says.
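The list above could be expressed as a lookup table along these lines. This is only a sketch of that interpretation, not the actual testing code; the names `PAYLOAD_DIGEST_RULES` and `violates_payload_digest_rule` are made up for illustration, and `headers` is assumed to be a plain dict of WARC header names to values.

```python
# Interpretation from the list above: "optional" means the header may
# appear, "prohibited" means it must not.
PAYLOAD_DIGEST_RULES = {
    "warcinfo": "optional",
    "response": "optional",
    "resource": "optional",
    "request": "optional",
    "revisit": "optional",
    "conversion": "optional",
    "metadata": "prohibited",
    "continuation": "prohibited",
}


def violates_payload_digest_rule(record_type, headers):
    """True if a record carries WARC-Payload-Digest where it is prohibited.

    Unknown record types are not flagged (an assumption of this sketch).
    """
    rule = PAYLOAD_DIGEST_RULES.get(record_type)
    return rule == "prohibited" and "WARC-Payload-Digest" in headers
```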

@JustAnotherArchivist
Contributor Author

Good point, revisit records for resource and metadata duplicates would indeed be useful.

I'm not sure either what "record with a well-defined payload" is supposed to mean exactly. I interpret it as "a record that has a Content-Type for which a definition of 'payload' has been specified". If that interpretation is correct, only HTTP records should have a payload digest since that's the only definition given in the spec. If it instead means "a record that contains data for which there is a common understanding of what its payload is", then I agree that it would also cover a number of other content types. Perhaps a discussion on https://github.com/iipc/warc-specifications is in order here for the details.

That said, I believe there is an issue here. I should probably have been more specific in the original report. In qwarc, I write all dependencies of the crawl to a metadata WARC using resource records. These dependencies include a Python script and may also include arbitrary files used by the script. The script is written using an application/x-python content type, which is not officially defined anywhere but common enough that it's clear what its contents – and therefore its payload – is. Since it is impossible to reliably guess the content type, qwarc doesn't attempt to do so for the files; instead, they are written using application/octet-stream.

Now, I suppose there are (at least) two ways one could look at this type. The first is that it's a generic type that on its own doesn't have any meaning since it could be any type of data. This is how I see it, and by this interpretation, I don't think it can be considered to have a "well-defined" payload. An alternative view is that application/octet-stream is a container; while it may contain any type of data, on its own, it's just a stream of bytes, and as such, it does have a well-defined payload.
(Minor addition: per RFC 2046, application/octet-stream may also have padding, which I think should not be considered part of the payload even if it is seen as a container type. But since this is only about bit padding to full bytes, I don't think that's of concern.)

Another scenario: what if someone stores a WARC record within a WARC record? An example where this could happen is if a crawler writes a resource record in addition to request/response records, which contains the decoded HTTP body (i.e. transfer encoding removed etc.). There is at least one tool (crocoite) which does something like this with HTML pages, storing the rendered DOM in a resource record, so this is not unreasonable. What would the payload be in that case?
