WARC-Payload-Digest should only be written for HTTP records #93

JustAnotherArchivist · 2019-09-06T15:41:04Z

The WARC/1.1 specification states that:

The WARC-Payload-Digest field may be used on WARC records with a well-defined payload and shall not be used on records without a well-defined payload. (Section 5.9)

While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a warcinfo or revisit. I'm writing resource records with a number of content types which don't have a payload in the HTTP sense, including application/x-python, application/octet-stream, and text/plain. Of course, in principle, one could also write request and response records for something other than HTTP (e.g. DNS queries), which may or may not have a "well-defined payload".

I think that warcio should only write the payload digest for records with an HTTP Content-Type header.
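A minimal sketch of the proposed rule, assuming that "an HTTP Content-Type header" means the WARC record's own Content-Type is application/http (optionally with a msgtype parameter). The function name is hypothetical and is not part of warcio's API:

```python
def should_write_payload_digest(warc_content_type):
    """Return True only if the record's WARC Content-Type marks an HTTP message.

    Hypothetical helper for illustration; warcio does not expose this function.
    """
    if not warc_content_type:
        return False
    # Drop parameters such as "; msgtype=response" and normalise case.
    media_type = warc_content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/http"
```

Under this reading, a resource record typed application/x-python or application/octet-stream would get a WARC-Block-Digest but no WARC-Payload-Digest.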


ikreymer · 2019-10-11T01:33:26Z

The field is useful to have because it enables revisit records, so one could have a revisit of a resource or metadata record, for example.

The revisit record also has the payload digest, which matches that of the original.
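To illustrate the matching, here is a rough sketch of how such a digest could be computed and compared. It assumes the common "sha1:" label with a Base32-encoded SHA-1 value; the helper name and the sample payload are made up for illustration and this is not warcio's internal code:

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Build a digest string of the form 'sha1:<Base32 SHA-1>' (assumed format)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Hypothetical payload of a resource record captured twice.
first_capture = b"print('hello')\n"
second_capture = b"print('hello')\n"

# Identical payloads give identical digests, which is what lets a revisit
# record point back at the original record.
is_duplicate = payload_digest(first_capture) == payload_digest(second_capture)
```

Nothing in this comparison depends on the payload being an HTTP entity body; it only needs an agreed-upon byte sequence to hash.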

The example resource record actually includes the WARC-Payload-Digest:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1-1_latestdraft.pdf

I don't think any changes are needed here, and I'm not really sure what the 'well-defined payload' is supposed to mean (when is a payload not well defined?)

wumpus · 2019-10-11T05:24:17Z

My testing code has a list of record types that have a well-defined payload, in the WARC sense. I don't think it means an http payload.

optional: warcinfo, response, resource, request, revisit, conversion
prohibited for: metadata, continuation

One of the points of this testing is to flush out disagreements about what the standard says.
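The list above could be encoded roughly as follows. This is only a sketch of that reading of the standard, with hypothetical names; it is not the actual testing code:

```python
# Record types for which WARC-Payload-Digest is optional, per the list above.
PAYLOAD_DIGEST_OPTIONAL = {
    "warcinfo", "response", "resource", "request", "revisit", "conversion",
}
# Record types for which it is prohibited, per the list above.
PAYLOAD_DIGEST_PROHIBITED = {"metadata", "continuation"}


def payload_digest_allowed(record_type: str) -> bool:
    """Return whether a payload digest may appear on this record type."""
    if record_type in PAYLOAD_DIGEST_PROHIBITED:
        return False
    if record_type in PAYLOAD_DIGEST_OPTIONAL:
        return True
    # Unknown or extension record types are not covered by the list.
    raise ValueError(f"unknown WARC record type: {record_type!r}")
```

A checker built this way would flag a metadata record carrying WARC-Payload-Digest, regardless of its content type.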

JustAnotherArchivist · 2019-10-11T10:09:41Z

Good point, revisit records for resource and metadata duplicates would indeed be useful.

I'm not sure either what "record with a well-defined payload" is supposed to mean exactly. I interpret it as "a record that has a Content-Type for which a definition of 'payload' has been specified". If that interpretation is correct, only HTTP records should have a payload digest since that's the only definition given in the spec. If it instead means "a record that contains data for which there is a common understanding of what its payload is", then I agree that it would also cover a number of other content types. Perhaps a discussion on https://github.com/iipc/warc-specifications is in order here for the details.

That said, I believe there is an issue here. I should probably have been more specific in the original report. In qwarc, I write all dependencies of the crawl to a metadata WARC using resource records. These dependencies include a Python script and may also include arbitrary files used by the script. The script is written using an application/x-python content type, which is not officially defined anywhere but common enough that it's clear what its contents – and therefore its payload – are. Since it is impossible to reliably guess the content type, qwarc doesn't attempt to do so for the files; instead, they are written using application/octet-stream. Now, I suppose there are (at least) two ways one could look at this type. The first is that it's a generic type that on its own doesn't have any meaning since it could be any type of data. This is how I see it, and by this interpretation, I don't think it can be considered to have a "well-defined" payload. An alternative view is that application/octet-stream is a container; while it may contain any type of data, on its own, it's just a stream of bytes, and as such, it does have a well-defined payload.
(Minor addition: per RFC 2046, application/octet-stream may also have padding, which I think should not be considered part of the payload even if it is seen as a container type. But since this is only about bit padding to full bytes, I don't think that's of concern.)

Another scenario: what if someone stores a WARC record within a WARC record? An example where this could happen is if a crawler writes a resource record in addition to request/response records, which contains the decoded HTTP body (i.e. transfer encoding removed etc.). There is at least one tool (crocoite) which does something like this with HTML pages, storing the rendered DOM in a resource record, so this is not unreasonable. What would the payload be in that case?

ikreymer closed this as completed Oct 11, 2019

acidus99 mentioned this issue Aug 24, 2023

"warcio check" incorrectly reporting payload digest failures for non-HTTP WARCs #156

Open
