WARC-Payload-Digest should only be written for HTTP records #93
Comments
The field is useful to have to allow revisit records, so could have a revisit of a resource or metadata record, for example. The revisit record also has the payload digest, which matches that of the original. The example Don't think any changes are needed here, and not really sure what the 'well-defined payload' is supposed to mean (when is a payload not well defined?) |
My testing code has a list of record types that have a well-defined payload, in the WARC sense. I don't think it means an http payload.
One of the points of this testing is to flush out disagreements about what the standard says. |
Good point, revisit records for resource and metadata duplicates would indeed be useful. I'm not sure either what "record with a well-defined payload" is supposed to mean exactly. I interpret it as "a record that has a Content-Type for which a definition of 'payload' has been specified". If that interpretation is correct, only HTTP records should have a payload digest since that's the only definition given in the spec. If it instead means "a record that contains data for which there is a common understanding of what its payload is", then I agree that it would also cover a number of other content types. Perhaps a discussion on https://github.com/iipc/warc-specifications is in order here for the details.
That said, I believe there is an issue here. I should probably have been more specific in the original report. In qwarc, I write all dependencies of the crawl to a metadata WARC using resource records. These dependencies include a Python script and may also include arbitrary files used by the script. The script is written using an
Another scenario: what if someone stores a WARC record within a WARC record? An example where this could happen is if a crawler writes a resource record in addition to request/response records, which contains the decoded HTTP body (i.e. transfer encoding removed etc.). There is at least one tool (crocoite) which does something like this with HTML pages, storing the rendered DOM in a resource record, so this is not unreasonable. What would the payload be in that case? |
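To make the distinction in this comment concrete, here is a minimal sketch of how a block digest and an HTTP-style payload digest could differ for the same record block. It is not warcio's implementation: the helper names `sha1_base32` and `http_digests` are hypothetical, the `sha1:<base32>` label is just the form commonly seen in WARC files, and the sketch assumes the payload is everything after the first blank line of the HTTP message, with no transfer-encoding or content-encoding removed.

```python
import base64
import hashlib
from typing import Tuple


def sha1_base32(data: bytes) -> str:
    """Return a digest label in the commonly used 'sha1:<base32>' form."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def http_digests(block: bytes) -> Tuple[str, str]:
    """Return (block_digest, payload_digest) for a block holding an HTTP message.

    Assumption: the payload is whatever follows the first CRLF CRLF,
    taken as-is (no transfer-encoding or content-encoding is removed).
    """
    header_end = block.find(b"\r\n\r\n")
    if header_end == -1:
        raise ValueError("no HTTP header terminator found")
    payload = block[header_end + 4:]
    return sha1_base32(block), sha1_base32(payload)


block = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
block_digest, payload_digest = http_digests(block)
```

For a `resource` record holding, say, a Python script, there is no header/body split to perform, which is exactly why it is unclear what a payload digest distinct from the block digest would mean there.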
The WARC/1.1 specification states that:
While a payload can certainly be defined for other data as well, the spec only does so for HTTP (cf. #74). However, warcio writes a payload digest indiscriminately for any record that isn't a `warcinfo` or `revisit`. I'm writing `resource` records with a number of content types which don't have a payload in the HTTP sense, including `application/x-python`, `application/octet-stream`, and `text/plain`. Of course, in principle, one could also write `request` and `response` records for something other than HTTP (e.g. DNS queries) which may or may not have a "well-defined payload".
I think that warcio should only write the payload digest for records with an HTTP `Content-Type` header.
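The proposed rule can be sketched as a small predicate on the record's `Content-Type` value. This is an illustration only, not warcio code: the function name `has_http_payload` is hypothetical, and it assumes that "an HTTP `Content-Type`" means the `application/http` media type, with or without a `msgtype` parameter.

```python
from typing import Optional


def has_http_payload(content_type: Optional[str]) -> bool:
    """Return True if the WARC record's Content-Type denotes an HTTP message.

    Assumption: only the 'application/http' media type counts; parameters
    such as 'msgtype=request' or 'msgtype=response' are ignored.
    """
    if not content_type:
        return False
    # Drop any parameters after ';' and normalise case and whitespace.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/http"


# A writer following the proposal would emit WARC-Payload-Digest only when
# this returns True, and rely on WARC-Block-Digest alone otherwise.
for value in ("application/http; msgtype=response", "application/x-python", "text/plain"):
    print(value, has_http_payload(value))
```

Under this rule, the `resource` records mentioned above would carry only a block digest, while HTTP `request` and `response` records would keep both.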