I'm using WARC files with non-HTTP traffic, specifically the Gemini protocol. I'm setting the WARC-Content-Type appropriately to reflect this.
warcio check has been helpful for finding problems in WARCs, such as incorrect block digests or records with invalid content lengths. However, warcio check is incorrectly reporting payload digest failures on these records:
If warcio doesn't understand the protocol defined by a record's WARC-Content-Type field (in this case application/gemini; msgtype=response), it can't know what constitutes the payload for that record, and thus cannot check the WARC-Payload-Digest field. To my knowledge (and from a quick check of the source code) warcio has no concept of the Gemini protocol, so I'm unclear on how it would know what the payload is, or whether the digest is valid. Section 6.3.3 of the WARC spec even says the contents of a response record aren't defined for non-HTTP URI schemes.
Perhaps I misunderstand what can be in a payload digest header, but reporting payload digest failures for unknown protocols seems like a bug? At the very least it's cluttering the output.
Attached is an example WARC with request and response records for Gemini. gemini.warc.gz
Without getting too detailed, a Gemini protocol response consists of a single response line with a status code and MIME type, a single CRLF, and then the body of the response. This body is the Gemini equivalent of HTTP's entity-body per section 6.3.2. In the WARC example, the body of the response begins at offset 1338 in the uncompressed version of the WARC file (at the '#' character). The body runs to the end of the record, stopping before the final double CRLF that signifies the end of the record. The sha256 for this body is 20670b53ae319b676698eb1aec228b492328574d78c1425b6b68a77876763403, which is used in the payload digest field so I can do deduping and generate indexes.
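To make the payload definition above concrete, here is a minimal sketch of how I compute the digest. The function name gemini_payload_digest and the sample response bytes are made up for illustration; this is not warcio code, and it assumes the record block has already been read into memory as bytes.

```python
import hashlib


def gemini_payload_digest(response: bytes) -> str:
    """Return the hex SHA-256 of a Gemini response body.

    The body is taken to be everything after the first CRLF, i.e. after
    the single response line (status code and MIME type).
    """
    _response_line, sep, body = response.partition(b"\r\n")
    if not sep:
        # No CRLF means there is no complete response line to split on.
        raise ValueError("no CRLF-terminated response line found")
    return hashlib.sha256(body).hexdigest()


# Hypothetical sample response, not the contents of the attached WARC.
example = b"20 text/gemini\r\n# Hello\n"
print(gemini_payload_digest(example))
```

The input here is the record block only, without the double CRLF that terminates the WARC record.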
My suggestion would be that warcio check should not check the payload digest for records whose WARC-Content-Type is an unknown protocol. This would allow future PRs to warcio that support other protocols.
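As a rough sketch of the rule I have in mind (the names KNOWN_PAYLOAD_TYPES and can_verify_payload_digest are hypothetical and do not exist in warcio, and I'm assuming HTTP is the only protocol whose payload it currently understands), the check could be gated on the media type:

```python
# Hypothetical allow-list of media types whose payload boundaries are understood.
# Assumption: only HTTP records qualify today.
KNOWN_PAYLOAD_TYPES = {"application/http"}


def can_verify_payload_digest(content_type: str) -> bool:
    """Return True only if the record's media type is one we can parse."""
    # Drop parameters such as "; msgtype=response" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in KNOWN_PAYLOAD_TYPES


print(can_verify_payload_digest("application/http; msgtype=response"))
print(can_verify_payload_digest("application/gemini; msgtype=response"))
```

Supporting a new protocol would then mean adding its media type to the set along with a parser that knows where its payload starts.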
Perhaps there is another discussion to be had here on "iipc/warc-specifications". I would suggest that the payload definition shouldn't be codified into the WARC spec. Tools should be able to work with new protocols and payloads, and shouldn't make assumptions about what constitutes a payload for protocols/URIs they don't understand.