Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WARC headers parsing when record has Content-Length: 0 and record after it. #42

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

bocharov
Copy link
Contributor

@bocharov bocharov commented Nov 7, 2024

Validated against Python implementation: https://github.com/webrecorder/warcio

Also fixed flaky test record::verify_display by sorting header names.

@bocharov
Copy link
Contributor Author

Hi @jedireza! I appreciate your review on this PR, thanks!

Copy link
Owner

@jedireza jedireza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating a PR.

Fix WARC headers parsing when record has Content-Length: 0 and record after it.

So is there a bug you're trying to fix? If so, could we first add a failing test case?

Also fixed flaky test record::verify_display by sorting header names.

I didn't know this was a flaky test, but I was able to confirm. I'm not sure we should incur the cost of sort during every display.

Here is another way to make verify_display stable:

  fn verify_display() {
        let header_entries = vec![
            (WarcHeader::WarcType, b"dunno".to_vec()),
            (WarcHeader::Date, b"2024-01-01T00:00:00Z".to_vec()),
        ];

        let headers = RawRecordHeader {
            version: "1.0".to_owned(),
            headers: header_entries.into_iter().collect(),
        };

        let output = headers.to_string();

        let expected_lines = vec![
            "WARC/1.0",
            "warc-type: dunno",
            "warc-date: 2024-01-01T00:00:00Z",
            "",
        ];
        let actual_lines: Vec<_> = output.lines().collect();

        let mut expected_headers: Vec<_> = expected_lines[1..expected_lines.len() - 1].to_vec();
        expected_headers.sort();

        let mut actual_headers: Vec<_> = actual_lines[1..actual_lines.len() - 1].to_vec();
        actual_headers.sort();

        // verify parts
        assert_eq!(actual_lines[0], expected_lines[0]); // WARC version
        assert_eq!(actual_headers, expected_headers); // headers (sorted)
        assert_eq!(actual_lines.last(), expected_lines.last()); // empty line
    }

@bocharov
Copy link
Contributor Author

@jedireza Thanks for coming back on this!

The test case zero_and_nonzero_content_length demonstrates the problem and if you rollback next method original logic let mut found_body = expected_body_len == 0; then it fails.

Regarding verify_display I don't feel strongly about specific way to fix flaky test, so I went with your suggestion sorting headers in the test.

…rd after it.

Validated against Python implementation: https://github.com/webrecorder/warcio

Also fixed flaky test `record::verify_display` by sorting header names in the test.
@bocharov bocharov force-pushed the alex/warc-parser-fixes branch from 7f662f8 to 7d7a2d4 Compare November 12, 2024 21:51
Copy link
Owner

@jedireza jedireza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! This makes sense to me now. Had to reacquaint myself with this code and warc a bit.

@bocharov
Copy link
Contributor Author

Thanks for quick response and PR approval! I'd appreciate if you could merge it and release new version of the crate, so that I can update dependencies in my projects instead of using a fork.

@jedireza
Copy link
Owner

jedireza commented Nov 14, 2024

Going to merge this and make it a patch release.

@jedireza jedireza merged commit 0aac9af into jedireza:master Nov 14, 2024
8 checks passed
@jedireza
Copy link
Owner

Published as 0.3.3

https://crates.io/crates/warc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants