Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: parse headers in blocks and scan for magic numbers with memchr #93

Merged
merged 50 commits into from
May 25, 2024

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented May 3, 2024

Motivation

Parsing in Blocks

Aesthetics

We currently parse the zip binary format with a series of e.g. .read_u32_le() calls. While this does work, I believe this approach obscures some of the control flow and makes it more difficult to correlate to the zip specification.

Performance

Additionally, when reading a zip file from a non-buffered stream, I believe a series of individual reads may take longer than a single bulk read (I have not verified this).

Scanning with memchr

We currently perform an extremely inefficient byte-by-byte loop to locate central directory records, e.g.:

zip2/src/spec.rs

Lines 66 to 80 in 103f1ec

while pos >= search_upper_bound {
reader.seek(io::SeekFrom::Start(pos))?;
if reader.read_u32_le()? == CENTRAL_DIRECTORY_END_SIGNATURE {
reader.seek(io::SeekFrom::Current(
BYTES_BETWEEN_MAGIC_AND_COMMENT_SIZE as i64,
))?;
let cde_start_pos = reader.seek(io::SeekFrom::Start(pos))?;
if let Ok(end_header) = CentralDirectoryEnd::parse(reader) {
return Ok((end_header, cde_start_pos));
}
}
pos = match pos.checked_sub(1) {
Some(p) => p,
None => break,
};

We can lean on string scanning utilities like memchr in order to speed up this loop that occurs whenever we read a zip file!

Implementation

  • Create *Block structs declared with #[repr(packed)] so we can extract metadata using bulk reads.
  • Create a Block trait with several utility methods for the different types of "block" objects introduced in this PR.
  • Use memchr to scan for central directory magic numbers.
  • Add benchmarks to demonstrate the speedup when reading zip files with large comment sections.

Result

We get a very noticeable speedup in the new benchmarks!

> git checkout HEAD~1
> cargo bench -- parse
...
running 2 tests
test parse_comment       ... bench:      42,446 ns/iter (+/- 704)
test parse_zip64_comment ... bench:     400,349 ns/iter (+/- 16,882)
> git checkout -
> cargo bench -- parse
...
running 2 tests
test parse_comment       ... bench:      15,307 ns/iter (+/- 682)
test parse_zip64_comment ... bench:     153,766 ns/iter (+/- 9,240)

Note that the benchmarks have been renamed upon review, so the output looks more like this:

test parse_archive_with_comment       ... bench:      16,403 ns/iter (+/- 1,177) = 3055 MB/s
test parse_zip64_archive_with_comment ... bench:     167,784 ns/iter (+/- 7,847) = 2980 MB/s

src/spec.rs Outdated Show resolved Hide resolved
src/spec.rs Show resolved Hide resolved
src/spec.rs Show resolved Hide resolved
src/spec.rs Outdated Show resolved Hide resolved
src/read.rs Outdated Show resolved Hide resolved
src/read/stream.rs Outdated Show resolved Hide resolved
src/spec.rs Fixed Show fixed Hide fixed
src/spec.rs Outdated Show resolved Hide resolved
@Pr0methean Pr0methean added the Please address review comments Some review comments are still open. label May 3, 2024
@Pr0methean
Copy link
Member

Thanks for the update; but I see a number of my comments from earlier are still unresolved. Any clue how soon you'll be able to address them?

@cosmicexplorer
Copy link
Contributor Author

I've since refactored this to cover your comments. The current implementation is not quite complete, but please let me know if this is easier to follow.

src/spec.rs Outdated Show resolved Hide resolved
src/read.rs Outdated Show resolved Hide resolved
src/spec.rs Outdated Show resolved Hide resolved
src/spec.rs Outdated Show resolved Hide resolved
src/spec.rs Outdated Show resolved Hide resolved
src/spec.rs Show resolved Hide resolved
src/read.rs Outdated Show resolved Hide resolved
src/write.rs Fixed Show fixed Hide fixed
@Pr0methean
Copy link
Member

Please feel free to revert cc4fe7f if it helps, as long as if you delete changes then you add them back in a merge PR.

@cosmicexplorer
Copy link
Contributor Author

@Pr0methean:

as long as if you delete changes then you add them back in a merge PR.

I just rebased against master and force pushed, which deleted your merge history. I believe the merge commit was just a bump against master, and not any additional work you added on top of my changes, so I'm not missing anything at fa2c8e1? Just making sure!

src/spec.rs Fixed Show fixed Hide fixed
src/spec.rs Fixed Show fixed Hide fixed
@cosmicexplorer cosmicexplorer force-pushed the bulk-parsing branch 6 times, most recently from 020a308 to 3e04004 Compare May 4, 2024 08:06
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
@Pr0methean Pr0methean enabled auto-merge May 24, 2024 20:11
Signed-off-by: Chris Hennick <[email protected]>
@Pr0methean Pr0methean added waiting to be automatically merged If this PR isn't automatically merged, suspect a broader issue. and removed Please address review comments Some review comments are still open. labels May 24, 2024
src/write.rs Outdated Show resolved Hide resolved
src/write.rs Outdated Show resolved Hide resolved
src/write.rs Outdated Show resolved Hide resolved
Fix errors

Signed-off-by: Chris Hennick <[email protected]>
@Pr0methean Pr0methean added this pull request to the merge queue May 25, 2024
Merged via the queue into zip-rs:master with commit b057d0d May 25, 2024
38 checks passed
@Pr0methean Pr0methean removed the waiting to be automatically merged If this PR isn't automatically merged, suspect a broader issue. label May 28, 2024
@caldwell caldwell mentioned this pull request Oct 23, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major This >~1000 SLOC PR is likely to cause hard-to-resolve conflicts with other open PRs once merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants