-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: parse headers in blocks and scan for magic numbers with memchr #93
Conversation
9437681
to
0c2def2
Compare
867915c
to
00acf23
Compare
00acf23
to
f940c2b
Compare
Thanks for the update; but I see a number of my comments from earlier are still unresolved. Any clue how soon you'll be able to address them? |
7819dd7
to
f70aef4
Compare
I've since refactored this to cover your comments. The current implementation is not quite complete, but please let me know if this is easier to follow. |
2f74396
to
180bcdb
Compare
Please feel free to revert cc4fe7f if it helps, as long as if you delete changes then you add them back in a merge PR. |
cc4fe7f
to
fa2c8e1
Compare
I just rebased against master and force pushed, which deleted your merge history. I believe the merge commit was just a bump against master, and not any additional work you added on top of my changes, so I'm not missing anything at fa2c8e1? Just making sure! |
9f579f4
to
93a7bf0
Compare
020a308
to
3e04004
Compare
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Signed-off-by: Chris Hennick <[email protected]>
Fix errors Signed-off-by: Chris Hennick <[email protected]>
Motivation
Parsing in Blocks
Aesthetics
We currently parse the zip binary format with a series of e.g.
.read_u32_le()
calls. While this does work, I believe this approach obscures some of the control flow and makes it more difficult to correlate to the zip specification.Performance
Additionally, when reading a zip file from a non-buffered stream, I believe a series of individual reads may take longer than a single bulk read (I have not verified this).
Scanning with
memchr
We currently perform an extremely inefficient byte-by-byte loop to locate central directory records, e.g.:
zip2/src/spec.rs
Lines 66 to 80 in 103f1ec
We can lean on string scanning utilities like
memchr
in order to speed up this loop that occurs whenever we read a zip file!Implementation
*Block
structs declared with#[repr(packed)]
so we can extract metadata using bulk reads.Block
trait with several utility methods for the different types of "block" objects introduced in this PR.memchr
to scan for central directory magic numbers.Result
We get a very noticeable speedup in the new benchmarks!
Note that the benchmarks have been renamed upon review, so the output looks more like this: