Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned #231
Comments
Can also confirm the regression. In our case, the difference is extreme (by an order of magnitude).
Also encountering this when extracting a single tiny file from each of many small zip files (>9000 of them); I thought I was going crazy. In my case, llseek seems to be taking up a lot of CPU time.
I can also confirm. Extracting a 109 KB file from a 200 MB archive:

In 2.1.3:

In 2.1.6:

In 2.2.0:
The regression is indeed caused by cb2d7ab, as @ttencate suspected (confirmed by bisecting). Looking at the code, the new way of looking for magic bytes is terribly (I mean extremely, by many orders of magnitude) slower. The correct way would be to lazy-load the potential CDE sections, but the current code collects ALL of the candidates before proceeding further. This is made even worse in the zip64 case, where there is a subsequent forward search for every candidate. The zip file is therefore read entirely from back to start, and then again from each CDE candidate to the end. The back-to-start pass itself is extremely inefficient, because only the first valid CDE is ever needed. The backtracking then completely nukes performance when using network shares, for example (as was the case in our production environment). It also causes a 2000x slowdown in my local tests (44 MB ZIP with 30 files, NVMe SSD, 64 GB RAM, 32 threads).
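For illustration, here is a minimal sketch (not the crate's actual code) of the lazy approach described above: scan backwards in small windows and return as soon as the candidate closest to EOF is found, instead of walking the whole file back to offset zero first. The window size and all names are mine; a real implementation would also have to validate the record and handle the variable-length archive comment and the zip64 locator.

```rust
use std::io::{self, Read, Seek, SeekFrom};

// "PK\x05\x06": the end-of-central-directory (EOCD) signature.
const EOCD_MAGIC: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];

/// Scan backwards from EOF in fixed-size windows and return the offset of
/// the EOCD signature closest to the end of the file, stopping at the first
/// hit instead of collecting every candidate in the whole file up front.
fn find_last_eocd<R: Read + Seek>(r: &mut R) -> io::Result<Option<u64>> {
    const WINDOW: u64 = 4096;
    let file_len = r.seek(SeekFrom::End(0))?;
    let mut window_end = file_len;
    while window_end >= EOCD_MAGIC.len() as u64 {
        let window_start = window_end.saturating_sub(WINDOW);
        let mut buf = vec![0u8; (window_end - window_start) as usize];
        r.seek(SeekFrom::Start(window_start))?;
        r.read_exact(&mut buf)?;
        // Search the window back-to-front so the candidate nearest EOF wins.
        if let Some(pos) = buf.windows(4).rposition(|w| w == &EOCD_MAGIC[..]) {
            return Ok(Some(window_start + pos as u64));
        }
        if window_start == 0 {
            break;
        }
        // Overlap the next window by 3 bytes so a signature straddling the
        // window boundary is still found.
        window_end = window_start + EOCD_MAGIC.len() as u64 - 1;
    }
    Ok(None)
}
```

For a well-formed archive with a short comment, this touches only the last window of the file, which is why the eager collect-everything pass is so much slower in comparison.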
Thank you for your excellent work @Pr0methean! I think the issue closure was triggered by your PR description, but @nickbabcock mentions that there's still a regression. Do you think we should reopen this one, or file a new one for that?
When we hit this issue, we implemented a regression test for the performance regression, in case the dependency got bumped by mistake. We implemented an instrumented cursor to track how many seeks were performed and how much data was read. The numbers we are seeing are as follows:

15 kB test file:

4 MB test file:
My recently merged PR only changed the EOCD detection algorithm, and there are additional reads occurring after that. I'll try to reproduce @richardstephens' approach and find the culprit.
Here is the relevant code from our test case:
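(The snippet itself didn't survive the page capture. Below is a rough, hypothetical sketch of the instrumented-cursor approach described above; all names are illustrative, not the commenter's actual code.)

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Hypothetical reconstruction of the instrumented cursor: a wrapper that
/// counts seeks and bytes read, so a test can assert that a dependency bump
/// didn't reintroduce the pathological I/O pattern.
struct CountingReader<R> {
    inner: R,
    seeks: u64,
    bytes_read: u64,
}

impl<R> CountingReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, seeks: 0, bytes_read: 0 }
    }
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        self.bytes_read += n as u64;
        Ok(n)
    }
}

impl<R: Seek> Seek for CountingReader<R> {
    fn seek(&mut self, pos: SeekFrom) -> Result<u64> {
        self.seeks += 1;
        self.inner.seek(pos)
    }
}
```

A regression test can then wrap an in-memory archive (e.g. a `std::io::Cursor`) in this reader, run `ZipArchive::new` plus a single extraction, and assert upper bounds on `seeks` and `bytes_read`.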
https://github.com/mstange/samply is an excellent tool for perf profiling (sorry, you probably already know).
For some more context, here is the callgrind profiling graph for an application benchmark (with #247 merged). 99% of the time in the application is spent in

I know it's not much to go off of. I'll see how much more I can investigate.
I did some more testing with a much larger file (7 GB):

I also benchmarked the "real-world" application of this code, with the same 7 GB file:

This left me scratching my head a bit, because I remember the performance being a lot worse when we first noticed the problem. Then I decided to try 2.2.0:

So it looks as if this was already mostly fixed in 2.2.1.
You may get a further improvement in 2.2.2.
I was using current master (commit 33c71cc) as a stand-in for 2.2.2, because I can't see 2.2.2 on crates.io just yet. For our purposes, the performance is now good enough with these fixes, and I can drop the pin on 2.1.3. Would there be any interest in a PR to upstream the test case, to catch future regressions?
@Pr0methean Do you have thoughts on @nickbabcock's concerns? Should this issue be kept open, or are you satisfied with the state of things?
Describe the bug
I have a 266 MB zip file, from which I only need to extract a 1 kB file. The rest of the files in the archive are irrelevant at this stage in the program.

However, opening the zip file using ZipArchive::new(file) takes about 7 seconds. It's a lot faster the second time round, because of Linux's filesystem cache. I traced the root cause to Zip32CentralDirectoryEnd::find_and_parse, which locates the "end of central directory record" very quickly at the end of the file, but then keeps scanning backwards through the entire file to find another one.

To Reproduce
Have a large zip file:
Use this as the main program:
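(The original program was lost in the page capture; below is a minimal sketch along the lines of the report. The archive path and member name are placeholders.)

```rust
use std::fs::File;
use std::io::Read;
use std::time::Instant;
use zip::ZipArchive;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: any multi-hundred-MB archive shows the problem.
    let file = File::open("large.zip")?;

    // On 2.1.4+ with a cold filesystem cache, this call alone is reported
    // to take about 7 seconds.
    let start = Instant::now();
    let mut archive = ZipArchive::new(file)?;
    println!("open took {:?}", start.elapsed());

    // Extract a single small member; the name here is a placeholder.
    let mut entry = archive.by_name("small-file.txt")?;
    let mut contents = Vec::new();
    entry.read_to_end(&mut contents)?;
    println!("extracted {} bytes", contents.len());
    Ok(())
}
```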
Expected behavior
Extracting a single 1 kB file from a large archive should be possible quickly. unzip can do it:

Version
zip 2.1.6. This is also happening in 2.1.4, but not in 2.1.3. I think cb2d7ab or 9bf914d is the cause, but I haven't dug deeper.