Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned #231

Closed
ttencate opened this issue Aug 8, 2024 · 15 comments · Fixed by #247
Labels: bug (Something isn't working)

Comments

ttencate commented Aug 8, 2024

Describe the bug

I have a 266 MB zip file, from which I only need to extract a 1 kB file. The rest of the files in the archive are irrelevant at this stage in the program.

However, opening the zip file using ZipArchive::new(file) takes about 7 seconds. It's a lot faster the second time round, because of Linux's filesystem cache.

I traced the root cause to Zip32CentralDirectoryEnd::find_and_parse, which locates the "end of central directory record" very quickly at the end of the file, but then keeps scanning backwards through the entire file to find another one.
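
(For reference: the EOCD record is 22 bytes plus an at-most-65535-byte comment, so it always starts within the last ~64 KB of the file, and a bounded backwards search over just that tail is enough to locate it. A minimal sketch of that idea, not the crate's actual implementation:)

use std::io::{Read, Seek, SeekFrom};

// Only the last 22 + 65535 bytes of the file can possibly contain the EOCD.
fn find_eocd_offset<R: Read + Seek>(reader: &mut R) -> std::io::Result<Option<u64>> {
    const EOCD_MAGIC: [u8; 4] = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
    const MAX_TAIL: u64 = 22 + 65535;

    let file_len = reader.seek(SeekFrom::End(0))?;
    let tail_len = file_len.min(MAX_TAIL);
    let tail_start = file_len - tail_len;

    reader.seek(SeekFrom::Start(tail_start))?;
    let mut tail = vec![0u8; tail_len as usize];
    reader.read_exact(&mut tail)?;

    // Walk backwards; the match closest to the end of the file is normally
    // the real EOCD, so nothing before the tail ever needs to be read.
    for i in (0..tail.len().saturating_sub(3)).rev() {
        if tail[i..i + 4] == EOCD_MAGIC {
            return Ok(Some(tail_start + i as u64));
        }
    }
    Ok(None)
}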

To Reproduce

Have a large zip file:

$ ls -lh archive.zip
-rw-r--r-- 1 thomas thomas 266M Aug  8 12:20 archive.zip
$ cargo build --release
$ echo 3 | sudo tee /proc/sys/vm/drop_caches  # Flush filesystem cache (Linux only)
$ time target/release/repro
real	0m6.714s
user	0m0.560s
sys	0m1.293s

Use this as the main program:

fn main() {
    let file = std::fs::File::open("archive.zip").unwrap();
    let archive = zip::ZipArchive::new(file).unwrap();
}
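
For context, the end goal is to read a single small entry. A rough sketch of that usage, reusing the some_file.txt name from the unzip example under "Expected behavior" below:

use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open("archive.zip")?;
    // This is the slow call: it parses the central directory.
    let mut archive = zip::ZipArchive::new(file)?;
    // some_file.txt stands in for the ~1 kB entry that is actually needed.
    let mut entry = archive.by_name("some_file.txt")?;
    let mut contents = Vec::new();
    entry.read_to_end(&mut contents)?;
    println!("read {} bytes", contents.len());
    Ok(())
}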

Expected behavior

Extracting a single 1 kB file from a large archive should be possible quickly. unzip can do it:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip -l archive.zip
Archive:  archive.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
...
---------                     -------
1228949561                     9 files

real	0m0.012s
user	0m0.005s
sys	0m0.000s
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip archive.zip some_file.txt
Archive:  archive.zip
  inflating: some_file.txt           

real	0m0.012s
user	0m0.000s
sys	0m0.005s

Version

zip 2.1.6. This is also happening in 2.1.4, but not in 2.1.3. I think cb2d7ab or 9bf914d is the cause, but I haven't dug deeper.

ttencate added the bug label Aug 8, 2024

RisaI (Contributor) commented Aug 13, 2024

Can also confirm the regression. In our case, the difference is extreme (by an order of magnitude).

electimon commented Aug 15, 2024

Also encountering this when extracting a single tiny file from each of a large number of small zip files (>9000 of them). I thought I was going crazy; in my case, llseek seems to be taking up a lot of CPU time.

@newinnovations

I can also confirm. Extracting a 109 KB file from a 200 MB archive:

In 2.1.3:

       extract (zip)    0.2 ms (109 KB)

In 2.1.6:

       extract (zip)  675.5 ms (109 KB)

@newinnovations

In 2.2.0:

       extract (zip)  683.4 ms (109 KB)

RisaI (Contributor) commented Sep 26, 2024

The regression is indeed caused by cb2d7ab, as @ttencate suspected (confirmed by bisecting).

Looking at the code, the new way of searching for the magic bytes is terribly (I mean extremely, by many orders of magnitude) slower. The correct approach would be to lazily produce the potential CDE candidates, but the current code collects ALL of them before proceeding further. This is made even worse in the zip64 case, where there is a subsequent forward search for every candidate. The zip file is therefore read in its entirety from back to start, and then again from each CDE candidate to the end.

The back-to-start pass itself is extremely inefficient, because only the first valid CDE is ever needed. The backtracking then completely nukes performance when using network shares, for example (as was the case in our production environment). It also causes a 2000x slowdown in my local tests (44 MB ZIP with 30 files, NVMe SSD, 64 GB RAM, 32 threads).
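
For illustration, lazy candidate handling could look roughly like this (a sketch over an in-memory buffer; the function names and the pluggable validation step are made up, not the crate's actual code). The point is that find short-circuits, so the backwards walk stops at the first candidate that validates instead of collecting every match in the file:

const EOCD_MAGIC: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

// Lazily yield candidate EOCD offsets, nearest to the end of the data first.
fn eocd_candidates(data: &[u8]) -> impl Iterator<Item = usize> + '_ {
    (0..data.len().saturating_sub(3))
        .rev()
        .filter(move |&i| data[i..i + 4] == EOCD_MAGIC)
}

// The caller validates candidates one by one and stops at the first hit, so
// in the common case only the tail of the archive is ever examined.
fn find_first_valid_eocd(
    data: &[u8],
    is_valid: impl Fn(&[u8], usize) -> bool,
) -> Option<usize> {
    eocd_candidates(data).find(|&off| is_valid(data, off))
}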

@ttencate (Author)

Thank you for your excellent work @Pr0methean! I think the issue closure was triggered by your PR description, but @nickbabcock mentions that there's still a regression. Do you think we should reopen this one, or file a new one for that?

@richardstephens

When we hit this issue, we added a regression test for the performance problem, in case the dependency ever got bumped past the pin by mistake. It uses an instrumented cursor to track how many seeks are issued and how much data is read. The numbers we are seeing are as follows:

15 kB test file

  • zip2 v2.1.3: seek_count=25 read_bytes_count=1223
  • zip2 v2.2.1: seek_count=39 read_bytes_count=17095
  • zip2@33c71cc: seek_count=36 read_bytes_count=2739

4 MB test file

  • zip2 v2.1.3: seek_count=34 read_bytes_count=1476
  • zip2 v2.2.1: seek_count=69 read_bytes_count=140228
  • zip2@33c71cc: seek_count=51 read_bytes_count=2992

RisaI (Contributor) commented Dec 16, 2024

My recently merged PR only changed the EOCD detection algorithm, so the additional reads must be occurring after that step. I'll try to reproduce @richardstephens' approach and find the culprit.

richardstephens commented Dec 16, 2024

Here is the relevant code from our test case:

    use std::io::{Cursor, Read, Seek, SeekFrom};
    use std::sync::atomic::Ordering::SeqCst;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
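    // Cursor wrapper that counts seek calls and bytes read via shared atomic counters.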
    struct SeekCountingCursor {
        inner: Cursor<Vec<u8>>,
        seek_count: Arc<AtomicU64>,
        read_bytes_count: Arc<AtomicU64>,
    }

    impl SeekCountingCursor {
        pub fn new(data: &[u8]) -> SeekCountingCursor {
            SeekCountingCursor {
                inner: Cursor::new(data.to_vec()),
                seek_count: Default::default(),
                read_bytes_count: Default::default(),
            }
        }
        pub fn seek_counter(&self) -> Arc<AtomicU64> {
            self.seek_count.clone()
        }
        pub fn read_bytes_counter(&self) -> Arc<AtomicU64> {
            self.read_bytes_count.clone()
        }
    }
    impl Seek for SeekCountingCursor {
        fn seek(&mut self, pos: SeekFrom) -> std::io::Result<u64> {
            self.seek_count.fetch_add(1, Ordering::SeqCst);
            self.inner.seek(pos)
        }
    }

    impl Read for SeekCountingCursor {
        fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
            match self.inner.read(buf) {
                Ok(x) => {
                    self.read_bytes_count.fetch_add(x as u64, Ordering::SeqCst);
                    Ok(x)
                }
                e => e,
            }
        }
    }
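
Presumably the counters are then checked against a budget after exercising the archive. A sketch of how that assertion could look, reusing the imports and cursor type above (the test_zip_bytes fixture loader and the thresholds here are invented for illustration):

    #[test]
    fn opening_zip_stays_within_io_budget() {
        // test_zip_bytes() is a stand-in for however the fixture archive is built.
        let data: Vec<u8> = test_zip_bytes();
        let cursor = SeekCountingCursor::new(&data);
        let seeks = cursor.seek_counter();
        let read_bytes = cursor.read_bytes_counter();

        let mut archive = zip::ZipArchive::new(cursor).unwrap();
        let _first_entry = archive.by_index(0).unwrap();

        // Budgets are illustrative; in practice they would sit just above the
        // known-good numbers reported for the pinned version.
        assert!(seeks.load(SeqCst) < 100);
        assert!(read_bytes.load(SeqCst) < 10_000);
    }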

@sylvestre

https://github.com/mstange/samply is an excellent tool for perf profiling (sorry if you already know it).

@nickbabcock (Contributor)

For some more context, here is the callgrind profiling graph for an application benchmark (with #247 merged). 99% of the application's time is spent in ZipArchive::get_metadata; I believe the baseline should be something close to zero.

[callgrind call graph image]

I know it's not much to go on. I'll see how much more I can investigate.

richardstephens commented Dec 16, 2024

I did some more testing with a much larger file (7 GB):

v2.1.3: seek_count=39 read_bytes_count=1706
v2.2.1: seek_count=72 read_bytes_count=140412
33c71cc: seek_count=52 read_bytes_count=3180

I also benchmarked the "real-world" application of this code with the same 7 GB file:

v2.1.3:  real	0m 0.15s user	0m 0.10s sys	0m 0.04s
v2.2.1:  real	0m 0.16s user	0m 0.10s sys	0m 0.05s
33c71cc: real	0m 0.17s user	0m 0.09s sys	0m 0.07s

This left me scratching my head a bit, because I remember the performance being a lot worse when we first noticed the problem.

Then I decided to try 2.2.0:

seek_count=16387121 read_bytes_count=8390216616
real	0m 24.09s user	0m 13.19s sys	0m 10.87s

So it looks as if this was already mostly fixed in 2.2.1.

@Pr0methean (Member)

You may get a further improvement in 2.2.2.

@richardstephens

> You may get a further improvement in 2.2.2.

I was using current master (commit 33c71cc) as a stand-in for 2.2.2 because I can't see 2.2.2 on crates.io just yet. For our purposes, the performance with these fixes is now good enough, and I can drop the pin on 2.1.3.

Would there be any interest in a PR to upstream the test case, to catch future regressions?

Sytten commented Dec 17, 2024

@Pr0methean Do you have thoughts on @nickbabcock's concerns? Should this issue be kept open, or are you satisfied with the state of things?
