Regression: opening large zip files is slow since 2.1.4 because the entire file is scanned #231

Closed
ttencate opened this issue Aug 8, 2024 · 15 comments · Fixed by #247
Labels: bug (Something isn't working)

Comments

ttencate commented Aug 8, 2024

Describe the bug

I have a 266 MB zip file, from which I only need to extract a 1 kB file. The rest of the files in the archive are irrelevant at this stage in the program.

However, opening the zip file using ZipArchive::new(file) takes about 7 seconds. It's a lot faster the second time round, because of Linux's filesystem cache.

I traced the root cause to Zip32CentralDirectoryEnd::find_and_parse, which locates the "end of central directory record" very quickly at the end of the file, but then keeps scanning backwards through the entire file to find another one.
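
(For reference: the EOCD record is 22 bytes plus an at-most-65535-byte comment, so it always starts within the last ~64 KB of the file, and a bounded backwards search over just that tail is enough to locate it. A minimal sketch of that idea, not the crate's actual implementation:)

use std::io::{Read, Seek, SeekFrom};

// Only the last 22 + 65535 bytes of the file can possibly contain the EOCD.
fn find_eocd_offset<R: Read + Seek>(reader: &mut R) -> std::io::Result<Option<u64>> {
    const EOCD_MAGIC: [u8; 4] = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
    const MAX_TAIL: u64 = 22 + 65535;

    let file_len = reader.seek(SeekFrom::End(0))?;
    let tail_len = file_len.min(MAX_TAIL);
    let tail_start = file_len - tail_len;

    reader.seek(SeekFrom::Start(tail_start))?;
    let mut tail = vec![0u8; tail_len as usize];
    reader.read_exact(&mut tail)?;

    // Walk backwards; the match closest to the end of the file is normally
    // the real EOCD, so nothing before the tail ever needs to be read.
    for i in (0..tail.len().saturating_sub(3)).rev() {
        if tail[i..i + 4] == EOCD_MAGIC {
            return Ok(Some(tail_start + i as u64));
        }
    }
    Ok(None)
}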

To Reproduce

Have a large zip file:

$ ls -lh archive.zip
-rw-r--r-- 1 thomas thomas 266M Aug  8 12:20 archive.zip
$ cargo build --release
$ echo 3 | sudo tee /proc/sys/vm/drop_caches  # Flush filesystem cache (Linux only)
$ time target/release/repro
real	0m6.714s
user	0m0.560s
sys	0m1.293s

Use this as the main program:

fn main() {
    let file = std::fs::File::open("archive.zip").unwrap();
    let archive = zip::ZipArchive::new(file).unwrap();
}
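
For context, the end goal is to read a single small entry. A rough sketch of that usage, reusing the some_file.txt name from the unzip example under "Expected behavior" below:

use std::io::Read;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open("archive.zip")?;
    // This is the slow call: it parses the central directory.
    let mut archive = zip::ZipArchive::new(file)?;
    // some_file.txt stands in for the ~1 kB entry that is actually needed.
    let mut entry = archive.by_name("some_file.txt")?;
    let mut contents = Vec::new();
    entry.read_to_end(&mut contents)?;
    println!("read {} bytes", contents.len());
    Ok(())
}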

Expected behavior

Extracting a single 1 kB file from a large archive should be possible quickly. unzip can do it:

$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip -l archive.zip
Archive:  archive.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
...
---------                     -------
1228949561                     9 files

real	0m0.012s
user	0m0.005s
sys	0m0.000s
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ time unzip archive.zip some_file.txt
Archive:  archive.zip
  inflating: some_file.txt           

real	0m0.012s
user	0m0.000s
sys	0m0.005s

Version

zip 2.1.6. This is also happening in 2.1.4, but not in 2.1.3. I think cb2d7ab or 9bf914d is the cause, but I haven't dug deeper.

ttencate added the bug label Aug 8, 2024

RisaI (Contributor) commented Aug 13, 2024

Can also confirm the regression. In our case, the difference is extreme (by an order of magnitude).

electimon commented Aug 15, 2024

Also encountering this when extracting a single tiny file from each of a large number of small zip files (>9000 of them). I thought I was going crazy; in my case, llseek seems to be taking up a lot of CPU time.

@newinnovations

I can also confirm. Extracting a 109 KB file from a 200 MB archive:

In 2.1.3:

       extract (zip)    0.2 ms (109 KB)

In 2.1.6:

       extract (zip)  675.5 ms (109 KB)

@newinnovations

In 2.2.0:

       extract (zip)  683.4 ms (109 KB)

RisaI (Contributor) commented Sep 26, 2024

The regression is indeed caused by cb2d7ab, as @ttencate suspected (confirmed by bisecting).

Looking at the code, the new way of searching for the magic bytes is terribly (I mean extremely, by many orders of magnitude) slower. The correct approach would be to lazily produce the potential CDE candidates, but the current code collects ALL of them before proceeding further. This is made even worse in the zip64 case, where there is a subsequent forward search for every candidate. The zip file is therefore read in its entirety from back to start, and then again from each CDE candidate to the end.

The back-to-start pass itself is extremely inefficient, because only the first valid CDE is ever needed. The backtracking then completely nukes performance when using network shares, for example (as was the case in our production environment). It also causes a 2000x slowdown in my local tests (44 MB ZIP with 30 files, NVMe SSD, 64 GB RAM, 32 threads).
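
For illustration, lazy candidate handling could look roughly like this (a sketch over an in-memory buffer; the function names and the pluggable validation step are made up, not the crate's actual code). The point is that find short-circuits, so the backwards walk stops at the first candidate that validates instead of collecting every match in the file:

const EOCD_MAGIC: [u8; 4] = [0x50, 0x4b, 0x05, 0x06];

// Lazily yield candidate EOCD offsets, nearest to the end of the data first.
fn eocd_candidates(data: &[u8]) -> impl Iterator<Item = usize> + '_ {
    (0..data.len().saturating_sub(3))
        .rev()
        .filter(move |&i| data[i..i + 4] == EOCD_MAGIC)
}

// The caller validates candidates one by one and stops at the first hit, so
// in the common case only the tail of the archive is ever examined.
fn find_first_valid_eocd(
    data: &[u8],
    is_valid: impl Fn(&[u8], usize) -> bool,
) -> Option<usize> {
    eocd_candidates(data).find(|&off| is_valid(data, off))
}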

@ttencate (Author)

Thank you for your excellent work @Pr0methean! I think the issue closure was triggered by your PR description, but @nickbabcock mentions that there's still a regression. Do you think we should reopen this one, or file a new one for that?

@richardstephens

When we hit this issue, we added a regression test for the performance problem, in case the dependency ever got bumped past the pin by mistake. It uses an instrumented cursor to track how many seeks are issued and how much data is read. The numbers we are seeing are as follows:

15 kB test file

  • zip2 v2.1.3: seek_count=25 read_bytes_count=1223
  • zip2 v2.2.1: seek_count=39 read_bytes_count=17095
  • zip2@33c71cc: seek_count=36 read_bytes_count=2739

4 MB test file

  • zip2 v2.1.3: seek_count=34 read_bytes_count=1476
  • zip2 v2.2.1: seek_count=69 read_bytes_count=140228
  • zip2@33c71cc: seek_count=51 read_bytes_count=2992

RisaI (Contributor) commented Dec 16, 2024

My recently merged PR only changed the EOCD detection algorithm, so the additional reads must be occurring after that step. I'll try to reproduce @richardstephens' approach and find the culprit.

richardstephens commented Dec 16, 2024

Here is the relevant code from our test case:

    use std::io::{Cursor, Read, Seek, SeekFrom};
    use std::sync::atomic::Ordering::SeqCst;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
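    // Cursor wrapper that counts seek calls and bytes read via shared atomic counters.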
    struct SeekCountingCursor {
        inner: Cursor<Vec<u8>>,
        seek_count: Arc<AtomicU64>,
        read_bytes_count: Arc<AtomicU64>,
    }

    impl SeekCountingCursor {
        pub fn new(data: &[u8]) -> SeekCountingCursor {
            SeekCountingCursor {
                inner: Cursor::new(data.to_vec()),
                seek_count: Default::default(),
                read_bytes_count: Default::default(),
            }
        }
        pub fn seek_counter(&self) -> Arc<AtomicU64> {
            self.seek_count.clone()
        }
        pub fn read_bytes_counter(&self) -> Arc<AtomicU64> {
            self.read_bytes_count.clone()
        }
    }
    impl Seek for SeekCountingCursor {
        fn seek(&mut self, pos: SeekFrom) -> std::io::Result<u64> {
            self.seek_count.fetch_add(1, Ordering::SeqCst);
            self.inner.seek(pos)
        }
    }

    impl Read for SeekCountingCursor {
        fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
            match self.inner.read(buf) {
                Ok(x) => {
                    self.read_bytes_count.fetch_add(x as u64, Ordering::SeqCst);
                    Ok(x)
                }
                e => e,
            }
        }
    }
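
Presumably the counters are then checked against a budget after exercising the archive. A sketch of how that assertion could look, reusing the imports and cursor type above (the test_zip_bytes fixture loader and the thresholds here are invented for illustration):

    #[test]
    fn opening_zip_stays_within_io_budget() {
        // test_zip_bytes() is a stand-in for however the fixture archive is built.
        let data: Vec<u8> = test_zip_bytes();
        let cursor = SeekCountingCursor::new(&data);
        let seeks = cursor.seek_counter();
        let read_bytes = cursor.read_bytes_counter();

        let mut archive = zip::ZipArchive::new(cursor).unwrap();
        let _first_entry = archive.by_index(0).unwrap();

        // Budgets are illustrative; in practice they would sit just above the
        // known-good numbers reported for the pinned version.
        assert!(seeks.load(SeqCst) < 100);
        assert!(read_bytes.load(SeqCst) < 10_000);
    }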

@sylvestre

https://github.com/mstange/samply is an excellent tool for perf profiling (sorry if you already know it).

@nickbabcock (Contributor)

For some more context, here is the callgrind profiling graph for an application benchmark (with #247 merged). 99% of the application's time is spent in ZipArchive::get_metadata; I believe the baseline should be something close to zero.

[callgrind call graph image]

I know it's not much to go on. I'll see how much more I can investigate.

richardstephens commented Dec 16, 2024

I did some more testing with a much larger file (7 GB):

v2.1.3: seek_count=39 read_bytes_count=1706
v2.2.1: seek_count=72 read_bytes_count=140412
33c71cc: seek_count=52 read_bytes_count=3180

I also benchmarked the "real-world" application of this code with the same 7 GB file:

v2.1.3:  real	0m 0.15s user	0m 0.10s sys	0m 0.04s
v2.2.1:  real	0m 0.16s user	0m 0.10s sys	0m 0.05s
33c71cc: real	0m 0.17s user	0m 0.09s sys	0m 0.07s

This left me scratching my head a bit, because I remember the performance being a lot worse when we first noticed the problem.

Then I decided to try 2.2.0:

seek_count=16387121 read_bytes_count=8390216616
real	0m 24.09s user	0m 13.19s sys	0m 10.87s

So it looks as if this was already mostly fixed in 2.2.1.

@Pr0methean (Member)

You may get a further improvement in 2.2.2.

@richardstephens

> You may get a further improvement in 2.2.2.

I was using current master (commit 33c71cc) as a stand-in for 2.2.2 because I can't see 2.2.2 on crates.io just yet. For our purposes, the performance with these fixes is now good enough, and I can drop the pin on 2.1.3.

Would there be any interest in a PR to upstream the test case, to catch future regressions?

Sytten commented Dec 17, 2024

@Pr0methean Do you have thoughts on @nickbabcock's concerns? Should this issue be kept open, or are you satisfied with the state of things?
