This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

Zero copy central directory parsing #91

Closed

Conversation

@srijs (Contributor) commented Dec 5, 2018

This changeset is based on #89, and aims to improve performance further by minimising allocations and memcpy calls when parsing central directory records.

In a nutshell, it reads the full central directory into a buffer, and then relies on the changes in #89 to incrementally parse the bytes. While parsing, filenames and comments are made to share the underlying data of the main buffer, avoiding any additional allocations and copies.

Overall this improves performance for the file_info example by more than 25% (for a ~60MB, 20K entries zip file from ~470ms down to ~340ms).
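The sharing described above can be illustrated with a small stdlib-only sketch. The PR itself builds on `bytes::Bytes`; here `Rc` stands in for it, and `StrBuf` is a simplified stand-in for the PR's type. The point is that each filename is a view into the one central-directory buffer, so no per-entry allocation or copy happens:

```rust
use std::rc::Rc;

// Simplified stand-in for the PR's Bytes-backed string type:
// a range into a shared, reference-counted buffer.
struct StrBuf {
    buf: Rc<Vec<u8>>, // the shared central-directory buffer
    start: usize,
    end: usize,
}

impl StrBuf {
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.buf[self.start..self.end]).unwrap()
    }
}

fn main() {
    // "PK.." stands in for record bytes preceding the filenames.
    let buf = Rc::new(b"PK..file_a.txtfile_b.txt".to_vec());
    // Two "filenames" share the same allocation; no memcpy occurs.
    let a = StrBuf { buf: Rc::clone(&buf), start: 4, end: 14 };
    let b = StrBuf { buf: Rc::clone(&buf), start: 14, end: 24 };
    assert_eq!(a.as_str(), "file_a.txt");
    assert_eq!(b.as_str(), "file_b.txt");
    // Three handles to one buffer: the original plus the two names.
    assert_eq!(Rc::strong_count(&buf), 3);
}
```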

@srijs (Contributor Author) commented Dec 5, 2018

(the build failures appear to be due to libflate failing to compile with Rust 1.20)

```rust
// old:
let version_made_by = reader.read_u16::<LittleEndian>()?;
let _version_to_extract = reader.read_u16::<LittleEndian>()?;
let flags = reader.read_u16::<LittleEndian>()?;
// new:
let version_made_by = bytes.get_u16_le();
```


What if the central directory header lied and the size it reports is too small? Then the central directory buffer might end in the middle of a record, and one of these get_* calls will panic. The original read_* would have read garbage (bits of the end-of-file record) or returned Err(UnexpectedEof).

Since you already gave it a Read impl and still use the original read_u32 on line 305, maybe keep the read_* here as well?

srijs (Contributor Author) replied:


That would be possible, but then we're still left with the calls to split_to, for which Read doesn't have an equivalent. Maybe the way to do it would be to check the length at the start of the method, and return an Err(UnexpectedEof) if it is too short.


> That would be possible, but then we're still left with the calls to split_to for which Read doesn't have an equivalent.

Yes, you'd still have to check .remaining() manually for the three variable-length fields before split_to()ing them.

> Maybe the way to do it would be to check the length at the start of the method

Yes, this would be the most efficient (avoids the bound checks), though you'd still need to check the three variable-length fields separately from the constant-sized header.
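The up-front length check agreed on above might look like the following sketch. The function name and return shape are illustrative, not the crate's actual API; per the ZIP format, the fixed portion of a central directory file header (signature through the length fields, before the variable-length name/extra/comment) is 46 bytes, with version-made-by at offset 4 and the flags at offset 8:

```rust
use std::io;

// Fixed-size portion of a central directory file header,
// before the three variable-length fields.
const FIXED_HEADER_LEN: usize = 46;

// Illustrative sketch: validate the buffer length once, so a
// truncated central directory yields Err(UnexpectedEof) instead
// of a panic from an out-of-bounds get_* call.
fn parse_record(bytes: &[u8]) -> io::Result<(u16, u16)> {
    if bytes.len() < FIXED_HEADER_LEN {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "truncated central directory record",
        ));
    }
    // Safe to index now: the fixed-size header is fully present.
    let version_made_by = u16::from_le_bytes([bytes[4], bytes[5]]);
    let flags = u16::from_le_bytes([bytes[8], bytes[9]]);
    Ok((version_made_by, flags))
}

fn main() {
    // Truncated input: an error, not a panic.
    assert!(parse_record(&[0u8; 10]).is_err());
    let mut rec = [0u8; 46];
    rec[4] = 20; // version-made-by = 20
    assert_eq!(parse_record(&rec).unwrap().0, 20);
}
```

The three variable-length fields would still need their own checks (against the name/extra/comment lengths read from the fixed header) before being split off.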

srijs (Contributor Author) replied:


Yep, cool, that's what I meant. I'll go and make these changes!

src/buffer.rs Outdated

```rust
impl From<String> for StrBuf {
    fn from(s: String) -> Self {
        let ByteBuf(bytes) = ByteBuf::from(s.into_bytes());
```


Why not create the bytes::Bytes directly from s.into_bytes() with .into(), instead of round-tripping it through ByteBuf::from? (Though it would likely get optimized to the same code anyway.)

srijs (Contributor Author) replied:


Good catch, this must be a leftover of a refactor I did. I'll get rid of it!
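The direct conversion suggested above can be sketched like this. `StrBuf` here is a stdlib stand-in (backed by `Vec<u8>` rather than the PR's `bytes::Bytes`), so `.into_bytes()` alone suffices; with `Bytes` the extra `.into()` the reviewer mentions would do the Vec-to-Bytes step:

```rust
// Stand-in for the PR's Bytes-backed type, to show the shape of the
// direct conversion without the intermediate ByteBuf round-trip.
struct StrBuf {
    bytes: Vec<u8>,
}

impl From<String> for StrBuf {
    fn from(s: String) -> Self {
        // Direct: String -> Vec<u8>, no intermediate ByteBuf.
        StrBuf { bytes: s.into_bytes() }
    }
}

fn main() {
    let s: StrBuf = String::from("name.txt").into();
    assert_eq!(s.bytes, b"name.txt");
}
```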

```rust
    }
}

impl Hash for StrBuf {
```


Why not #[derive] it? Is there a benefit to matching the str's hash rather than the Bytes's hash (which String<Bytes>'s Hash forwards to)?

srijs (Contributor Author) replied:


Because of the impl Borrow<str> for StrBuf, I need to make sure that the hash implementations behave identically between str and StrBuf.

Quoting the documentation for the Borrow trait:

> Further, when providing implementations for additional traits, it needs to be considered whether they should behave identical to those of the underlying type as a consequence of acting as a representation of that underlying type. Generic code typically uses Borrow&lt;T&gt; when it relies on the identical behavior of these additional trait implementations. These traits will likely appear as additional trait bounds.

Without this, it would be impossible to look up an entry from the names_map by &str, since the str and StrBuf would evaluate to different hashes, even if they represent the same sequence of chars.
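The Borrow/Hash contract being discussed can be demonstrated with a self-contained sketch. `StrBuf` here wraps a plain `String` rather than shared bytes, which is enough to show the key point: because `HashMap::get(&str)` hashes the `str` key, `StrBuf`'s `Hash` must delegate to `str`'s hash for the lookup to find anything:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Simplified stand-in for the PR's StrBuf.
struct StrBuf(String);

impl Borrow<str> for StrBuf {
    fn borrow(&self) -> &str {
        &self.0
    }
}

impl Hash for StrBuf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Delegate to str's Hash so StrBuf("a") and "a" hash identically,
        // as the Borrow contract requires.
        self.0.as_str().hash(state)
    }
}

impl PartialEq for StrBuf {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
}
impl Eq for StrBuf {}

fn main() {
    let mut names_map: HashMap<StrBuf, usize> = HashMap::new();
    names_map.insert(StrBuf("a.txt".to_string()), 0);
    // Lookup by &str works only because StrBuf's Hash and Eq agree
    // with str's through the Borrow<str> impl.
    assert_eq!(names_map.get("a.txt"), Some(&0));
}
```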


Ah, I didn't notice the Borrow<str> impl.

Arnavion and others added 5 commits December 15, 2018 16:22
Before this change, the entire central directory would be parsed to extract
the details of all files in the archive up-front.

With this change, the central directory is parsed on demand - only as many
headers are parsed as needed until the caller finds the file it wants.

This speeds up callers who only wish to pick out specific files from
the archive, especially for archives with many thousands of files or
archives hosted on slow storage media. In the case where the caller *did* want
every file in the archive, this performs no worse than the original code.
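The on-demand parsing this commit message describes can be sketched with a toy model. The function name `read_files_till` comes from the thread; the slice-of-names representation is an illustrative stand-in for real header parsing:

```rust
// Toy model of lazy central-directory parsing: `headers` stands in
// for the raw directory, `parsed` for the entries decoded so far.
fn read_files_till<'a>(
    headers: &[&'a str],
    parsed: &mut Vec<&'a str>,
    want: &str,
) -> Option<usize> {
    // Anything already parsed is found without touching the directory.
    if let Some(i) = parsed.iter().position(|n| *n == want) {
        return Some(i);
    }
    // Otherwise parse further headers, but stop at the requested one.
    for &name in &headers[parsed.len()..] {
        parsed.push(name);
        if name == want {
            return Some(parsed.len() - 1);
        }
    }
    None
}

fn main() {
    let headers = ["A", "B", "C", "D"];
    let mut parsed = Vec::new();
    assert_eq!(read_files_till(&headers, &mut parsed, "B"), Some(1));
    assert_eq!(parsed.len(), 2); // C and D were never parsed
    // A second lookup hits the cache; nothing is re-parsed.
    assert_eq!(read_files_till(&headers, &mut parsed, "A"), Some(0));
}
```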
The reader can be misaligned if it encounters a corrupt file header before it
finds the desired file.

For example, consider an archive that contains files A, B, C and D, and the user
requests file D (say via `by_index(3)`). Let's say the reader successfully
parses the headers of A and B, but fails when parsing C because it's corrupt.
In this case, we want the user to receive an error, but we also want to ensure
that future calls to `by_index` or `by_name` don't parse A and B's headers
again.

One way to do this is to preserve the end position of B's header so that future
calls to `read_files_till` resume from it. However since we've already parsed
B's header and know C's header failed to parse, we can simply consider all
headers that haven't already been parsed to be unreachable.

This commit marks the archive as poisoned if `read_files_till` fails with any
error, so that only headers which have already been successfully parsed will be
considered when getting files from the archive. If a file is requested that
hasn't been parsed, i.e. its header lies beyond the known corrupted header,
then the function fails immediately.
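The poisoning scheme described in this commit message can be sketched as follows. Field and method names mirror the discussion (`by_index`), but the structure is an illustrative toy, not the crate's actual type:

```rust
// Toy model of a poisoned archive: entries parsed before the corrupt
// header stay usable; anything beyond it fails fast.
struct Archive {
    parsed: Vec<String>, // headers successfully parsed so far, in order
    poisoned: bool,      // set when a header failed to parse
}

impl Archive {
    fn by_index(&self, i: usize) -> Result<&str, &'static str> {
        if i < self.parsed.len() {
            // Already parsed before the failure; always available.
            return Ok(self.parsed[i].as_str());
        }
        if self.poisoned {
            return Err("archive poisoned by earlier corrupt header");
        }
        // (A real implementation would resume on-demand parsing here.)
        Err("not yet parsed")
    }
}

fn main() {
    // A and B parsed fine; C was corrupt, so the archive is poisoned.
    let a = Archive {
        parsed: vec!["A".into(), "B".into()],
        poisoned: true,
    };
    assert_eq!(a.by_index(0), Ok("A")); // parsed before the corruption
    assert!(a.by_index(3).is_err());    // beyond the corrupt header: fails fast
}
```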
@srijs srijs force-pushed the zero-copy-central-directory-parsing branch from c4eeb6a to 8ad6bd4 on December 15, 2018 05:25
@srijs (Contributor Author) commented Dec 15, 2018

Rebased on current master, which will hopefully fix the CI errors.

@Plecra (Member) left a comment


LGTM, but we'll need to justify the extra complexity (and dependency). I want to benchmark the original changes in #89 to see the difference.

@srijs (Contributor Author) commented Jun 26, 2020 via email

@Plecra Plecra mentioned this pull request Jun 26, 2020
@Plecra (Member) commented Jun 26, 2020

Thanks for the contribution ❤️ I'm sure it will factor into the more streaming-oriented design we're working on.

@Plecra Plecra closed this Jun 26, 2020
Pr0methean added a commit to cosmicexplorer/zip that referenced this pull request May 4, 2024
perf: use indexmap in read::zip_archive::Shared instead of a separate vec and hashmap