This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

Zero copy central directory parsing #91

Closed

Conversation

@srijs (Contributor) commented Dec 5, 2018

This changeset is based on #89, and aims to improve performance further by minimising allocations and memcpy calls when parsing central directory records.

In a nutshell, it reads the full central directory into a buffer, and then relies on the changes in #89 to incrementally parse the bytes. While parsing, filenames and comments are made to share the underlying data of the main buffer, avoiding any additional allocations and copies.

Overall this improves performance for the file_info example by more than 25% (for a ~60MB, 20K entries zip file from ~470ms down to ~340ms).
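The sharing described above can be illustrated with a small stdlib-only sketch. The PR itself builds on `bytes::Bytes`; here `Rc` stands in for it, and `StrBuf` is a simplified stand-in for the PR's type. The point is that each filename is a view into the one central-directory buffer, so no per-entry allocation or copy happens:

```rust
use std::rc::Rc;

// Simplified stand-in for the PR's Bytes-backed string type:
// a range into a shared, reference-counted buffer.
struct StrBuf {
    buf: Rc<Vec<u8>>, // the shared central-directory buffer
    start: usize,
    end: usize,
}

impl StrBuf {
    fn as_str(&self) -> &str {
        std::str::from_utf8(&self.buf[self.start..self.end]).unwrap()
    }
}

fn main() {
    // "PK.." stands in for record bytes preceding the filenames.
    let buf = Rc::new(b"PK..file_a.txtfile_b.txt".to_vec());
    // Two "filenames" share the same allocation; no memcpy occurs.
    let a = StrBuf { buf: Rc::clone(&buf), start: 4, end: 14 };
    let b = StrBuf { buf: Rc::clone(&buf), start: 14, end: 24 };
    assert_eq!(a.as_str(), "file_a.txt");
    assert_eq!(b.as_str(), "file_b.txt");
    // Three handles to one buffer: the original plus the two names.
    assert_eq!(Rc::strong_count(&buf), 3);
}
```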

@srijs (Contributor Author) commented Dec 5, 2018

(the build failures appear to be due to libflate failing to compile with Rust 1.20)

```rust
// old:
let version_made_by = reader.read_u16::<LittleEndian>()?;
let _version_to_extract = reader.read_u16::<LittleEndian>()?;
let flags = reader.read_u16::<LittleEndian>()?;
// new:
let version_made_by = bytes.get_u16_le();
```


What if the central directory header lied and the size it reports is too small? Then the central directory buffer might end in the middle of a record, and one of these get_* calls will panic. The original read_* would have read garbage (bits of the end-of-file record) or returned Err(UnexpectedEof).

Since you already gave it a Read impl and still use the original read_u32 on line 305, maybe keep the read_* here as well?

srijs (Contributor Author) replied:


That would be possible, but then we're still left with the calls to split_to, for which Read doesn't have an equivalent. Maybe the way to do it would be to check the length at the start of the method, and return an Err(UnexpectedEof) if it is too short.


> That would be possible, but then we're still left with the calls to split_to for which Read doesn't have an equivalent.

Yes, you'd still have to check .remaining() manually for the three variable-length fields before split_to()ing them.

> Maybe the way to do it would be to check the length at the start of the method

Yes, this would be the most efficient (avoids the bound checks), though you'd still need to check the three variable-length fields separately from the constant-sized header.
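The up-front length check agreed on above might look like the following sketch. The function name and return shape are illustrative, not the crate's actual API; per the ZIP format, the fixed portion of a central directory file header (signature through the length fields, before the variable-length name/extra/comment) is 46 bytes, with version-made-by at offset 4 and the flags at offset 8:

```rust
use std::io;

// Fixed-size portion of a central directory file header,
// before the three variable-length fields.
const FIXED_HEADER_LEN: usize = 46;

// Illustrative sketch: validate the buffer length once, so a
// truncated central directory yields Err(UnexpectedEof) instead
// of a panic from an out-of-bounds get_* call.
fn parse_record(bytes: &[u8]) -> io::Result<(u16, u16)> {
    if bytes.len() < FIXED_HEADER_LEN {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "truncated central directory record",
        ));
    }
    // Safe to index now: the fixed-size header is fully present.
    let version_made_by = u16::from_le_bytes([bytes[4], bytes[5]]);
    let flags = u16::from_le_bytes([bytes[8], bytes[9]]);
    Ok((version_made_by, flags))
}

fn main() {
    // Truncated input: an error, not a panic.
    assert!(parse_record(&[0u8; 10]).is_err());
    let mut rec = [0u8; 46];
    rec[4] = 20; // version-made-by = 20
    assert_eq!(parse_record(&rec).unwrap().0, 20);
}
```

The three variable-length fields would still need their own checks (against the name/extra/comment lengths read from the fixed header) before being split off.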

srijs (Contributor Author) replied:


Yep, cool, that's what I meant. I'll go and make these changes!

src/buffer.rs Outdated

```rust
impl From<String> for StrBuf {
    fn from(s: String) -> Self {
        let ByteBuf(bytes) = ByteBuf::from(s.into_bytes());
```


Why not create the bytes::Bytes directly from s.into_bytes() with .into(), instead of round-tripping it through ByteBuf::from? (Though it would likely get optimized to the same code anyway.)

srijs (Contributor Author) replied:


Good catch, this must be a leftover of a refactor I did. I'll get rid of it!
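The direct conversion suggested above can be sketched like this. `StrBuf` here is a stdlib stand-in (backed by `Vec<u8>` rather than the PR's `bytes::Bytes`), so `.into_bytes()` alone suffices; with `Bytes` the extra `.into()` the reviewer mentions would do the Vec-to-Bytes step:

```rust
// Stand-in for the PR's Bytes-backed type, to show the shape of the
// direct conversion without the intermediate ByteBuf round-trip.
struct StrBuf {
    bytes: Vec<u8>,
}

impl From<String> for StrBuf {
    fn from(s: String) -> Self {
        // Direct: String -> Vec<u8>, no intermediate ByteBuf.
        StrBuf { bytes: s.into_bytes() }
    }
}

fn main() {
    let s: StrBuf = String::from("name.txt").into();
    assert_eq!(s.bytes, b"name.txt");
}
```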

```rust
    }
}

impl Hash for StrBuf {
```


Why not #[derive] it? Is there a benefit to matching the str's hash rather than the Bytes's hash (which String<Bytes>'s Hash forwards to)?

srijs (Contributor Author) replied:


Because of the impl Borrow<str> for StrBuf, I need to make sure that the hash implementations behave identically between str and StrBuf.

Quoting the documentation for the Borrow trait:

> Further, when providing implementations for additional traits, it needs to be considered whether they should behave identical to those of the underlying type as a consequence of acting as a representation of that underlying type. Generic code typically uses Borrow&lt;T&gt; when it relies on the identical behavior of these additional trait implementations. These traits will likely appear as additional trait bounds.

Without this, it would be impossible to look up an entry from the names_map by &str, since the str and StrBuf would evaluate to different hashes, even if they represent the same sequence of chars.
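The Borrow/Hash contract being discussed can be demonstrated with a self-contained sketch. `StrBuf` here wraps a plain `String` rather than shared bytes, which is enough to show the key point: because `HashMap::get(&str)` hashes the `str` key, `StrBuf`'s `Hash` must delegate to `str`'s hash for the lookup to find anything:

```rust
use std::borrow::Borrow;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Simplified stand-in for the PR's StrBuf.
struct StrBuf(String);

impl Borrow<str> for StrBuf {
    fn borrow(&self) -> &str {
        &self.0
    }
}

impl Hash for StrBuf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Delegate to str's Hash so StrBuf("a") and "a" hash identically,
        // as the Borrow contract requires.
        self.0.as_str().hash(state)
    }
}

impl PartialEq for StrBuf {
    fn eq(&self, other: &Self) -> bool {
        self.0 == other.0
    }
}
impl Eq for StrBuf {}

fn main() {
    let mut names_map: HashMap<StrBuf, usize> = HashMap::new();
    names_map.insert(StrBuf("a.txt".to_string()), 0);
    // Lookup by &str works only because StrBuf's Hash and Eq agree
    // with str's through the Borrow<str> impl.
    assert_eq!(names_map.get("a.txt"), Some(&0));
}
```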


Ah, I didn't notice the Borrow<str> impl.

Arnavion and others added 5 commits December 15, 2018 16:22
Before this change, the entire central directory would be parsed to extract
the details of all files in the archive up-front.

With this change, the central directory is parsed on demand - only as many
headers are parsed as needed until the caller finds the file it wants.

This speeds up callers who only wish to pick out specific files from
the archive, especially for archives with many thousands of files or
archives hosted on slow storage media. In the case where the caller *did* want
every file in the archive, this performs no worse than the original code.
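The on-demand parsing this commit message describes can be sketched with a toy model. The function name `read_files_till` comes from the thread; the slice-of-names representation is an illustrative stand-in for real header parsing:

```rust
// Toy model of lazy central-directory parsing: `headers` stands in
// for the raw directory, `parsed` for the entries decoded so far.
fn read_files_till<'a>(
    headers: &[&'a str],
    parsed: &mut Vec<&'a str>,
    want: &str,
) -> Option<usize> {
    // Anything already parsed is found without touching the directory.
    if let Some(i) = parsed.iter().position(|n| *n == want) {
        return Some(i);
    }
    // Otherwise parse further headers, but stop at the requested one.
    for &name in &headers[parsed.len()..] {
        parsed.push(name);
        if name == want {
            return Some(parsed.len() - 1);
        }
    }
    None
}

fn main() {
    let headers = ["A", "B", "C", "D"];
    let mut parsed = Vec::new();
    assert_eq!(read_files_till(&headers, &mut parsed, "B"), Some(1));
    assert_eq!(parsed.len(), 2); // C and D were never parsed
    // A second lookup hits the cache; nothing is re-parsed.
    assert_eq!(read_files_till(&headers, &mut parsed, "A"), Some(0));
}
```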
The reader can be misaligned if it encounters a corrupt file header before it
finds the desired file.

For example, consider an archive that contains files A, B, C and D, and the user
requests file D (say via `by_index(3)`). Let's say the reader successfully
parses the headers of A and B, but fails when parsing C because it's corrupt.
In this case, we want the user to receive an error, but we also want to ensure
that future calls to `by_index` or `by_name` don't parse A and B's headers
again.

One way to do this is to preserve the end position of B's header so that future
calls to `read_files_till` resume from it. However since we've already parsed
B's header and know C's header failed to parse, we can simply consider all
headers that haven't already been parsed to be unreachable.

This commit marks the archive as poisoned if `read_files_till` fails with any
error, so that only headers which have already been successfully parsed will be
considered when getting files from the archive. If a file is requested that
hasn't been parsed, i.e. its header lies beyond the known corrupted header,
then the function fails immediately.
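The poisoning scheme described in this commit message can be sketched as follows. Field and method names mirror the discussion (`by_index`), but the structure is an illustrative toy, not the crate's actual type:

```rust
// Toy model of a poisoned archive: entries parsed before the corrupt
// header stay usable; anything beyond it fails fast.
struct Archive {
    parsed: Vec<String>, // headers successfully parsed so far, in order
    poisoned: bool,      // set when a header failed to parse
}

impl Archive {
    fn by_index(&self, i: usize) -> Result<&str, &'static str> {
        if i < self.parsed.len() {
            // Already parsed before the failure; always available.
            return Ok(self.parsed[i].as_str());
        }
        if self.poisoned {
            return Err("archive poisoned by earlier corrupt header");
        }
        // (A real implementation would resume on-demand parsing here.)
        Err("not yet parsed")
    }
}

fn main() {
    // A and B parsed fine; C was corrupt, so the archive is poisoned.
    let a = Archive {
        parsed: vec!["A".into(), "B".into()],
        poisoned: true,
    };
    assert_eq!(a.by_index(0), Ok("A")); // parsed before the corruption
    assert!(a.by_index(3).is_err());    // beyond the corrupt header: fails fast
}
```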
@srijs srijs force-pushed the zero-copy-central-directory-parsing branch from c4eeb6a to 8ad6bd4 on December 15, 2018 05:25
@srijs (Contributor Author) commented Dec 15, 2018

Rebased on current master, which will hopefully fix the CI errors.

@Plecra (Member) left a comment


LGTM, but we'll need to justify the extra complexity (and dependency). I want to benchmark the original changes in #89 to see the difference.

@srijs (Contributor Author) commented Jun 26, 2020 via email

@Plecra Plecra mentioned this pull request Jun 26, 2020
@Plecra (Member) commented Jun 26, 2020

Thanks for the contribution ❤️ I'm sure it will factor into the more streaming-oriented design we're working on.

@Plecra Plecra closed this Jun 26, 2020
Pr0methean added a commit to cosmicexplorer/zip that referenced this pull request May 4, 2024
perf: use indexmap in read::zip_archive::Shared instead of a separate vec and hashmap