Conversation
(the build failures appear to be due to …)

```rust
let version_made_by = reader.read_u16::<LittleEndian>()?;
let _version_to_extract = reader.read_u16::<LittleEndian>()?;
let flags = reader.read_u16::<LittleEndian>()?;
let version_made_by = bytes.get_u16_le();
```
What if the central directory header lied and the size it reports is too small? Then the central directory buffer might end in the middle of a record, and one of these `get_*` calls will panic. The original `read_*` would have read garbage (bits of the end-of-file record) or returned `Err(UnexpectedEof)`.

Since you already gave it a `Read` impl and still use the original `read_u32` on line 305, maybe keep the `read_*` here as well?
That would be possible, but then we're still left with the calls to `split_to`, for which `Read` doesn't have an equivalent. Maybe the way to do it would be to check the length at the start of the method, and return an `Err(UnexpectedEof)` if it is too short.
> That would be possible, but then we're still left with the calls to `split_to`, for which `Read` doesn't have an equivalent.

Yes, you'd still have to check `.remaining()` manually for the three variable-length fields before `split_to()`-ing them.

> Maybe the way to do it would be to check the length at the start of the method

Yes, this would be the most efficient (it avoids the bounds checks), though you'd still need to check the three variable-length fields separately from the constant-sized header.
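The check-the-length-up-front approach discussed above could be sketched roughly like this. This is a std-only illustration with a hypothetical record layout (`parse_record`, the field names, and `FIXED_HEADER_LEN` are all invented for the example; the real code works on a `bytes` buffer via the `Buf` trait instead of a plain slice):

```rust
use std::io::{Error, ErrorKind};

// Hypothetical record layout, for illustration only: a 4-byte signature,
// two u16 length fields, then two variable-length payloads (name, comment).
const FIXED_HEADER_LEN: usize = 8;

fn parse_record(buf: &[u8]) -> Result<(&[u8], &[u8]), Error> {
    // One up-front check covers every fixed-size field, so none of the
    // per-field reads below can run past the end of the buffer.
    if buf.len() < FIXED_HEADER_LEN {
        return Err(Error::new(ErrorKind::UnexpectedEof, "truncated header"));
    }
    let _signature = u32::from_le_bytes(buf[0..4].try_into().unwrap());
    let name_len = u16::from_le_bytes(buf[4..6].try_into().unwrap()) as usize;
    let comment_len = u16::from_le_bytes(buf[6..8].try_into().unwrap()) as usize;

    // The variable-length fields still need their own check, because their
    // sizes are only known once the fixed header has been parsed.
    let rest = &buf[FIXED_HEADER_LEN..];
    if rest.len() < name_len + comment_len {
        return Err(Error::new(ErrorKind::UnexpectedEof, "truncated payload"));
    }
    Ok((&rest[..name_len], &rest[name_len..name_len + comment_len]))
}
```

A truncated buffer then surfaces as `Err(UnexpectedEof)` rather than a panic, matching the behavior of the original `read_*`-based code.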
Yep, cool, that's what I meant. I'll go and make these changes!
src/buffer.rs (outdated)
```rust
impl From<String> for StrBuf {
    fn from(s: String) -> Self {
        let ByteBuf(bytes) = ByteBuf::from(s.into_bytes());
```
Why not create the `bytes::Bytes` directly from `s.into_bytes()` with `.into()`, instead of round-tripping it through `ByteBuf::from`? (Though it would get optimized the same anyway.)
Good catch, this must be a leftover of a refactor I did. I'll get rid of it!
```rust
    }
}

impl Hash for StrBuf {
```
Why not `#[derive]` it? Is there a benefit to matching the `str`'s hash rather than the `Bytes`'s hash (which `String<Bytes>`'s `Hash` forwards to)?
Because of the `impl Borrow<str> for StrBuf`, I need to make sure that the hash implementations behave identically between `str` and `StrBuf`.

Quoting the documentation for the `Borrow` trait:

> Further, when providing implementations for additional traits, it needs to be considered whether they should behave identically to those of the underlying type as a consequence of acting as a representation of that underlying type. Generic code typically uses `Borrow<T>` when it relies on the identical behavior of these additional trait implementations. These traits will likely appear as additional trait bounds.

Without this, it would be impossible to look up an entry in the `names_map` by `&str`, since the `str` and `StrBuf` would evaluate to different hashes, even if they represent the same sequence of chars.
Ah, I didn't notice the `Borrow<str>` impl.
Before this change, the entire central directory would be parsed to extract the details of all files in the archive up front. With this change, the central directory is parsed on demand: only enough headers are parsed until the caller finds the file it wanted. This speeds up callers who only wish to pick out specific files from the archive, especially for archives with many thousands of files or archives hosted on slow storage media. In the case where the caller *did* want every file in the archive, this performs no worse than the original code.
The reader can be misaligned if it encounters a corrupt file header before it finds the desired file. For example, consider an archive that contains files A, B, C and D, and the user requests file D (say via `by_index(3)`). Let's say the reader successfully parses the headers of A and B, but fails when parsing C because it's corrupt. In this case, we want the user to receive an error, but we also want to ensure that future calls to `by_index` or `by_name` don't parse A and B's headers again. One way to do this is to preserve the end position of B's header so that future calls to `read_files_till` resume from it. However, since we've already parsed B's header and know C's header failed to parse, we can simply consider all headers that haven't already been parsed to be unreachable. This commit marks the archive as poisoned if `read_files_till` fails with any error, so that only headers which have already been successfully parsed will be considered when getting files from the archive. If a file is requested that hasn't been parsed, i.e. its header lies beyond the known corrupted header, then the function fails immediately.
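The poisoning scheme described above could be sketched like this (all names are hypothetical; a header is modeled as `Option<String>`, with `None` standing in for a record that fails to parse):

```rust
// Sketch of lazy parsing with poisoning on failure.
struct Archive {
    parsed: Vec<String>, // headers successfully parsed so far, in order
    poisoned: bool,      // set once any header fails to parse
}

impl Archive {
    fn by_index(&mut self, idx: usize, raw_headers: &[Option<String>]) -> Result<&str, String> {
        // Entries parsed before any failure stay available.
        if idx < self.parsed.len() {
            return Ok(&self.parsed[idx]);
        }
        // Everything beyond a known-corrupt header is unreachable:
        // fail immediately instead of re-reading the directory.
        if self.poisoned {
            return Err("archive poisoned by earlier parse failure".into());
        }
        // Otherwise parse forward until the requested entry is reached.
        for raw in &raw_headers[self.parsed.len()..=idx] {
            match raw {
                Some(name) => self.parsed.push(name.clone()),
                None => {
                    self.poisoned = true;
                    return Err("corrupt header".into());
                }
            }
        }
        Ok(&self.parsed[idx])
    }
}
```

In the A/B/C/D example, `by_index(3)` fails at C and poisons the archive, yet `by_index(1)` still returns B from the already-parsed entries, and a repeated `by_index(3)` fails immediately without touching the reader.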
Force-pushed from c4eeb6a to 8ad6bd4.
Rebased on current master, which will hopefully fix the CI errors.
LGTM, but we'll need to justify the extra complexity (and dependency). I want to benchmark the original changes in #89 to see the difference.
Hi! This PR has been up for a while, and I've stopped working on the project I needed this for.

I'm happy for someone else to take the baton here if you're interested in improving zip parsing performance, but unfortunately I don't have the time or interest to carry this forward at the moment.
Thanks for the contribution ❤️ I'm sure it will factor into the more streaming-oriented design we're working on.
perf: use indexmap in read::zip_archive::Shared instead of a separate vec and hashmap
This changeset is based on #89, and aims to improve performance further by minimising allocations and `memcpy`s when parsing central directory records.
In a nutshell, it reads the full central directory into a buffer, and then relies on the changes in #89 to incrementally parse the bytes. While parsing, filenames and comments are made to share the underlying data of the main buffer, avoiding any additional allocations and copies.
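`bytes::Bytes` implements this sharing internally. As a std-only illustration of the idea (the `SharedBuf` type and its methods are invented for this sketch), a reference-counted buffer plus a view range yields allocation-free, copy-free slices:

```rust
use std::ops::Range;
use std::sync::Arc;

// Std-only stand-in for bytes::Bytes: one shared buffer, many cheap views.
#[derive(Clone)]
struct SharedBuf {
    data: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedBuf {
    fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        SharedBuf { data: data.into(), range: 0..len }
    }

    // A filename or comment becomes a view into the central directory
    // buffer: no new allocation, no memcpy, just a bumped refcount.
    fn slice(&self, range: Range<usize>) -> Self {
        let start = self.range.start + range.start;
        let end = self.range.start + range.end;
        assert!(start <= end && end <= self.range.end);
        SharedBuf { data: Arc::clone(&self.data), range: start..end }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.data[self.range.clone()]
    }
}
```

Every parsed filename then keeps the whole central directory buffer alive, which is the trade-off for avoiding per-entry allocations.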
Overall this improves performance for the `file_info` example by more than 25% (for a ~60 MB, 20K-entry zip file: from ~470 ms down to ~340 ms).