This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

Read files inside ZipArchive lazily #89

Closed
wants to merge 4 commits into from

Conversation

Arnavion

Before this change, the entire central directory would be parsed up-front to extract the details of every file in the archive.

With this change, the central directory is parsed on demand: only as many headers are parsed as it takes for the caller to find the file it wants.

This speeds up callers that only want to pick out specific files from
the archive, especially for archives with many thousands of files or
archives hosted on slow storage media. In the case where the caller did want
every file in the archive, this performs no worse than the original code.

Arnavion added 3 commits November 22, 2018 20:25
@srijs
Contributor

srijs commented Nov 24, 2018

Nice! Just tried out this patch on a project I'm working on, which is essentially extracting a handful of files from a very large archive, and it reduced total runtime by ~30% from 180-200ms down to 120-130ms.

The reader can be left misaligned if it encounters a corrupt file header before it
finds the desired file.

For example, consider an archive that contains files A, B, C and D, and the user
requests file D (say via `by_index(3)`). Let's say the reader successfully
parses the headers of A and B, but fails when parsing C because it's corrupt.
In this case, we want the user to receive an error, but we also want to ensure
that future calls to `by_index` or `by_name` don't parse A and B's headers
again.

One way to do this is to preserve the end position of B's header so that future
calls to `read_files_till` resume from it. However, since we've already parsed
B's header and know that C's header failed to parse, we can simply consider all
headers that haven't already been parsed to be unreachable.

This commit marks the archive as poisoned if `read_files_till` fails with any
error, so that only headers which have already been successfully parsed are
considered when getting files from the archive. If a file is requested that
hasn't been parsed, i.e. its header lies beyond the known corrupted header,
the function fails immediately.