This repository has been archived by the owner on Jun 2, 2024. It is now read-only.

Read files inside ZipArchive lazily #89

Closed
wants to merge 4 commits into from

Conversation

Arnavion

Before this change, the entire central directory would be parsed up-front to extract the details of every file in the archive.

With this change, the central directory is parsed on demand: only as many headers are parsed as it takes for the caller to find the file it wants.

This speeds up callers that only want to pick out specific files from
the archive, especially for archives with many thousands of files or
archives hosted on slow storage media. In the case where the caller did want
every file in the archive, this performs no worse than the original code.

Arnavion added 3 commits November 22, 2018 20:25
@srijs
Contributor

srijs commented Nov 24, 2018

Nice! Just tried out this patch on a project I'm working on, which is essentially extracting a handful of files from a very large archive, and it reduced total runtime by ~30% from 180-200ms down to 120-130ms.

The reader can be left misaligned if it encounters a corrupt file header before it
finds the desired file.

For example, consider an archive that contains files A, B, C and D, and the user
requests file D (say via `by_index(3)`). Let's say the reader successfully
parses the headers of A and B, but fails when parsing C because it's corrupt.
In this case, we want the user to receive an error, but we also want to ensure
that future calls to `by_index` or `by_name` don't parse A and B's headers
again.

One way to do this is to preserve the end position of B's header so that future
calls to `read_files_till` resume from it. However, since we've already parsed
B's header and know that C's header failed to parse, we can simply consider all
headers that haven't already been parsed to be unreachable.

This commit marks the archive as poisoned if `read_files_till` fails with any
error, so that only headers which have already been successfully parsed are
considered when getting files from the archive. If a file is requested that
hasn't been parsed, i.e. its header lies beyond the known corrupted header,
the function fails immediately.