
Check uncompressed size before extracting entries of an archive #30

Open
Smascer opened this issue Jul 28, 2021 · 6 comments

Comments

@Smascer

Smascer commented Jul 28, 2021

Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), where several tars are located and two of them are 60 GB uncompressed. ExtractCode extracts them by default.

Would it be possible to add a size limit for those kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive?

@Smascer
Author

Smascer commented Aug 9, 2021

In libarchive2.py you can adapt the write method of the Entry class with something like this:

if self.size > MAX_ENTRY_SIZE: return

where MAX_ENTRY_SIZE is set to e.g. 524288000 (500 MB) to skip all these big files.
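
Expanded into a minimal, self-contained sketch (the Entry shape and the write() signature here are illustrative assumptions, not the actual extractcode API):

```python
# Hypothetical sketch of the suggested guard.
MAX_ENTRY_SIZE = 524288000  # 500 MB

class Entry:
    def __init__(self, path, size):
        self.path = path
        self.size = size

    def write(self, target_dir):
        # Skip entries whose declared uncompressed size exceeds the cap.
        if self.size and self.size > MAX_ENTRY_SIZE:
            return None
        # ... the real method would stream the entry to target_dir here ...
        return self.path
```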

@pombredanne
Member

Would it be possible to add a size limit for those kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive?

Sure thing. I would like to make it available everywhere as an argument, though, with the caveat that in some cases we cannot know the uncompressed size before actually decompressing.

What API and behaviour do you think this should have?

@Ben-Thelen

From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It might then be nice to be able to skip those completely.

So there would be three modes (see the sketch below):

  1. Extracting all archives normally
  2. Skipping too-large archives where the size information is available
  3. Skipping too-large archives where the size information is available, and not extracting at all where it is not
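
One possible way to make these modes concrete, with hypothetical names:

```python
from enum import Enum

class SizeLimitMode(Enum):
    # Illustrative names only, to pin down the three behaviours.
    EXTRACT_ALL = 1       # mode 1: extract everything, no size checks
    SKIP_KNOWN_LARGE = 2  # mode 2: skip entries whose reported size exceeds the cap
    SKIP_IF_UNKNOWN = 3   # mode 3: additionally skip archives with no size info

def should_extract(mode, size, max_bytes):
    """Decide whether to extract an entry of the given (possibly unknown) size."""
    if mode is SizeLimitMode.EXTRACT_ALL:
        return True
    if size is None:
        # Size not reported by the archive format (e.g. some 7z cases).
        return mode is not SizeLimitMode.SKIP_IF_UNKNOWN
    return size <= max_bytes
```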

@Smascer
Author

Smascer commented Aug 10, 2021

I would say the best way is to write all the entries by default. But if you set a limit via the CLI, e.g. extractcode --max-archive-size 512 (for 512 MB), the value would be passed as an argument everywhere, checked before writing, and entries skipped if needed.
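
A sketch of how such a flag could be wired up, assuming a click-based command (the option name and plumbing are illustrative, not the actual extractcode CLI):

```python
import click

@click.command()
@click.option('--max-archive-size', type=int, default=None, metavar='MB',
              help='Skip entries larger than this size in megabytes. '
                   'Extract everything when not set.')
@click.argument('location')
def extract(location, max_archive_size):
    # Convert MB to bytes once, then thread the value down to every extractor.
    max_bytes = max_archive_size * 1024 * 1024 if max_archive_size else None
    click.echo(f'extracting {location} with max entry size: {max_bytes}')

if __name__ == '__main__':
    extract()
```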

@pombredanne
Member

From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It might then be nice to be able to skip those completely.

FWIW, we may also be able to get that from 7-zip-supported archives since we can parse a directory listing: https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/sevenzip.py#L697
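
If that listing exposes per-entry sizes, a pre-extraction check could look roughly like this (the entry structure with a 'Size' field is an assumption about the parsed listing, not a confirmed API):

```python
# Hypothetical check over a parsed 7z listing, where each entry is
# assumed to be a mapping with 'Size' and 'Path' values.
def oversized_entries(entries, max_bytes):
    """Yield paths of entries whose declared uncompressed size exceeds max_bytes."""
    for entry in entries:
        size = entry.get('Size')
        if size and int(size) > max_bytes:
            yield entry.get('Path')
```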

But another approach may be to write in chunks until a max size is reached, then abort and roll back in these cases, AND return some warning/error with the "extract event" stating that this file was not extracted because of a threshold limit.
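
A self-contained sketch of that chunked approach (the helper name and return convention are illustrative):

```python
import os

CHUNK_SIZE = 1024 * 1024  # write 1 MB at a time

def write_with_limit(stream, target_path, max_bytes):
    """Copy stream to target_path in chunks; abort and roll back once
    max_bytes is exceeded. Return a warning message, or None on success."""
    written = 0
    with open(target_path, 'wb') as out:
        for chunk in iter(lambda: stream.read(CHUNK_SIZE), b''):
            written += len(chunk)
            if written > max_bytes:
                break
            out.write(chunk)
    if written > max_bytes:
        # Roll back the partial file and report it as an extract event.
        os.remove(target_path)
        return f'{target_path}: not extracted, uncompressed size exceeds limit'
    return None
```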

@pombredanne
Member

There is a related issue with a 60 GB sparse file, reported in #32 by @goekDil.
For all I know, I would not be surprised if it is the exact same file :)
