
Check uncompressed size before extracting entries of an archive #30

Open
Smascer opened this issue Jul 28, 2021 · 6 comments

Comments

@Smascer

Smascer commented Jul 28, 2021

Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with testdata), where several tars are located and two of them are 60 GB uncompressed. ExtractCode extracts them by default.

Would it be possible to add a size limit for those kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive?

@Smascer
Author

Smascer commented Aug 9, 2021

In libarchive2.py you can adapt the write method of the Entry class with something like this:

if self.size > MAX_ENTRY_SIZE: return

where MAX_ENTRY_SIZE is set to e.g. 524288000 (500 MB) to skip all these big files.
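
Expanded into a minimal, self-contained sketch (the Entry shape and the write() signature here are illustrative assumptions, not the actual extractcode API):

```python
# Hypothetical sketch of the suggested guard.
MAX_ENTRY_SIZE = 524288000  # 500 MB

class Entry:
    def __init__(self, path, size):
        self.path = path
        self.size = size

    def write(self, target_dir):
        # Skip entries whose declared uncompressed size exceeds the cap.
        if self.size and self.size > MAX_ENTRY_SIZE:
            return None
        # ... the real method would stream the entry to target_dir here ...
        return self.path
```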

@pombredanne
Member

Would it be possible to add a size limit for those kinds of files, like an ignore option? Or maybe to set a limit on the maximum uncompressed size of the whole archive?

Sure thing. I would like to make it available everywhere as an argument, though, with the caveat that in some cases we cannot know the uncompressed size before actually decompressing.

What API and behaviour do you think this should have?

@Ben-Thelen

From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It might then be nice to be able to skip those completely.

So there would be three modes (see the sketch below):

  1. Extracting all archives normally
  2. Skipping too-large archives where the size information is available
  3. Skipping too-large archives where the size information is available, and not extracting at all where it is not
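
One possible way to make these modes concrete, with hypothetical names:

```python
from enum import Enum

class SizeLimitMode(Enum):
    # Illustrative names only, to pin down the three behaviours.
    EXTRACT_ALL = 1       # mode 1: extract everything, no size checks
    SKIP_KNOWN_LARGE = 2  # mode 2: skip entries whose reported size exceeds the cap
    SKIP_IF_UNKNOWN = 3   # mode 3: additionally skip archives with no size info

def should_extract(mode, size, max_bytes):
    """Decide whether to extract an entry of the given (possibly unknown) size."""
    if mode is SizeLimitMode.EXTRACT_ALL:
        return True
    if size is None:
        # Size not reported by the archive format (e.g. some 7z cases).
        return mode is not SizeLimitMode.SKIP_IF_UNKNOWN
    return size <= max_bytes
```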

@Smascer
Author

Smascer commented Aug 10, 2021

I would say the best way is to write all the entries by default. But if you set a limit via the CLI, e.g. extractcode --max-archive-size 512 (for 512 MB), the value would be passed as an argument everywhere, checked before writing, and entries skipped if needed.
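
A sketch of how such a flag could be wired up, assuming a click-based command (the option name and plumbing are illustrative, not the actual extractcode CLI):

```python
import click

@click.command()
@click.option('--max-archive-size', type=int, default=None, metavar='MB',
              help='Skip entries larger than this size in megabytes. '
                   'Extract everything when not set.')
@click.argument('location')
def extract(location, max_archive_size):
    # Convert MB to bytes once, then thread the value down to every extractor.
    max_bytes = max_archive_size * 1024 * 1024 if max_archive_size else None
    click.echo(f'extracting {location} with max entry size: {max_bytes}')

if __name__ == '__main__':
    extract()
```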

@pombredanne
Member

From what I've gathered, the uncompressed size is only available in libarchive but not in e.g. 7z? It might then be nice to be able to skip those completely.

FWIW, we may also be able to get that from 7-zip-supported archives since we can parse a directory listing: https://github.com/nexB/extractcode/blob/533ac8a7cf9d83c9fb43600b6b952a62da9acc12/src/extractcode/sevenzip.py#L697
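
If that listing exposes per-entry sizes, a pre-extraction check could look roughly like this (the entry structure with a 'Size' field is an assumption about the parsed listing, not a confirmed API):

```python
# Hypothetical check over a parsed 7z listing, where each entry is
# assumed to be a mapping with 'Size' and 'Path' values.
def oversized_entries(entries, max_bytes):
    """Yield paths of entries whose declared uncompressed size exceeds max_bytes."""
    for entry in entries:
        size = entry.get('Size')
        if size and int(size) > max_bytes:
            yield entry.get('Path')
```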

But another approach may be to write in chunks until a max size is reached, then abort and roll back in these cases, AND return some warning/error with the "extract event" stating that this file was not extracted because of a threshold limit.
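
A self-contained sketch of that chunked approach (the helper name and return convention are illustrative):

```python
import os

CHUNK_SIZE = 1024 * 1024  # write 1 MB at a time

def write_with_limit(stream, target_path, max_bytes):
    """Copy stream to target_path in chunks; abort and roll back once
    max_bytes is exceeded. Return a warning message, or None on success."""
    written = 0
    with open(target_path, 'wb') as out:
        for chunk in iter(lambda: stream.read(CHUNK_SIZE), b''):
            written += len(chunk)
            if written > max_bytes:
                break
            out.write(chunk)
    if written > max_bytes:
        # Roll back the partial file and report it as an extract event.
        os.remove(target_path)
        return f'{target_path}: not extracted, uncompressed size exceeds limit'
    return None
```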

@pombredanne
Member

There is a related issue with a 60 GB sparse file, reported in #32 by @goekDil.
For all I know, I would not be surprised if it is the exact same file :)
