-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZSTD_c_targetCBlockSize == ZSTD_TARGETCBLOCKSIZE_MAX leads to improbably bad compression #2096
Comments
I'm going to ask a question here regarding architecture: does targetCBlockSize cause the compressor to assume that it is going to be fed into a decompressor who will have no knowledge outside of the block it receives? i.e. it is a fully bufferless streaming decompressor? |
I'm not sure whats going on in this scenario, but with the input it should be easy to reproduce and fix. I will look into it soon. Thanks again for the report and the detailed repro instructions! |
Fixes: Enable RLE blocks for superblock mode Fix the limitation that the literals block must shrink. Instead, when we're within 200 bytes of the next header byte size, we will just use the next one up. That way we should (almost?) always have space for the table. Remove the limitation that the first sub-block MUST have compressed literals and be compressed. Now one sub-block MUST be compressed (otherwise we fall back to raw block which is okay, since that is streamable). If no block has compressed literals that is okay, we will fix up the next Huffman table. Handle the case where the last sub-block is uncompressed (maybe it is very small). Before it would skip superblock in this case, now we allow the last sub-block to be uncompressed. To do this we need to regenerate the correct repcodes. Respect disableLiteralsCompression in superblock mode Fix superblock mode to handle a block consisting of only compressed literals Fix a off by 1 error in superblock mode that disabled it whenever there were last literals Fix superblock mode with long literals/matches (> 0xFFFF) Allow superblock mode to repeat Huffman tables Respect ZSTD_minGain(). Tests: Simple check for the condition in #2096. When the simple_round_trip fuzzer enables superblock mode, it checks that the compressed size isn't expanded too much. Remaining limitations: O(targetCBlockSize^2) because we recompute statistics every sequence Unable to split literals of length > targetCBlockSize into multiple sequences Refuses to generate sub-blocks that don't shrink the compressed data, so we could end up with large sub-blocks. We should emit those sections as uncompressed blocks instead. ... Fixes #2096
Using
ZSTD_CCtxParams_setParameter(cctxParams, ZSTD_c_targetCBlockSize, ZSTD_TARGETCBLOCKSIZE_MAX)
leads to improbably bad compression for sources with large minimum sequences (>1kB).Using a repeated concatenation of a minified CSS:
With setting targetCBlockSize == ZSTD_TARGETBLOCKSIZE_MAX:
bootstrap.min.css : 97.31% (13862462 => 13488900 bytes, bootstrap.min.css.zst)
Without setting targetCBlockSize:
bootstrap.min.css : 0.15% (13862462 => 20476 bytes, bootstrap.min.css.zst)
Further note: I encountered this as a result of trying to reduce the block size on decompress regarding #2093. If instead of setting the TargetCompressedBlockSize, ZSTD_BLOCKSIZEMAX itself is reduced, in turn causing the maximum blocksize to be reduced as well, it has a much smaller impact on the compression.
The source used was obtained via the following sequence:
rm bootstrap.min.css; wget https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css && for i in {1..5}; do cat bootstrap.min.css >> bootstrap_2.min.css; cat bootstrap_2.min.css >> bootstrap.min.css; done && rm bootstrap_2.min.css
The text was updated successfully, but these errors were encountered: