Minimizing memory requirements for Decompression? #2093
The decompressor needs to buffer 1 block of input data at a time (128KB). Once it decompresses a block, it doesn't need to reference it again.
The standard streaming API doesn't currently support writing directly into the output buffer, but you can use the bufferless API (see below). However, I think we could safely support a simplified scenario where the input needs to be buffered & streamed, but the output buffer is fixed. I will put up a PR.
You can use the bufferless streaming API to support this use case, in which case the decompressor only needs to allocate the ZSTD_DCtx.
I have a PR up that handles the simple case where you want buffered streaming for the input, but can guarantee that the output buffer is large enough for the entire output and never changes. That will let you use the high-level streaming API, which is much simpler than the buffer-less streaming API.
The current PR addresses the immediate questions I was having, but places the overall memory requirement at the very high end of workable, at 290KB. As for the bufferless streaming API: I'll be honest, I have zero idea whether this solves the problem any better. It is not clear at all whether the 'back-reference distance' is indexing into the decompressed output or the raw compressed input. If it were referencing into the decompressed output, why does there need to be an input buffer allocated, other than block efficiency? If it's referencing the raw, compressed input, how can we only buffer one block and not the full WindowSize? The documentation states that:
This makes it look like it's wanting prior input, unless it's actually referring to the resulting decompressed data encoded by the previous block. Much of the documentation for the bufferless API is ambiguous as to whether the item being referenced is in a compressed or uncompressed state. [Edit] To be clear, thank you for implementing what you have done. This provides a concrete path forward in the current setup, even if it does not solve it more generally.
@dciliske the buffer-less streaming API can shave another ~128KB off of that budget if you don't need to buffer the input blocks. Since zstd needs to read 1 block at a time, we allocate a buffer to hold that block. If you have external buffering, the buffer-less API can be used to avoid these allocations, which leaves you with only the 160KB ZSTD_DCtx.
It is referencing the decompressed output.
It’s referring to the last windowSize bytes of decompressed data. The buffer-less API is certainly not well documented. It is rarely used externally, though it is what we build the standard streaming API with, so no one has spent too much time improving the docs over the years. I will attempt to answer your questions tomorrow morning when I’m at my keyboard, and put up a PR to improve the docs. So if you have any more questions, ask away.

A zstd decoder could be implemented using less memory, but we haven’t optimized for extremely low memory usage. For instance, we have a 128KB buffer for literals to speed up decompression, but the literals could be streamed in an alternate implementation to save that 128KB at the cost of speed. That would get us down to ~32KB of memory, and that could be reduced slightly further. But speed would suffer significantly.
What does "zstd requires to read 1 block at a time" mean? That the decoder requires a block be fully available prior to actually decoding it? That it cannot parse it incrementally? Does the decoder need random access into the block it is decoding, or is it always sequential forward? I ask because, building with DEBUGLEVEL=5, I see that I have a block in a sample that is 53KB long. By my understanding, you are saying that while the library wouldn't be required to buffer that, it currently does need to be buffered. If blocks must be decoded as contiguous segments, is there a way to reduce the generated block size during compression? Does a reduced block size lead to less compression than a larger block size?
The decoder requires a full block to decompress the data. Blocks are, by default, the maximum block size of 128 KB. The zstd format breaks the block up into two segments, the literals and the sequences. Each of those segments is read backwards. The zstd decoder implements this by requiring the whole block be contiguous in memory.
You can use targetCBlockSize to ask the compressor to emit smaller blocks.
Unfortunately, targetCBlockSize seems to completely choke the compressor to the point of being almost non-functional (see #2096). Playing around with adjusting ZSTD_BLOCKSIZE_MAX (via ZSTD_BLOCKSIZELOG_MAX), it looks like I can accomplish the same net effect, with the caveat that it requires the compressor to adhere to the same build configuration. Other than being non-compliant with the standard, is there a downside to reducing ZSTD_BLOCKSIZE_MAX?
Thanks for the bug report! We'll have to improve that for the next release.
That will work. The upside is that it will shrink decompression memory usage in proportion to the reduction in ZSTD_BLOCKSIZE_MAX.
Gotcha. Gotta say, this is an amazing library y'all have built. I've previously used LZ4 and have been smashing my head against the gap between high decompression performance (LZ4) and high compression ratio with moderate performance (LZMA), trying to get better boot times while still fitting the application. Zstd manages to hit all the checkboxes.

Also, as an FYI (not asking for changes): while ZSTD_malloc/ZSTD_free allow for custom allocator overrides at run time, they don't quite solve the issue for minimal embedded systems or pre-boot environments, as the executable still has to link malloc and free. The way it's architected, however, does allow for fairly trivial splicing to remove that dependency. The reason linking malloc and free matters is that they in turn cause a massive cascade of stdlib dependencies to be linked as well, much of which is probably not implemented. I realize that this is primarily a compression library built by the necessities of moving whole datacenters around, but as a general-purpose compression format that intends to become the de facto standard for the next decade or more, it's something to at least have heard of.
At some point we need to update the version of zstd in the kernel. At that point in time I want to spend some time to separate all of our stdlib dependencies into one place that can be replaced to make zstd easier to compile in freestanding mode. So it is definitely on the todo list. |
I am also having trouble with the memory consumption during streaming decompression. I found that …
One of the more important budgets for streaming decompression is the history buffer. Level 2, by default, will target a 1 MB history buffer. To reduce memory usage during streaming, it's possible to set the history size to some smaller value, like 512 KB or even 256 KB. This must be done at the compression stage, using this advanced parameter. The reduced memory budget generally comes at a loss of compression ratio, although, depending on the content to compress, the ratio decrease may in fact not be that large, hence worth the trade-off. It's not uncommon for applications managing thousands of streams in-flight to reduce the history window size in order to keep memory usage under some manageable level.
Main questions:
Context:
I'm working on a new generation bootloader for an embedded device and would like to use zstd for image compression. Going through all the documentation and API, I feel like I'm in a bit of a hole. For context, I have two main blocks of memory I can use: main memory 2-32MB, and internal SRAM 32-512kB.
The image is read over a non-memory mapped serial stream. The destination, however, is a fixed block buffer.
What is unclear is whether the decompressor needs to reuse previously read input data. If it does not, then it should be possible to stream decompress without the intermediate window buffer.
I think I can solve my issue by doing an in-place decompression akin to how it is done in the Linux kernel. However, there is a non-negligible performance penalty for this, as it potentially doubles the bandwidth used for main memory and introduces yet another copy-then-move operation on startup.