Streaming decompression corner case #340
I don't believe that the expected read size can be bigger than the recommended input size.

Corner case 1 is interesting. I think the return value is unambiguous, assuming the caller of ZSTD_decompressStream follows the size hints.

I think Corner case 2 is happening because there isn't enough space in the output buffer. If your output buffer is of size ZSTD_DStreamOutSize(), there is always enough room to flush a full block.

In Corner case 3 the return code is 3 because there are still 3 bytes left in the input buffer that it hasn't read yet, presumably because it fills the output buffer up each time with earlier data.
@terrelln, my main problem is the first case; the other two can be fixed with better documentation. Regarding the first case, it's still ambiguous: you may have multiple frames in one stream, so you can get a read request of size 1 before the input stream is finished if there is another frame in it.
My workaround is to read 9 bytes on the first read, so that when it requests reading 1 byte it will already be in the input buffer and the read can be skipped.
Many good points @luben, let's try to answer or fix them one by one:

You are right that return code 1 can have multiple meanings. In 99% of situations, the first interpretation is the correct one. I consider the second one a specific corner case.

There are 2 ways to use ZSTD_decompressStream: provide input in chunks of the recommended size, or follow the size hints exactly. In case 1, the exact return value has no importance; it only matters whether it is 0 or not. In case 2, it can be a problem, because a hint of 1 is ambiguous. There are 2 ways to get around this problem.

I understand it's not ideal to have value 1 carry two possible meanings.
Yes, but another condition is that the output can be flushed in one go too. Otherwise, the algorithm will stop right after decoding a block.
The last 3 bytes correspond to the header of the following block.
The general idea in the streaming decompression API is that if the return code is 0, the current frame is entirely decoded and flushed. Any other value means decoding is not finished, and hints at how many more bytes to provide.
Hi @Cyan4973, first of all, thanks for your detailed explanation. What value is ZSTD_decompressStream expected to return for a skippable frame?
Yes, Skippable Frame is a special corner case.
Expected read size is guaranteed to be < ZSTD_DStreamInSize(). Note that the function result is a "hint", not a request.
Sounds reasonable. I'll have a look into it.
Sounds reasonable too.
This one is less obvious. What objective does it serve?
Then how should we deal with that? Supply input chunk by chunk, with a chunk size of ZSTD_DStreamInSize() (which will most likely be luben's case)? That will require many wasted buffer copies if the skippable frame's content size is large, 1G for example.
This is possible. If you detect that input is a skippable frame, its content size can be read from the frame header. Then, on the application side, it's necessary to skip the entire skippable frame before resuming decompression.
On case 1: I am trying to follow the size hints. It looks like providing the input up to the hinted size mostly works. When following the size hints, regarding the proposed solutions:
It's a user-supplied buffer; I don't decide its size.
Seems workable but complex. My current workaround is to provide the first read as 9 bytes (frame header + 1), so that I can skip reading the 1 byte that comes just after the frame header, on the grounds that it's already in the source buffer. Is there any other case when it may need to read 1 byte from the source? Is it possible to have a block with a size of 1 byte? The other 2 cases may need just improved documentation, as well as a mention of skippable frames.
The objective is to be able to skip some of the data in the input stream, kind of a fast-forward. I imagine that it can be implemented more efficiently by the compression library, because then the JNI wrapper doesn't need to allocate a throw-away buffer: the output can just be decompressed and discarded in the already allocated ZSTD_outBuffer space.

Regarding the skippable frames: if I am just supplying the input in chunks of ZSTD_DStreamInSize, will ZSTD_decompressStream handle them correctly?
There is a special kind of block (RLE) which only uses 1 byte in compressed format.
Yes.
Great. So I will experiment a little bit more with supplying the buffer in chunks of ZSTD_DStreamInSize, as it seems it will simplify the handling, and release the 1.0.0 JNI bindings later this week.
Note that the frame header has a variable size, which makes it difficult to know how many bytes must be read to reach the first block. It requires a minimum of 5 bytes to know the size of the frame header (which can be any value between 6 and 18 bytes). The frame header is necessarily followed by at least a block size header (+3 bytes). So it seems most logical to ask for these first 5 bytes.
I have implemented it as suggested, but had to combine the two proposed approaches.

I like the approach as it is simpler and saves on source reads. Thanks for the suggestions.
For suggestion 2, I'm trying to find an API which feels logical. Knowing the amount of data left inside internal buffers is possible on the compression side, because the return value of ZSTD_flushStream() tells how much data remains in internal buffers. There is no exact equivalent on the decoder side. One special value can be deduced: zero. If the output buffer is not entirely filled, it means there is no more data left inside internal buffers.
Following a few discussions, the behavior of the streaming decompression API has been slightly amended: on reaching the end of (compressed) input, if the output has not been completely flushed, ZSTD_decompressStream will not flag the input as entirely consumed. This will make it possible for API users to rely on the assumption that if the compressed stream has been entirely consumed, then everything is finished, which is apparently a natural expectation. "1" will mean "please provide one more byte" in all circumstances, no exception. Everything else is the same, so if your streaming decompression implementation already works, it will also work identically in the next version.
What does that mean? Do you mean the input will not be fully consumed (one byte left when not fully flushed), and API users should always check input consumption? One code example would be really appreciated.
Yes, it will pretend so.

No, the current behavior (check that the return code is 0) remains valid. This is meant for users who skip the return code and only check if input is fully consumed. It's not the official way, but I got too much feedback on this to ignore it. For such users, the API will now work correctly too.
See examples/streaming_decompression.c.
Thanks.

Thanks for the change. It will save on the special handling of return code 1.

New streaming decompression API behaviour implemented in v1.1.0.
Hi,

I have a pretty complete decompression implementation of the JNI bindings using the new streaming API, but I hit some corner cases and have some questions and suggestions (using code at 4798793).

Corner case I: return code 1 from ZSTD_decompressStream can mean multiple things. If a file is compressed with

cat file | zstd --no-check -c > file.zst

then the requested read after the frame header is of size 1. This means that we cannot skip requested reads of size 1, because the decompression will not make progress beyond the frame header. On the other side, if we don't skip them, we may try to read beyond the end of the input file.

Corner case II: I was expecting that if we provide the source in chunks with the size requested by the previous invocation of ZSTD_decompressStream, they would be consumed in one go and there would not be a need to provide them again (see the annotated trace below).

Corner case III: If the output size is significantly smaller than 128k and the return code of ZSTD_decompressStream is 3, it means that it has more data in its output buffer.

Here is a trace of the first several iterations for illustration:
And now on to the questions and suggestions:

Question: Can the requested read size (the return of ZSTD_decompressStream) exceed the recommended src buffer size (the return of ZSTD_DStreamInSize)?

Suggestion 1: Can we make the return code of ZSTD_initDStream be the size of the first requested source buffer to read? It would save one iteration, as currently it returns 0 on success.

Suggestion 2: Provide an API call to tell how much unconsumed data there is in the ZSTD output buffer; this would make the implementation of the available() method in InputStreams more accurate.

Suggestion 3: Provide an API call to skip over the decompressed data. Currently I am implementing it in terms of ZSTD_decompressStream, but it needs allocating a throw-away buffer.

Thanks in advance, and congrats on the release of 1.0.0.
CC: @advancedxy , luben/zstd-jni#8