
Streaming decompression corner case #340

Closed
luben opened this issue Sep 4, 2016 · 23 comments

@luben
Contributor

luben commented Sep 4, 2016

Hi,

I have a pretty complete decompression implementation in the JNI bindings using the new streaming API, but I hit some corner cases and have some questions and suggestions (using code at 4798793).

Corner case I: return code 1 from ZSTD_decompressStream can mean multiple things. If a file is compressed with cat file | zstd --no-check -c > file.zst then the requested read after the frame header is of size 1. This means that we cannot skip requested reads of size 1, because then the decompression will not make progress beyond the frame header. On the other hand, if we don't skip them, we may try to read beyond the end of the input file.

Corner case II: I was expecting that if we provide the source in chunks of the size requested by the previous invocation of ZSTD_decompressStream, they will be consumed in one go and there will be no need to provide them again (see the annotated trace below).

Corner case III: If the output size is significantly smaller than 128k and the return code of ZSTD_decompressStream is 3, it means that it still has more data in its output buffer.
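
For reference, the loop I am using follows the size hints roughly like this (a simplified C sketch, not the actual JNI code; error handling is kept to a minimum):

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Simplified sketch (not the actual JNI code) of a loop that follows the size
 * hints returned by ZSTD_decompressStream. For simplicity all input read so
 * far is kept in one growing buffer, which is why srcSize/srcPos in the trace
 * below only ever increase. Handles a single frame; error handling is minimal. */
static int decompress_following_hints(FILE* fin, FILE* fout)
{
    ZSTD_DStream* const ds = ZSTD_createDStream();
    ZSTD_initDStream(ds);

    size_t const dstCapacity = ZSTD_DStreamOutSize();
    size_t srcCapacity = ZSTD_DStreamInSize();
    char* src = malloc(srcCapacity);
    char* const dst = malloc(dstCapacity);

    ZSTD_inBuffer input = { src, 0, 0 };
    size_t toRead = 0;    /* first call with empty input just yields the first size hint */

    for (;;) {
        if (input.size + toRead > srcCapacity) {       /* grow if the hint does not fit */
            srcCapacity = input.size + toRead;
            src = realloc(src, srcCapacity);
            input.src = src;
        }
        size_t const readNow = fread(src + input.size, 1, toRead, fin);
        input.size += readNow;

        ZSTD_outBuffer output = { dst, dstCapacity, 0 };
        size_t const hint = ZSTD_decompressStream(ds, &output, &input);
        if (ZSTD_isError(hint)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(hint)); break; }

        fwrite(dst, 1, output.pos, fout);
        if (hint == 0) break;         /* frame fully decoded and flushed */
        if (toRead > 0 && readNow == 0 && output.pos == 0) break;   /* truncated input */
        toRead = hint;                /* follow the size hint for the next read */
    }

    free(src); free(dst);
    ZSTD_freeDStream(ds);
    return 0;
}
```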

Here is a trace of the first several iterations for illustration:

Request: 32768 from 0
 --
 toRead: 0
 dstSize: 32768 dstPos: 0
 srcSize: 0 srcPos: 0
 decompressStream(...)
 dstSize: 32768 dstPos: 0
 srcSize: 0 srcPos: 0
 --
 toRead: 8                         <- the return code of decompressStream
 dstSize: 32768 dstPos: 0
 srcSize: 0 srcPos: 0
 Reading 8
 src consumed up to: 8
 srcSize: 8 srcPos: 0
 decompressStream(...)
 dstSize: 32768 dstPos: 0
 srcSize: 8 srcPos: 8
 --
 toRead: 1                          <- we can't skip it
 dstSize: 32768 dstPos: 0
 srcSize: 8 srcPos: 8
 Reading 1
 src consumed up to: 9
 srcSize: 9 srcPos: 8
 decompressStream(...)
 dstSize: 32768 dstPos: 0
 srcSize: 9 srcPos: 9
 --
 toRead: 12236
 dstSize: 32768 dstPos: 0
 srcSize: 9 srcPos: 9
 Reading 12236
 src consumed up to: 12245
 srcSize: 12245 srcPos: 9
 decompressStream(...)
 dstSize: 32768 dstPos: 32768
 srcSize: 12245 srcPos: 12242     <- the last 3 bytes were not consumed
 return: 32768                         though we provide the requested size

Request: 32768 from 32768
 --
 toRead: 3                         <- zstd has 96K more in its output buffer
 dstSize: 65536 dstPos: 32768          why is the return code not 1?
 srcSize: 12245 srcPos: 12242
 decompressStream(...)
 dstSize: 65536 dstPos: 65536
 srcSize: 12245 srcPos: 12242
 return: 32768

Request: 32768 from 65536
 --
 toRead: 3
 dstSize: 98304 dstPos: 65536
 srcSize: 12245 srcPos: 12242
 decompressStream(...)
 dstSize: 98304 dstPos: 98304
 srcSize: 12245 srcPos: 12242
 return: 32768

And now on the questions and suggestions:

Question: Can the requested read size (the return of ZSTD_decompressStream) exceed the recommended src buffer size (the return of ZSTD_DStreamInSize)?

Suggestion 1: Can we make the return code of ZSTD_initDStream be the size of the first requested source read? It would save one iteration, as it currently returns 0 on success.

Suggestion 2: Provide an API call to tell how much unconsumed data there is in the ZSTD output buffer - this would make the implementation of the available() method in InputStreams more accurate.

Suggestion 3: Provide an API call to skip over decompressed data. Currently I am implementing it in terms of ZSTD_decompressStream, but that requires allocating a throw-away buffer.

Thanks in advance, and congrats on the release of 1.0.0.

CC: @advancedxy , luben/zstd-jni#8

@terrelln
Contributor

terrelln commented Sep 5, 2016

I don't believe that the recommended input size can be bigger than ZSTD_DStreamInSize() for decompression, but I'm not certain.

Corner case 1 is interesting. I think the return value is unambiguous, assuming the caller of ZSTD_decompressStream knows when there is no more input. But, it is certainly confusing. I think it should probably be noted in the comments that the return value can still be 1 when there is still input to read.

I think Corner case 2 is happening because there isn't enough space in the output buffer. If your output buffer is of size ZSTD_DStreamOutSize(), I wonder if it would still happen.

In Corner case 3 the return code is 3 because there are still 3 bytes left in the input buffer that it hasn't read yet, presumably because it fills the output buffer up each time with earlier data.

@luben
Contributor Author

luben commented Sep 5, 2016

@terrelln, my main problem is the first case; the other two can be fixed with better documentation.

Regarding the first case, it's still ambiguous, as you may have multiple frames in one stream, so you can get a read request of size 1 before the input stream is finished if there is another frame in it.

@luben
Contributor Author

luben commented Sep 5, 2016

My workaround is to read 9 bytes on the first read, so that when it requests reading 1 byte, it will already be in the input buffer and the read can be skipped.

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Many good points @luben, let's try to answer or fix them one by one :

Corner case I: return code 1 from ZSTD_decompressStream can mean multiple things. If a file is compressed with cat file | zstd --no-check -c > file.zst then the requested read after the frame header is of size 1. This means that we cannot skip requested reads of size 1, because then the decompression will not make progress beyond the frame header. On the other hand, if we don't skip them, we may try to read beyond the end of the input file.

You are right that return code 1 can have multiple meanings.
It means either that :

  • there is still some data to flush, but after that, streaming doesn't expect any more input.
  • streaming would like to read one more byte to move to the next stage.

In 99% of situations, the first interpretation is the correct one. I consider the second one a specific corner case.

There are 2 ways to use ZSTD_decompressStream() return code :

  • just check if return == 0, to know when a frame is completed. Effectively, do not take the size hint into consideration, just throw at decompressStream whatever input is available.
  • check the return value and follow its instruction scrupulously, by providing the number of input bytes requested.

In case 1, the exact return value has no importance. It only matters whether it is ZSTD_isError(), 0, or > 0. So the double meaning of value 1 is of no importance.

In case 2, it can be a problem, because 1 may be interpreted as "one more byte", which is incorrect, according to source documentation. 1 can mean "I expect nothing more, but still need to flush something".

There are 2 ways to get around this problem :

  • Use ZSTD_DStreamOutSize() to size the output buffer.
    In which case, the stream will always be able to decompress a full block.
    Therefore, since following the return value's input size hint leads to decompressing blocks one by one, it's always guaranteed that the current block is fully flushed into the output buffer. In that case, there is no situation where the return value can be 1.
    This is the strategy used by fileio.
  • When the return value is 1, start by flushing anything which may be left within internal buffers.
    Do not provide any more input.
    When flushing doesn't produce any more bytes, and the return value is still 1, then it means the stream is really waiting for one more byte. Give it one more byte.

I understand it's not ideal to have value 1 for 2 different purposes.
Unfortunately, I have not found a better idea (without introducing another write variable).
0 can't be used, since it also means "end of frame", which is the signal that a frame is fully decoded and fully flushed.
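
To illustrate the first work-around above, here is a minimal sketch of a loop that sizes the output buffer with ZSTD_DStreamOutSize(), ignores the size hint entirely, and only checks the return value for error / 0 (single frame, error handling kept to a minimum):

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Minimal sketch of the "fileio" strategy : output buffer of ZSTD_DStreamOutSize()
 * (so a full block can always be flushed), feed whatever input is available,
 * and only test the return value for error / 0. Decompresses a single frame. */
static int decompress_ignoring_hints(FILE* fin, FILE* fout)
{
    size_t const inSize  = ZSTD_DStreamInSize();
    size_t const outSize = ZSTD_DStreamOutSize();
    void* const inBuff   = malloc(inSize);
    void* const outBuff  = malloc(outSize);

    ZSTD_DStream* const ds = ZSTD_createDStream();
    ZSTD_initDStream(ds);

    size_t ret = 1;   /* any value > 0 : frame not finished yet */
    while (ret != 0) {
        size_t const read = fread(inBuff, 1, inSize, fin);
        if (read == 0) { fprintf(stderr, "truncated input\n"); break; }
        ZSTD_inBuffer input = { inBuff, read, 0 };

        /* keep calling until this chunk is fully consumed (or the frame ends) */
        while (input.pos < input.size) {
            ZSTD_outBuffer output = { outBuff, outSize, 0 };
            ret = ZSTD_decompressStream(ds, &output, &input);
            if (ZSTD_isError(ret)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(ret)); goto cleanup; }
            fwrite(outBuff, 1, output.pos, fout);
            if (ret == 0) break;   /* frame fully decoded and flushed */
        }
    }
cleanup:
    ZSTD_freeDStream(ds);
    free(inBuff); free(outBuff);
    return 0;
}
```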

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Corner case II: I was expecting that if we provide the source in chunks of the size requested by the previous invocation of ZSTD_decompressStream, they will be consumed in one go and there will be no need to provide them again (see the annotated trace below).

Yes, but another condition is that the output can be flushed in one go too.

Otherwise, the algorithm will stop right after decoding a block,
and wait there until all output is flushed.
Only then will it resume.
The next stage is typically to read the next block's header (3 bytes).

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

srcSize: 12245 srcPos: 12242 <- the last 3 bytes were not consumed though we provide the requested size

The last 3 bytes correspond to the header of the following block.
They would have been read if the block had been fully flushed.
Since it's not, the decoder waits for all output to be flushed before moving to the next stage (reading the next block's header).

toRead: 3 <- zstd has 96K more in its output buffer, why is the return code not 1?

The general idea in the streaming decompression API is that if the return code is > 0, there is still some more work to do.
When > 0, the value is a hint at the number of bytes expected.
But 1 is an exception : the function will return 1 when it doesn't expect more input but still needs to flush some output. So 1 must be handled with special care.

Any other value > 0 means the decoder expects some more input, irrespective of the fact that some data remains to be flushed. So in this case, it says 3 because it knows that a block header must follow, but it will only start to read it after the previous block gets completely flushed.

@advancedxy

Hi, @Cyan4973, first of all, thanks for your detailed explanation.

What value is ZSTD_decompressStream expected to return for a skippable frame?
Can it return values greater than ZSTD_DStreamInSize()?

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Can it return values greater than ZSTD_DStreamInSize()?

Yes, Skippable Frame is a special corner case.

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Question: Can the requested read size (the return of ZSTD_decompressStream) exceed the recommended src buffer size (the return of ZSTD_DStreamInSize)?

The expected read size is guaranteed to be < ZSTD_DStreamInSize(),
with the exception of skippable frames, which can have any size, up to 4 GB - 1.

Note that the function result is a "hint", not a request.
The next input can be of any size; it will still work.

Suggestion 1: Can we make the return code of ZSTD_initDStream be the size of the first requested source read? It would save one iteration, as it currently returns 0 on success.

Sounds reasonable. I'll have a look into it.

Suggestion 2: Provide an API call to tell how much unconsumed data there is in the ZSTD output buffer - this would make the implementation of the available() method in InputStreams more accurate.

Sounds reasonable too.

Suggestion 3: Provide an API call to skip over decompressed data. Currently I am implementing it in terms of ZSTD_decompressStream, but that requires allocating a throw-away buffer.

This one is less obvious. What objective does it serve ?

@advancedxy

with the exception of skippable frames, which can have any size, up to 4 GB - 1.

Then how should we deal with that? Supply input chunk by chunk with a chunk size of ZSTD_DStreamInSize() (which will most likely be luben's case)? That would require many wasted buffer copies if the skippable frame's content size is large, 1 GB for example.
Can we skip this operation and just tell ZSTD_decompressStream() we have skipped all the data in the skippable frame?

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Can we skip this operation and just tell ZSTD_decompressStream() we have skipped all the data in the skippable frame?

This is possible.

If you detect that the input is a skippable frame,
for example because it requests an input size > ZSTD_DStreamInSize(),
it's possible to "reset" the decoder anytime, using ZSTD_initDStream().

Then, on the application side, it's necessary to skip the entire skippable frame,
and start decoding again at the exact beginning of the next frame.
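
A rough sketch of that idea, detecting the skippable frame from its magic number rather than from the size hint (it assumes the application peeks at the first 8 bytes of each frame before feeding them to the decoder; names are illustrative):

```c
#include <stdio.h>
#include <stdint.h>
#include <zstd.h>

/* Sketch : given the first 8 bytes of a frame, detect a skippable frame,
 * skip its content on the application side, and reset the decoder with
 * ZSTD_initDStream() so decoding can resume at the start of the next frame.
 * Returns 1 if a skippable frame was skipped, 0 otherwise. */
static int skip_if_skippable_frame(ZSTD_DStream* ds, FILE* fin, const unsigned char header[8])
{
    /* 4-byte little-endian magic number : 0x184D2A50 .. 0x184D2A5F for skippable frames */
    uint32_t const magic = (uint32_t)header[0] | ((uint32_t)header[1] << 8)
                         | ((uint32_t)header[2] << 16) | ((uint32_t)header[3] << 24);
    if ((magic & 0xFFFFFFF0u) != 0x184D2A50u) return 0;   /* not a skippable frame */

    /* 4-byte little-endian content size follows the magic number */
    uint32_t const frameSize = (uint32_t)header[4] | ((uint32_t)header[5] << 8)
                             | ((uint32_t)header[6] << 16) | ((uint32_t)header[7] << 24);

    fseek(fin, (long)frameSize, SEEK_CUR);   /* skip the frame content (use a 64-bit seek for huge frames) */
    ZSTD_initDStream(ds);                    /* reset the decoder before resuming at the next frame */
    return 1;
}
```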

@luben
Contributor Author

luben commented Sep 5, 2016

On case 1: I am trying to follow the size hints. It looks like providing the input up to ZSTD_DStreamInSize may be simpler; I will try it.

When following the size hint, the proposed solutions are:

Use ZSTD_DStreamOutSize() to size the output buffer

it's a user-supplied buffer; I don't decide its size.

When the return value is 1, start by flushing anything which may be left within internal buffers.

Seems workable but complex.

My current workaround is to make the first read 9 bytes (frame header + 1), so that I can skip reading the 1 byte that comes just after the frame header on the grounds that it's already in the source buffer. Is there any other case when it may need to read 1 byte from the source? Is it possible to have a block of size 1 byte?

The other 2 cases may just need improved documentation, as well as a mention of the skippable frames.

Suggestion 3: Provide an API call to skip over decompressed data. Currently I am implementing it in terms of ZSTD_decompressStream, but that requires allocating a throw-away buffer.

This one is less obvious. What objective does it serve ?

The objective is to be able to skip some of the data in the input stream, kind of a fast-forward. I imagine that it can be implemented more efficiently by the compression library, because the JNI wrapper wouldn't need to allocate a throw-away buffer, as the output could just be decompressed and discarded in the already allocated ZSTD_outBuffer space.
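
What I currently do looks roughly like this (a C sketch of the idea rather than the actual JNI code; names are illustrative):

```c
#include <stdlib.h>
#include <zstd.h>

/* Sketch of "fast-forward" : decompress up to `toSkip` bytes and throw them away.
 * A throw-away buffer has to be allocated just for this, which is the part a
 * dedicated API call could avoid. Returns the number of decompressed bytes skipped;
 * if it is less than `toSkip`, the caller must refill `input` and call again. */
static size_t skip_decompressed(ZSTD_DStream* ds, ZSTD_inBuffer* input, size_t toSkip)
{
    size_t const scratchSize = ZSTD_DStreamOutSize();
    void* const scratch = malloc(scratchSize);
    size_t skipped = 0;

    while (skipped < toSkip) {
        size_t const want = toSkip - skipped;
        ZSTD_outBuffer output = { scratch, want < scratchSize ? want : scratchSize, 0 };
        size_t const ret = ZSTD_decompressStream(ds, &output, input);
        if (ZSTD_isError(ret)) break;
        skipped += output.pos;                   /* decompressed bytes are simply discarded */
        if (ret == 0) break;                     /* end of frame reached before toSkip bytes */
        if (output.pos == 0 && input->pos == input->size) break;   /* out of input : caller refills */
    }
    free(scratch);
    return skipped;
}
```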

Regarding the skippable frames: if I am just supplying the input in chunks of ZSTD_DStreamInSize, will ZSTD_decompressStream handle them correctly?

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Is it possible to have a block of size 1 byte?

There is a special kind of block (RLE) which only uses 1 byte in compressed format.
But it's currently never generated, so it never happens.

Regarding the skippable frames: if I am just supplying the input in chunks of ZSTD_DStreamInSize, will ZSTD_decompressStream handle them correctly?

Yes.

@luben
Contributor Author

luben commented Sep 5, 2016

Great. So I will experiment a little bit more with supplying the input in chunks of ZSTD_DStreamInSize, as it seems it will simplify the handling, and release the 1.0.0 JNI bindings later this week.

@Cyan4973
Contributor

Cyan4973 commented Sep 5, 2016

Can we make the return code of ZSTD_initDStream be the size of the first requested source read? It would save one iteration, as it currently returns 0 on success.

Note that the frame header has a variable size, which makes it difficult to know how many bytes must be read to reach the first block.

It requires a minimum of 5 bytes to know the size of the frame header (which can be any value between 6 and 18 bytes). The frame header is necessarily followed by at least a block size header (+3 bytes).

So, it seems the most logical choice is to ask for these first 5 bytes.
They are enough to bootstrap the rest of the process.
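
In other words, a caller could bootstrap roughly like this (a sketch; the helper name is illustrative and the decoder is assumed to be freshly initialized):

```c
#include <stdio.h>
#include <zstd.h>

/* Sketch of the bootstrap described above : read just 5 bytes
 * (magic number + frame header descriptor), feed them to a freshly
 * initialized decoder, and let the returned hint size the next read. */
static size_t bootstrap_first_read(ZSTD_DStream* ds, FILE* fin)
{
    unsigned char hdr[5];
    size_t const got = fread(hdr, 1, sizeof(hdr), fin);

    char scratch[8];                       /* no output is produced while parsing the header */
    ZSTD_inBuffer  input  = { hdr, got, 0 };
    ZSTD_outBuffer output = { scratch, sizeof(scratch), 0 };

    size_t const hint = ZSTD_decompressStream(ds, &output, &input);
    /* on success, `hint` is the suggested size of the next read
     * (the rest of the frame header, then the first block) */
    return hint;
}
```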

@luben
Contributor Author

luben commented Sep 5, 2016

I have implemented it as suggested but had to combine the two proposed approaches.

  1. I am reading the source in chunks of ZSTD_DStreamInSize, but I still have to avoid reading from the source if the hint is 1. In general, if the frame decompression is not finished, I raise an exception when I can't read, as this may mean a broken file.
  2. I am tracking the decompression progress, as there may still be cases when only a 1-byte read is requested, e.g. when the first frame is of size ZSTD_DStreamInSize()-8.

I like the approach as it is simpler and saves on source reads.

Thanks for the suggestions

@Cyan4973
Contributor

Cyan4973 commented Sep 6, 2016

For suggestion 2, I'm trying to find an API which feels logical.

Knowing the amount of data left inside internal buffers is possible on the compression side, because there is the function ZSTD_flushStream(), whose return code tells how many bytes are left.

There is no exact equivalent on the decoder side.
The closest equivalent is to use ZSTD_decompressStream() with an empty input, but the return code won't tell how much data is left inside internal buffers.

One special value can be deduced : zero. If the output buffer is not entirely filled, it means there is no more data left inside internal buffers.
If the output buffer is completely filled, there is likely some data left, but no indication of how much (note : it could be zero, in the special case where the buffer has exactly the same size as the remaining data).
This condition is a strong hint that another flush will be necessary, though.
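
In code, that deduction could look roughly like this (a sketch, not an existing API; any bytes flushed into the caller's output buffer still have to be delivered to the reader):

```c
#include <zstd.h>

/* Sketch of the "deduce zero" heuristic : flush with an empty input and look at
 * how much of the caller's output buffer gets filled. Returns 1 when the internal
 * buffers are known to be empty, 0 when some data may remain (amount unknown).
 * Bytes flushed into `output` still have to be delivered to the reader. */
static int internal_buffers_known_empty(ZSTD_DStream* ds, ZSTD_outBuffer* output)
{
    ZSTD_inBuffer input = { "", 0, 0 };        /* empty input : pure flush */

    size_t const ret = ZSTD_decompressStream(ds, output, &input);
    if (ZSTD_isError(ret)) return 0;

    /* not completely filled => nothing left inside ;
     * completely filled => probably some data left, but no indication of how much */
    return output->pos < output->size;
}
```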

@Cyan4973
Contributor

Cyan4973 commented Sep 9, 2016

Following a few discussions, the behavior of the streaming decompression API has been slightly amended :

On reaching the end of (compressed) input, and if the output has not been completely flushed,
it will pretend that it did not read the last byte, and keep asking for it, until the output is entirely flushed.

This will make it possible for API users to rely on the assumption that if the compressed stream has been entirely consumed, then everything is finished, which is apparently a natural expectation. "1" will mean "please one more byte" in all circumstances, no exception.

Everything else is the same, so if your streaming decompression implementation already works, it will also work identically on the next version.

@advancedxy

it will pretend that it did not read the last byte, and keep asking for it, until the output is entirely flushed.

What does that mean? Do you mean the input will not be fully consumed (one byte left when not fully flushed), and API users should always check input consumption?

One code example would be really appreciated.

@Cyan4973
Contributor

Cyan4973 commented Sep 9, 2016

Do you mean the input will not be fully consumed

Yes, it will pretend so

API users should always check input consumption

No, current behavior (check that return code is 0) remains the official way to know when a frame is fully decoded.

This is meant for users who skip the return code and only check if the input is fully consumed. It's not the official way, but I got too much feedback on this to ignore it. For such users, the API will now work correctly too.

One code example would be really appreciated.

see examples/streaming_decompression.c

@advancedxy

advancedxy commented Sep 9, 2016

see examples/streaming_decompression.c

Thanks.

@luben
Contributor Author

luben commented Sep 9, 2016

Thanks for the change. It will save on the special handling of return code 1.

@Cyan4973
Contributor

New streaming decompression API behaviour implemented in v1.1.0.
