Zstd compression does not encode content size in header #182

mkitti · 2024-07-26T01:53:46Z

The Zstd writer implemented here is based on the Zstd streaming API. When encoding a chunk, the pledged source size is not set, so the frame content size is not encoded in the Zstd frame header.

Some Zstd decoding implementations such as numcodecs.js and Zarr numcodecs rely upon ZSTD_getFrameContentSize() to allocate a decompression buffer.

https://github.com/zarr-developers/numcodecs/blob/main/numcodecs%2Fzstd.pyx#L182-L184

As written here, chunks encoded by Zstd via Tensorstore will return ZSTD_CONTENTSIZE_UNKNOWN from ZSTD_getFrameContentSize().
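To make "not encoded in the frame header" concrete, here is a rough, standard-library-only sketch that reads the Frame_Content_Size field directly from a frame's header bytes, following the layout described in RFC 8878. The function name `declared_content_size` is made up for illustration, it handles only regular (non-skippable) frames, and it is not how numcodecs or Tensorstore inspect frames.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # magic number 0xFD2FB528, stored little-endian


def declared_content_size(frame: bytes):
    """Return the content size declared in a Zstd frame header, or None.

    Hypothetical helper for illustration; header layout per RFC 8878.
    """
    if len(frame) < 5 or frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a regular Zstandard frame")

    descriptor = frame[4]                  # Frame_Header_Descriptor byte
    fcs_flag = descriptor >> 6             # bits 7-6: Frame_Content_Size_Flag
    single_segment = (descriptor >> 5) & 1  # bit 5: Single_Segment_Flag
    dict_id_flag = descriptor & 0b11       # bits 1-0: Dictionary_ID_Flag

    # Width of the Frame_Content_Size field in bytes. With flag 0 the field
    # exists (1 byte) only when Single_Segment_Flag is set.
    fcs_size = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
    if fcs_size == 0:
        return None  # what ZSTD_getFrameContentSize() reports as "unknown"

    # Skip the optional Window_Descriptor and Dictionary_ID fields.
    offset = 5 + (0 if single_segment else 1) + (0, 1, 2, 4)[dict_id_flag]
    field = frame[offset:offset + fcs_size]
    if len(field) != fcs_size:
        raise ValueError("truncated frame header")

    value = int.from_bytes(field, "little")
    # The 2-byte encoding stores the size minus 256.
    return value + 256 if fcs_size == 2 else value
```

Applied to the first bytes of a chunk file, a result of `None` corresponds to the `ZSTD_CONTENTSIZE_UNKNOWN` case described above.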

xref: google/neuroglancer#625


mkitti · 2024-07-26T04:52:58Z

Here's an illustration of saving a zarr array with tensorstore using zstd compression. python-zstandard is unable to open the chunk unless max_output_size is provided.

In [1]: import tensorstore as ts, zstandard as zstd

In [2]: ds = ts.open({
   ...:     'driver': 'zarr',
   ...:     'kvstore': {
   ...:         'driver': 'file',
   ...:         'path': 'tmp/zarr_zstd_dataset',
   ...:     },
   ...:     'metadata': {
   ...:         'compressor': {
   ...:             'id': 'zstd',
   ...:             'level': 3,
   ...:         },
   ...:         'shape': [1024, 1024],
   ...:         'chunks': [64, 64],
   ...:         'dtype': '|u1',
   ...:     }
   ...: }).result()

In [3]: ds[:] = 5

In [4]: with open("tmp/zarr_zstd_dataset/0/0", "rb") as f:
   ...:     src = f.read()
   ...: 

In [5]: zstd.backend_c.frame_content_size(src)
Out[5]: -1

In [6]: zstd.ZstdDecompressor().decompress(src)
---------------------------------------------------------------------------
ZstdError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 zstd.ZstdDecompressor().decompress(src)

ZstdError: could not determine content size in frame header

In [7]: zstd.ZstdDecompressor().decompress(src, max_output_size=1024*1024)
Out[7]: b'\x05\x05\x05\x05\x05\x05\x05\x05\x05 [...] \x05\x05\x05 '

For an example of being unable to open the dataset with zarr-python see zarr-developers/zarr-python#2056

jbms · 2024-07-26T05:07:21Z

Thanks for investigating this so thoroughly!

We can probably ensure that tensorstore includes the uncompressed size in the header in this case, but in general there could be multiple variable-output-size codecs chained and it is desirable to be able to do streaming encoding.

Therefore in addition to that, other implementations should still support decoding without the size in the header.

mkitti changed the title ~~Zstd compression does not encode content size in~~ Zstd compression does not encode content size in header Jul 26, 2024

mkitti mentioned this issue Jul 26, 2024

zarr-python cannot read arrays saved by tensorstore using the zstd compressor zarr-developers/zarr-python#2056

Open

mkitti mentioned this issue Aug 15, 2024

fix(zstd): Upgrade numcodecs.js to 0.3.2 for Zstd streaming decompression google/neuroglancer#639

Merged

mkitti mentioned this issue Nov 18, 2024

Update Zstd decompression for "unknown decompressed size" when streaming API was used for compression HDFGroup/hdf5_plugins#116

Open

