Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] V2 Codec pipeline is not consistent with legacy usage of filters #2325

Closed
Tracked by #2412
mpiannucci opened this issue Oct 9, 2024 · 8 comments · Fixed by #2425
Closed
Tracked by #2412

[v3] V2 Codec pipeline is not consistent with legacy usage of filters #2325

mpiannucci opened this issue Oct 9, 2024 · 8 comments · Fixed by #2425
Labels
bug Potential issues with the zarr-python library
Milestone

Comments

@mpiannucci
Copy link

Zarr version

3.0.0a8

Numcodecs version

0.13.0

Python Version

3.11

Operating System

Mac

Installation

pip in virtual environment

Description

I am reading a kerchunk reference filesystem as a zarr v2 store with zarr python. The entire reference file is attached, but an example .zarray is as follows:

    {
        "shape": [
            1
        ],
        "fill_value": null,
        "zarr_format": 2,
        "order": "C",
        "filters": [
            {
                "id": "zlib",
                "level": 2
            }
        ],
        "dimension_separator": ".",
        "compressor": null,
        "chunks": [
            1
        ],
        "dtype": "<i4"
    }

Notably, the filters contains zlib which is a BytesBytesCodec in zarr-python v3+. The issue comes, when we look at the codec pipeline created for V2 arrays:

def create_codec_pipeline(metadata: ArrayV2Metadata | ArrayV3Metadata) -> CodecPipeline:
if isinstance(metadata, ArrayV3Metadata):
return get_pipeline_class().from_codecs(metadata.codecs)
elif isinstance(metadata, ArrayV2Metadata):
return get_pipeline_class().from_codecs(
[V2Filters(metadata.filters), V2Compressor(metadata.compressor)]
)
else:
raise TypeError

This defines the pipeline as two codecs, filters and compressor. The problem here is that V2Filters is defined as a ArrayArrayCodec and V2Compressor is defined as a ArrayBytesCodec. Because of this, all codecs defined in the metadata as filters are expected to be ArrayArrayCodecs and applied once the buffer is translated to an array. Further, the compressor can only define one codecs that is an ArrayBytesCodec, which leaves no place to define a BytesBytesCodec in v2 metadata.

With the current .zarray above, the pipeline crashes because the zlib codec outputs bytes and not an array as is expected.

test_dict.json

cc @jhamman @martindurant

Steps to reproduce

pip install git+https://github.com/TomAugspurger/zarr-python@xarray-compat git+https://github.com/TomAugspurger/xarray/@fix/zarr-v3
import json
import fsspec
import xarray as xr
import zarr

test_dict = json.load('test_dict.json')
fs = fsspec.implementations.reference.ReferenceFileSystem(fo=test_dict, remote_options=remote_options)
store =  zarr.storage.RemoteStore(fs, mode="r")

# This will crash in the codec pipeline
ds = xr.open_dataset(store, engine="zarr", zarr_format=2, backend_kwargs=dict(consolidated=False))

Additional output

No response

@mpiannucci mpiannucci added the bug Potential issues with the zarr-python library label Oct 9, 2024
@jhamman
Copy link
Member

jhamman commented Oct 9, 2024

also cc @normanrz who knows the v2 pipeline path the best

@normanrz
Copy link
Member

Thanks for reporting this. I wasn't aware, that zarr-python 2 worked with using compressors, such as zlib, as filter. That will need some workarounds in zarr-python 3.

@martindurant
Copy link
Member

In v2, there is no conceptual difference between filters and codecs. To reiterate, the pipeline actually run is filers + ([] if compression is None else [compression]), at least in python. This is why the numcodecs classes don't specify if they are compression, or bytes-to-bytes or whatever, but they take and produce "buffers" which may be numpy or bytes/memoryview.

@rabernat
Copy link
Contributor

Should we bring this API back to V3 as a convenience / compatibility layer?

We could inspect the arguments to filters + compressors and translate them into a proper V3 codec pipeline.

@d-v-b
Copy link
Contributor

d-v-b commented Oct 10, 2024

i'm not sure it is generally possible to map a v2 ([filters], compressor) setup into v3, but we could define a v2-specific codec pipeline that places no restrictions on the order of its contents, and otherwise adheres to the same protocol (i.e., the same behavior) as the v3 codec pipeline. I can't wait to see if someone tries to back-port sharding into v2 using this mechanism.

@rabernat
Copy link
Contributor

In terms of breaking API changes, this is one of the biggest ones from V2 to V3. Folks who depend on zarr python have existing code that creates compressors and filters.

i'm not sure it is generally possible to map a v2 ([filters], compressor) setup into v3

We could try!

codecs = [v2_to_v3_codec(filt) for filt in filters] + [v2_to_v3_codec(compressor)]

I can't wait to see if someone tries to back-port sharding into v2 using this mechanism.

Let's leave that out of scope.

@martindurant
Copy link
Member

codecs = [v2_to_v3_codec(filt) for filt in filters] + [v2_to_v3_codec(compressor)]

(with check for filters or compressor being None)

This will work. If there's some dataset that defines these such that an array codec tries to act on the output of a bytes codec, we can warn maybe, and understand that they were relying on bytes buffers even where we think a codec produces arrays.

@normanrz normanrz modified the milestones: 3.0.0.beta, 3.0.0 Oct 11, 2024
@jhamman jhamman added the V3 label Oct 18, 2024
@jhamman jhamman changed the title V2 Codec pipeline is not consistent with legacy usage of filters [v3] V2 Codec pipeline is not consistent with legacy usage of filters Oct 24, 2024
@jhamman
Copy link
Member

jhamman commented Oct 24, 2024

closed by #2325

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants