`zarr-python` v3 compatibility #516

mpiannucci · 2024-10-10T17:33:36Z

So far the only tested file type is HDF5. That's the only module that currently works with the new zarr-python in some way.

martindurant

There's much less change here than I might have thought.

kerchunk/tests/test_hdf.py

kerchunk/hdf.py

martindurant · 2024-10-10T17:49:06Z

kerchunk/hdf.py

@@ -496,6 +516,8 @@ def _translator(
                        if h5obj.fletcher32:
                            logging.info("Discarding fletcher32 checksum")
                            v["size"] -= 4
+                        key =  str.removeprefix(h5obj.name, "/") + "/" + ".".join(map(str, k))


This is the same as what _chunk_key did? Maybe make it a function with a comment saying it's a copy/reimplementation.

By the way, is h5obj.name not actually a string, so you could have done h5obj.name.removeprefix()?
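
A minimal sketch of the helper suggested above, assuming `h5obj.name` is a plain `str`. The name `chunk_key` is hypothetical and is not the original `_chunk_key`:

```python
def chunk_key(name: str, index: tuple) -> str:
    """Format a store key for one chunk, e.g. ("/depth", (0, 1)) -> "depth/0.1".

    Hypothetical reimplementation of the key formatting shown in the diff.
    """
    # HDF5 object names start with "/", but the store keys must not.
    return name.removeprefix("/") + "/" + ".".join(map(str, index))


key = chunk_key("/depth", (0, 1))
```

Calling the method on the string directly (`name.removeprefix("/")`) is equivalent to `str.removeprefix(name, "/")` when `name` really is a `str`.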

martindurant · 2024-10-10T17:50:41Z

kerchunk/hdf.py

                    shape=h5obj.shape,
                    dtype=dt or h5obj.dtype,
                    chunks=h5obj.chunks or False,
                    fill_value=fill,
-                    compression=None,
+                    compressor=None,


So here, you could reintroduce the compressor

    compressor = filters[-1]
    filters = filters[:-1]

but obviously it depends on whether there are indeed any filters at all.

It would still need back compat, since filters-only datasets definitely exist.

Yeah, the big issue is that v3 cares about what type of operation it is and v2 doesn't, so moving them around doesn't necessarily fix that bug.

So there needs to be a change upstream?

Yes this: zarr-developers/zarr-python#2325
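
The compressor suggestion in this thread could be sketched as follows. This is only an illustration: `split_filters` and the `is_compressor` predicate are hypothetical names, the codecs are represented by plain strings, and, as noted above, splitting the list does not by itself change how v3 classifies each codec.

```python
def split_filters(filters, is_compressor=lambda codec: True):
    """Return (filters, compressor), treating the last filter as the compressor.

    `is_compressor` is a hypothetical caller-supplied check; it lets
    filters-only datasets keep all their filters and get compressor=None.
    """
    filters = list(filters or [])
    if filters and is_compressor(filters[-1]):
        # Read the last entry before trimming the list.
        return filters[:-1], filters[-1]
    return filters, None


split = split_filters(["shuffle", "zlib"])
```

An empty or missing filter list yields `([], None)`, which covers the "are there any filters at all" case.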

martindurant · 2024-10-10T17:51:36Z

kerchunk/hdf.py

+            for k, v in self.store_dict.items():
+                if isinstance(v, zarr.core.buffer.cpu.Buffer):
+                    key = str.removeprefix(k, "/")
+                    new_keys[key] = v.to_bytes()
+                    keys_to_remove.append(k)
+            for k in keys_to_remove:
+                del self.store_dict[k]


This is the hacky bit and could use some explanations. Even when requesting "v2", zarr makes Buffer objects, and the keys are also wrong?

Yeah, so two issues here:

1. The keys we get from hdf are, for example, /depth/.zarray when they need to be depth/.zarray.

2. We can't jsonify buffers, which is how the internal MemoryStore in v3 stores its data. So we need to convert the buffers to bytes to be serialized.

OK - would appreciate comments on the code saying this.
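
The two fixes described in this thread can be sketched without zarr installed. `FakeBuffer` is a stand-in for `zarr.core.buffer.cpu.Buffer` (only its `to_bytes()` method, which the diff relies on, is mimicked), and `normalize_store` is a hypothetical name. Unlike the diff, this version builds a new dict instead of mutating in place and strips the leading slash from every key.

```python
class FakeBuffer:
    """Stand-in for zarr's Buffer: wraps bytes and exposes to_bytes()."""

    def __init__(self, data: bytes):
        self._data = data

    def to_bytes(self) -> bytes:
        return self._data


def normalize_store(store_dict, buffer_type=FakeBuffer):
    """Return a copy with relative keys and plain-bytes values."""
    out = {}
    for k, v in store_dict.items():
        # 1. "/depth/.zarray" must become "depth/.zarray".
        key = k.removeprefix("/")
        # 2. Buffer objects cannot be serialized to JSON, so unwrap them.
        out[key] = v.to_bytes() if isinstance(v, buffer_type) else v
    return out


refs = normalize_store({"/depth/.zarray": FakeBuffer(b"{}"), "x/0": b"ab"})
```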

kerchunk/hdf.py

kerchunk/tests/test_hdf.py

mpiannucci · 2024-10-15T18:37:23Z

Also worth noting: zarr 3 doesn't support numcodecs codecs out of the box. There is a PR to help with this, zarr-developers/numcodecs#524 (see also zarr-developers/numcodecs#597 for an updated version), but it would mean a change to codec names, which causes an incompatibility. For the initial icechunk examples we handle this in virtualizarr, but long term it probably belongs here so that kerchunk works standalone.

mpiannucci · 2024-10-21T13:18:19Z

When zarr-developers/zarr-python#2425 goes in, it should unblock this to work fully with zarr-python v3.

We will also need to create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs, so that a given dataset can be loaded in either v2 or v3 contexts (say, if you kerchunk a grib file and then want to convert those references to an icechunk store).

martindurant · 2024-10-21T13:20:47Z

create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs

Is that a subclass in each case, specifying that it is to be bytes->bytes?

mpiannucci · 2024-10-21T13:22:52Z

Sorry, the numcodecs versions exist as they are today. Yes, the v3 versions would basically subclass zarr.abc.codec, using the numcodecs implementations of the kerchunk codecs.

Although the grib decoder should really be bytes-to-array, I think.

mpiannucci · 2024-10-23T12:52:00Z

This is required upstream: fsspec/filesystem_spec#1734

mpiannucci · 2024-10-23T13:30:22Z

Also of note: to get this to work with zarr 3, we pass an fsspec ReferenceFileSystem to a zarr RemoteStore. This works fine when the data files live on remote filesystems that have async implementations (s3, http, etc.), but it does not work when the data files are on a filesystem with only a sync implementation (local, etc.). The tests depend heavily on the local filesystem, and I'm sure many other users do as well, so we need to figure out how sync filesystems can work with the async requirement of the zarr RemoteStore.
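
One possible direction for the sync/async mismatch is to run the blocking calls in a worker thread. The sketch below is an assumption about the general approach, not the fsspec or zarr implementation: `SyncLocalFS` and `AsyncWrapper` are hypothetical stand-ins, and only the `_cat_file` naming follows fsspec's convention of underscore-prefixed coroutine methods.

```python
import asyncio
import os
import tempfile


class SyncLocalFS:
    """Stand-in for a filesystem that only offers blocking methods."""

    def cat_file(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()


class AsyncWrapper:
    """Hypothetical wrapper exposing a coroutine over a sync filesystem."""

    def __init__(self, sync_fs):
        self.sync_fs = sync_fs

    async def _cat_file(self, path: str) -> bytes:
        # Run the blocking read in a thread so the event loop stays free.
        return await asyncio.to_thread(self.sync_fs.cat_file, path)


# Usage: write a small local file, then read it through the async interface.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"chunk-bytes")
try:
    data = asyncio.run(AsyncWrapper(SyncLocalFS())._cat_file(tmp.name))
finally:
    os.remove(tmp.name)
```

Threads add overhead per call, so this trades some performance for compatibility with local files.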

mpiannucci · 2024-10-23T14:18:24Z

Grib works as long as it is read in zarr 2 format.

To be used with zarr 3, the codec needs to be ported over to the new abc.Codec class from zarr.

implement zarr3 support for grib

mpiannucci · 2024-11-26T20:37:55Z

State of the Union post merge with @moradology and installing latest fsspec and zarr 3

pip install git+https://github.com/fsspec/filesystem_spec

| Test file | Result | Notes |
| --- | --- | --- |
| `test_grib.py` | ✅ | datatree not tested |
| `test_netcdf.py` | 3 failures | tests depend on memory, which doesn't work with reference fs |
| `test_hdf.py` | 5 failures | Remote refs work fine |
| `test_df.py` | ✅ | |
| `test_utils.py` | 2 failures | |
| `test_fits.py` | 5 failures | `TypeError: Item key has incorrect type (expected slice, got int)` |
| `test_zarr.py` | 2 failures | |
| `test_combine.py` | All failed | |
| `test_combine_concat.py` | 8 failures | |
| `test_combine_dask.py` | 3 failures | |
| `test_xarray_backend.py` | 1 failure | `TypeError: Item key has incorrect type (expected slice, got int)` |

A lot of these are the same errors over and over again. The hardest part will be maintaining compat with the zarr-python 2 library, which I am not sure should be a goal of this PR, @martindurant.

maxrjones · 2025-01-11T13:55:21Z

Thanks for working on this compatibility! I really appreciate all the effort you've put into this already. My understanding is that there's still a bit of work to do. Even though we all know that Matt's super trustworthy, some concerns were raised in the VirtualiZarr coordination meeting yesterday: the dependency conflicts between icechunk, virtualizarr, zarr, and kerchunk seem more urgent with Zarr v3.0 out, and it is sketchy to recommend installing a library from a branch off a fork. I wanted to float the idea of either moving Matt's PR to a branch within fsspec/kerchunk or adding an fsspec/kerchunk@zarr-v3 branch to mitigate these concerns until zarr-python v3 compatibility is completed. What do y'all think?

moradology · 2025-01-13T18:47:16Z

Makes sense to me. Perhaps a similar target branch should be added across the impacted repos (assuming that doesn't increase the cognitive load too much)

martindurant · 2025-01-13T19:13:45Z

It is still the case that kerchunk can be used to generate v2 output with zarr-3 installed, no?
Currently, the pyproject requirement has zarr<3, so in practice, kerchunkers should not be using the newly-released one, but the output ought to still be readable by it.

Did I misunderstand something? Please invite me to the coordination meeting, so we can decide who does what and when.

martindurant · 2025-01-13T19:27:59Z

I see I am wrong, and this branch cannot be expected to work with v2 zarr. So the plan is for all future kerchunking to happen in v3 (but perhaps still produce v2 output, so that people who don't upgrade can still use it)?

mpiannucci · 2025-01-13T19:37:55Z

That is correct

maxrjones · 2025-01-13T20:17:51Z

Please invite me to the coordination meeting, so we can decide who does what and when.

@martindurant the coordination meeting info is at https://zarr.dev/community-calls/. The event is hosted on Zarr's community calendar rather than a regular calendar invite so it's not tied to any single individual/institution.

martindurant · 2025-01-14T14:18:05Z

OK, so I can work on this if you have no time, @mpiannucci. I will go with the plan that kerchunk will require zarr>=3 but produce v2 output (because we don't actually use any new functionality), and for combining and zarr-to-zarr, the input has to be v2 format.
zarr>=3 also puts a condition on the minimum Python version, I think >=3.11.

mpiannucci · 2025-01-14T14:19:45Z

That sounds great, @martindurant. I do not have the capacity to code on it at the moment, but I can help with any questions, or if you get stuck, to keep things moving.

mpiannucci added 10 commits October 4, 2024 16:37

Save progress for next week

39722e7

Bump zarr python version

d3c7e37

Get some tests working others failing

25d7d14

get through single hdf to zarr

ffe5f9d

Save progress

5aef233

Cleanup, almost working with hdf

b9323d2

Closer...

0f17119

Updating tests

5c8806b

reorganize

80fedcd

Save progress

1f69a0b

martindurant reviewed Oct 10, 2024

kerchunk/tests/test_hdf.py (outdated)

mpiannucci added 5 commits October 10, 2024 15:30

Refactor to clean things up

d556e52

Fix circular import

b27e64c

Iterate

41d6e8e

Change zarr dep

7ade1a6

More conversion

492ddee

mpiannucci mentioned this pull request Oct 12, 2024

Virtual Dataset Workflow Tracking Issue earth-mover/icechunk#197

Open

5 tasks

Specify zarr version

6e5741c

TomNicholas mentioned this pull request Oct 17, 2024

Make kerchunk dependency entirely optional zarr-developers/VirtualiZarr#258

Closed

mpiannucci added 3 commits October 23, 2024 09:31

Working remote hdf tests

c0316ac

Working grib impl

59bd36c

Add back commented out code

187ced2

moradology and others added 2 commits November 21, 2024 16:24

Use zarr3 stores directly; avoid use of internal fs

3757199

Merge pull request #4 from moradology/fix/zarr3-grib-tests

9444ff8

implement zarr3 support for grib

mpiannucci added 12 commits November 26, 2024 16:25

Forward

d8848ce

More

1fa294e

Figure out async wrapper

543178d

Closer on hdf5

96b56cd

netcdf but failing

0808b05

grib passing

aef006e

Fix inline test

d9bf0dd

More

884fc68

standardize compressor name

1145f45

Fix one more hdf test

94ec479

Small tweaks

a9693d1

Hide fsspec import where necessary

7e9112a

maxrjones mentioned this pull request Nov 29, 2024

Dependency Issue for Kerchunk -> Icechunk via Virtualizarr zarr-developers/VirtualiZarr#321

Open

This was referenced Dec 4, 2024

Wrap sync fs for xarray.to_zarr zarr-developers/zarr-python#2533

Open

Fix most combine tests mpiannucci/kerchunk#5

Open

maxrjones mentioned this pull request Jan 14, 2025

Vendor kerchunk readers? zarr-developers/VirtualiZarr#377

Open

martindurant added 3 commits January 16, 2025 09:52

Update with many fixes - but stioll not complete

a7af691

Merge branch 'main' into v3

f7b87de

min python

95f340f
