Compatibility for zarr-python 3.x #9552

TomAugspurger · 2024-09-27T01:30:26Z

This PR begins the process of adding compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.

All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. ~~zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.~~

I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.

Closes Zarr Python 3 tracking issue #9515
Closes Is _FillValue really the same as zarr's fill_value? #5475
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

xarray/tests/test_backends.py

TomAugspurger

This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).

I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x.
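As a rough illustration of the parametrization idea, the sketch below derives the list of Zarr formats to exercise from the installed zarr-python major version. `installed_zarr_major` and `formats_to_test` are hypothetical names, not xarray helpers, and the mapping (2.x covers only format 2, 3.x covers both) is an assumption based on the description in this thread.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def installed_zarr_major() -> int | None:
    """Return the major version of the installed zarr package, or None if absent."""
    try:
        return int(version("zarr").split(".")[0])
    except PackageNotFoundError:
        return None


def formats_to_test(zarr_major: int | None) -> list[int]:
    """Hypothetical helper: which on-disk formats a test run should cover."""
    if zarr_major is None:
        return []  # zarr is not installed, nothing to parametrize over
    # Assumption from the thread: zarr-python 3.x reads/writes formats 2 and 3,
    # while zarr-python 2.x is only exercised with format 2.
    return [2, 3] if zarr_major >= 3 else [2]


# Possible use with pytest (assumed, not taken from the PR):
# @pytest.mark.parametrize("zarr_format", formats_to_test(installed_zarr_major()))
# def test_roundtrip(zarr_format): ...
```

Keeping the version lookup separate from the mapping means the mapping can be checked without zarr installed.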

xarray/backends/zarr.py

TomAugspurger · 2024-09-30T14:10:59Z

xarray/backends/zarr.py

+
+        if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
+            encoding["codec_pipeline"] = [
+                x.to_dict() for x in zarr_array.metadata.codecs


Maybe this instead?

Suggested change

-                x.to_dict() for x in zarr_array.metadata.codecs
+                zarr_array.metadata.to_dict()["codecs"]

A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?
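The trade-off in that suggestion can be sketched with stand-in objects. `StandInCodec` and `StandInMetadata` are hypothetical classes written for this example only; they do not reproduce zarr-python's metadata classes, and the codec names and configuration values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class StandInCodec:
    """Hypothetical stand-in for a codec object exposing to_dict()."""

    name: str
    configuration: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {"name": self.name, "configuration": dict(self.configuration)}


@dataclass
class StandInMetadata:
    """Hypothetical stand-in for array metadata that owns its serialization."""

    codecs: list

    def to_dict(self) -> dict:
        # The metadata object decides how its codecs are serialized.
        return {"zarr_format": 3, "codecs": [c.to_dict() for c in self.codecs]}


metadata = StandInMetadata(
    codecs=[
        StandInCodec("bytes", {"endian": "little"}),
        StandInCodec("zstd", {"level": 0}),
    ]
)

# Option 1: the caller serializes each codec itself.
per_codec = [c.to_dict() for c in metadata.codecs]

# Option 2: serialize the whole metadata document, then keep only "codecs".
via_metadata = metadata.to_dict()["codecs"]
```

In this toy the two results agree by construction. The second form does more work, but it leaves any special-casing of codec serialization to the metadata object rather than to the caller.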

jhamman

Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).

TomAugspurger · 2024-10-15T13:51:44Z

I just pushed a commit reverting the changes to avoid values equal to the fill_value in the test cases, which aren't needed after Ryan's changes to fill value handling.

I think this is ready to go once CI finishes. I expect upstream-ci to fail on the xarray/tests/test_groupby.py::test_gappy_resample_reductions tests, but those should be unrelated.

TomAugspurger · 2024-10-15T14:08:19Z

There's one typing failure we might want to address:

xarray/tests/test_backends_datatree.py: note: In member "test_to_zarr_zip_store" of class "TestZarrDatatreeIO":
xarray/tests/test_backends_datatree.py:289: error: Argument 1 to "open_datatree" has incompatible type "ZipStore"; expected "str | PathLike[Any] | BufferedIOBase | AbstractDataStore"  [arg-type]

I'll do some reading about how best to handle type annotations when the proper type depends on the version of a dependency.

Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

TomNicholas · 2024-10-16T00:25:48Z

> Edit: a complication here is that this is in open_datatree, which I think supports other backends like NetCDF too. It's not immediately clear to me whether the signature of open_datatree needs to match for every implementation. This can maybe be addressed once we fully support datatree with zarr-python 3?

I don't see why the typing of open_datatree(<zarr-store>) would need to be any different to that of open_dataset(<zarr-store>). But I think it's fine to ignore this for now as we know we need to come back to it in another PR anyway.

TomAugspurger · 2024-10-16T14:26:51Z

Good catch, this affects both.

I was hoping something like this would work:

from pathlib import Path

try:
    from zarr.storage import StoreLike as _StoreLike

except ImportError:
    _StoreLike = str | Path

StoreLike = type[_StoreLike]


def f(x: StoreLike) -> StoreLike:
    return x

but mypy doesn't like that

test.py:7: error: Cannot assign multiple types to name "_StoreLike" without an explicit "Type[...]" annotation  [misc]
Found 1 error in 1 file (checked 1 source file)
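One pattern sometimes used for version-dependent aliases is to give static checkers a single definition under `typing.TYPE_CHECKING` and keep the fallback for runtime only. The sketch below is untested against xarray's mypy configuration and assumes zarr-python 3.x is installed wherever the type checker runs; the `Union[str, Path]` fallback mirrors the snippet above.

```python
from __future__ import annotations

from pathlib import Path
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Static checkers take this branch, so they see exactly one definition.
    # Assumption: zarr-python 3.x is available in the type-checking environment.
    from zarr.storage import StoreLike
else:
    try:
        from zarr.storage import StoreLike
    except ImportError:
        # Runtime fallback when zarr is missing or does not export StoreLike.
        StoreLike = Union[str, Path]


def f(x: StoreLike) -> StoreLike:
    # Annotations are not evaluated at runtime thanks to the __future__ import.
    return x
```

This does not by itself answer the question of what the annotation should be when the type-checking environment has zarr-python 2.x.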

jhamman · 2024-10-18T16:21:04Z

> but mypy doesn't like that

My 2 cents: we should not get hung up on this right now. a) There are plenty of other failures in the upstream-dev-mypy check unrelated to this PR, and b) it's probably not worth hacking something in here when there are bigger issues with the upstream zarr implementation to sort out.

dcherian

Thanks @TomAugspurger et al. This looks good. I have some minor comments, which I can address later today.

xarray/backends/zarr.py

dcherian · 2024-10-21T17:28:16Z

xarray/backends/zarr.py

-            zarr.consolidate_metadata(self.zarr_group.store)
+            kwargs = {}
+            if _zarr_v3():
+                # https://github.com/zarr-developers/zarr-python/pull/2113#issuecomment-2386718323


Can this be removed at some point in the future? If so, it would be good to add a TODO.

TomAugspurger · 2024-10-23

I'll look more closely later, but for now I think this will be required, following a deliberate change in zarr v3 consolidated metadata.

With v2 metadata, I think that consolidation happened at the store level, and was all-or-nothing. If you have two Groups with Arrays, the consolidated metadata will be placed at the store root, and will contain everything:

# zarr v2
In [1]: import json, xarray as xr

In [2]: store = {}

In [3]: a = xr.tutorial.load_dataset("air_temperature")

In [4]: b = xr.tutorial.load_dataset("rasm")

In [5]: a.to_zarr(store=store, group="A")
/Users/tom/gh/zarr-developers/zarr-v2/.direnv/python-3.10/lib/python3.10/site-packages/xarray/core/dataset.py:2562: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore[call-overload,misc]
Out[5]: <xarray.backends.zarr.ZarrStore at 0x11113edc0>

In [6]: b.to_zarr(store=store, group="B")
Out[6]: <xarray.backends.zarr.ZarrStore at 0x10cab2440>

In [7]: list(json.loads(store['.zmetadata'])['metadata'])
Out[7]:  # contains nodes from both A and B
['.zgroup',
 'A/.zattrs',
 'A/.zgroup',
 'A/air/.zarray',
 'A/air/.zattrs',
 'A/lat/.zarray',
 'A/lat/.zattrs',
 'A/lon/.zarray',
 'A/lon/.zattrs',
 'A/time/.zarray',
 'A/time/.zattrs',
 'B/.zattrs',
 'B/.zgroup',
 'B/Tair/.zarray',
 'B/Tair/.zattrs',
 'B/time/.zarray',
 'B/time/.zattrs',
 'B/xc/.zarray',
 'B/xc/.zattrs',
 'B/yc/.zarray',
 'B/yc/.zattrs']

With v3, consolidated metadata is scoped to a Group, so we can provide the group we want to consolidate (the zarr-python API does support "consolidate everything in the store at the root", but I don't think we want that because you'd need to open it at the root when reading, and I think it's kinda weird for ds.to_zarr(group="A") to be reading / writing stuff outside of the A prefix).
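To make the difference in scope concrete, the sketch below takes the key list from the v2 `.zmetadata` output above, groups it by top-level group, and then keeps only what lives under the `A/` prefix. It illustrates scope only and is not how zarr-python stores v3 consolidated metadata; `top_level_group` and `scoped_to` are hypothetical helpers.

```python
from collections import defaultdict

# Keys copied from the v2 `.zmetadata` listing in this thread.
v2_keys = [
    ".zgroup",
    "A/.zattrs", "A/.zgroup",
    "A/air/.zarray", "A/air/.zattrs",
    "A/lat/.zarray", "A/lat/.zattrs",
    "A/lon/.zarray", "A/lon/.zattrs",
    "A/time/.zarray", "A/time/.zattrs",
    "B/.zattrs", "B/.zgroup",
    "B/Tair/.zarray", "B/Tair/.zattrs",
    "B/time/.zarray", "B/time/.zattrs",
    "B/xc/.zarray", "B/xc/.zattrs",
    "B/yc/.zarray", "B/yc/.zattrs",
]


def top_level_group(key: str) -> str:
    """Return the first path component, or '<root>' for keys at the store root."""
    prefix, sep, _ = key.partition("/")
    return prefix if sep else "<root>"


def scoped_to(keys: list[str], group: str) -> list[str]:
    """Keep keys under `group/` and express them relative to that group."""
    prefix = group.rstrip("/") + "/"
    return [k[len(prefix):] for k in keys if k.startswith(prefix)]


# Store-level (v2-style) view: one document covering every group.
by_group = defaultdict(list)
for key in v2_keys:
    by_group[top_level_group(key)].append(key)

# Group-scoped view: only what a reader of group "A" would need.
a_only = scoped_to(v2_keys, "A")
```

With the group-scoped view, writing group "A" never needs to touch entries that belong to "B".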

shoyer · 2024-10-23

Potentially it would make sense to have two versions of consolidated metadata:

- Everything at a specific group/node level
- Everything in a group and all of its subgroups (i.e., for DataTree)

TomAugspurger · 2024-10-23

Agreed. zarr-developers/zarr-specs#309 has some discussion on adding a depth field to the spec for consolidated metadata. That's currently implicitly depth=None, which is everything below a group. depth=0 or 1 would be just the immediate children. That's not standardized or implemented anywhere yet, but the current implementation is forwards compatible and it shouldn't be a ton of effort.
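A depth-limited selection could look roughly like the sketch below. Nothing here is standardized: `select_by_depth` is a hypothetical function, the node paths are invented, and the convention that `depth=1` means "immediate children" is an assumption, since the thread leaves open whether that would be 0 or 1.

```python
from __future__ import annotations


def select_by_depth(paths: list[str], depth: int | None = None) -> list[str]:
    """Return node paths (relative to a group) at most `depth` levels below it.

    depth=None keeps everything below the group, matching the implicit
    behaviour described in the thread. depth=1 is assumed here to mean
    the group's immediate children.
    """
    if depth is None:
        return list(paths)
    return [p for p in paths if len(p.split("/")) <= depth]


# Invented node paths, relative to the group being consolidated.
nodes = ["air", "lat", "sub", "sub/Tair", "sub/deeper/xc"]

everything = select_by_depth(nodes)          # all nodes below the group
children_only = select_by_depth(nodes, 1)    # immediate children only
```

Because `depth=None` already means "everything", metadata written today would remain a valid instance of such a scheme, which is the forwards-compatibility point made above.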

* main:
  Fix multiple grouping with missing groups (pydata#9650)
  flox: Properly propagate multiindex (pydata#9649)
  Update Datatree html repr to indicate inheritance (pydata#9633)
  Re-implement map_over_datasets using group_subtrees (pydata#9636)
  fix zarr intersphinx (pydata#9652)
  Replace black and blackdoc with ruff-format (pydata#9506)
  Fix error and missing code cell in io.rst (pydata#9641)
  Support alternative names for the root node in DataTree.from_dict (pydata#9638)
  Updates to DataTree.equals and DataTree.identical (pydata#9627)
  DOC: Clarify error message in open_dataarray (pydata#9637)
  Add zip_subtrees for paired iteration over DataTrees (pydata#9623)
  Type check datatree tests (pydata#9632)
  Add missing `memo` argument to DataTree.__deepcopy__ (pydata#9631)
  Bug fixes for DataTree indexing and aggregation (pydata#9626)
  Add inherit=False option to DataTree.copy() (pydata#9628)
  docs(groupby): mention deprecation of `squeeze` kwarg (pydata#9625)
  Migration guide for users of old datatree repo (pydata#9598)
  Reimplement Datatree typed ops (pydata#9619)

dcherian · 2024-10-21T21:12:55Z

Let's get this in by the end of the week.

jhamman · 2024-10-23T17:24:30Z

👏 Thanks all! Especially @TomAugspurger for doing the lion's share of the work here.

* main: (85 commits)
  Refactor out utility functions from to_zarr (pydata#9695)
  Use the same function to floatize coords in polyfit and polyval (pydata#9691)
  Add `DataTree.persist` (pydata#9682)
  Typing annotations for arithmetic overrides (e.g., DataArray + Dataset) (pydata#9688)
  Raise `ValueError` for unmatching chunks length in `DataArray.chunk()` (pydata#9689)
  Fix inadvertent deep-copying of child data in DataTree (pydata#9684)
  new blank whatsnew (pydata#9679)
  v2024.10.0 release summary (pydata#9678)
  drop the length from `numpy`'s fixed-width string dtypes (pydata#9586)
  fixing behaviour for group parameter in `open_datatree` (pydata#9666)
  Use zarr v3 dimension_names (pydata#9669)
  fix(zarr): use inplace array.resize for zarr 2 and 3 (pydata#9673)
  implement `dask` methods on `DataTree` (pydata#9670)
  support `chunks` in `open_groups` and `open_datatree` (pydata#9660)
  Compatibility for zarr-python 3.x (pydata#9552)
  Update to_dataframe doc to match current behavior (pydata#9662)
  Reduce graph size through writing indexes directly into graph for ``map_blocks`` (pydata#9658)
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)
  Fix multiple grouping with missing groups (pydata#9650)
  ...

TomAugspurger commented Sep 27, 2024

TomAugspurger mentioned this pull request Sep 27, 2024

Zarr Python 3 tracking issue #9515

Closed

4 tasks

TomAugspurger added 4 commits September 30, 2024 08:53

Remove zarr pin

c54a052

Define zarr_v3 helper

483eb7f

zarr-v3: filters / compressors -> codecs

40a746c

zarr-v3: update tests to avoid values equal to fillValue

6c8d2bb

TomAugspurger force-pushed the fix/zarr-v3 branch 2 times, most recently from 1ed4ef1 to bb2bb6c Compare September 30, 2024 14:04

TomAugspurger commented Sep 30, 2024

TomAugspurger force-pushed the fix/zarr-v3 branch from 9f2cb2f to d11d593 Compare September 30, 2024 14:29

TomAugspurger added 6 commits September 30, 2024 11:43

Various test fixes

531b521

zarr_version fixes

849df40

* removed open_consolidated workarounds
* removed _store_version check
* pass through zarr_version

fixup! zarr-v3: filters / compressors -> codecs

88bd64b

fixup! fixup! zarr-v3: filters / compressors -> codecs

ef1549a

fixup

20c22bd

path / key normalization in set_variables

6087e5e

TomAugspurger force-pushed the fix/zarr-v3 branch from a324329 to 6087e5e Compare September 30, 2024 16:44

TomAugspurger added 11 commits October 1, 2024 11:55

fixes

15fe55e

workaround nested consolidated metadata

8e06bc7

Merge remote-tracking branch 'upstream/main' into fix/zarr-v3

f22100e

test: avoid fill_value

f8c427f

test: Adjust call counts

594d36d

zarr-python 3.x Array.resize doesn't mutate

046d37e

test compatibility

6b0ca62

- skip write_empty_chunks on 3.x
- update patch targets

skip ZipStore with_mode

d315583

test: more fill_value avoidance

389cc82

test: more fill_value avoidance

1fe409a

v3 compat for instrumented test

7c29ea6

jhamman reviewed Oct 2, 2024

mpiannucci mentioned this pull request Oct 3, 2024

Write virtual references to Icechunk earth-mover/VirtualiZarr#1

Closed

TomAugspurger added 4 commits October 15, 2024 07:30

Fixup

d752693

Merge remote-tracking branch 'upstream/main' into fix/zarr-v3

0fd4103

more cleanup

c2a47a1

revert test changes

26b2661

TomNicholas mentioned this pull request Oct 20, 2024

Add Zarr v3 dependency zarr-developers/VirtualiZarr#182

Draft

9 tasks

dcherian reviewed Oct 21, 2024

dcherian and others added 4 commits October 21, 2024 14:49

Update xarray/backends/zarr.py

1d73d36

cleanup

be79e88

update docstring

ff0f2c0

dcherian added the plan to merge Final call for comments label Oct 21, 2024

dcherian approved these changes Oct 21, 2024

dcherian added 3 commits October 22, 2024 07:13

fix rtd

268e3eb

Merge branch 'main' into fix/zarr-v3

1abb2ba

* main:
  Add close() method to DataTree and use it to clean-up open files in tests (pydata#9651)
  Change URL for pydap test (pydata#9655)

tweak

7682bf4

mpiannucci mentioned this pull request Oct 22, 2024

Virtual Dataset Workflow Tracking Issue earth-mover/icechunk#197

Open

5 tasks

TomNicholas mentioned this pull request Oct 23, 2024

Update dependencies for xarray, zarr-python, icechunk, kerchunk zarr-developers/VirtualiZarr#268

Open

4 tasks

dcherian merged commit b133fdc into pydata:main Oct 23, 2024
27 of 29 checks passed

TomAugspurger deleted the fix/zarr-v3 branch October 23, 2024 19:51

TomAugspurger restored the fix/zarr-v3 branch October 23, 2024 19:51

TomNicholas mentioned this pull request Nov 6, 2024

Open multiple groups (e.g. as DataTree) with zarr-python v3 #9733

Open
