Zarr Python 3 tracking issue #9515

Closed · 3 of 4 tasks
jhamman opened this issue Sep 18, 2024 · 12 comments
Labels: topic-zarr (Related to zarr storage library), upstream issue

jhamman (Member) commented Sep 18, 2024

What is your issue?

Zarr-Python 3.0 is getting close to a full release. This issue tracks the integration of the 3.0 release with Xarray.

Here's a running list of issues we're solving upstream related to integration with Xarray:

Special shout out to @TomAugspurger, who has been front-running a lot of this 🙌.

TomAugspurger (Contributor) commented Sep 18, 2024

Testing this out

There are multiple active branches right now, but you can get a usable xarray + zarr-python 3.x with these two branches:

You can install these with:

pip install git+https://github.com/TomAugspurger/zarr-python@xarray-compat git+https://github.com/TomAugspurger/xarray/@fix/zarr-v3
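
A quick way to confirm the two branches are actually the ones in use (a minimal sketch; the store name is illustrative and the exact version strings depend on the branch state at install time):

import xarray as xr
import zarr

print("zarr:", zarr.__version__)     # expect a 3.x dev version from the branch
print("xarray:", xr.__version__)

# Round-trip a tiny dataset to confirm the zarr backend loads at all.
ds = xr.Dataset({"a": ("x", [1, 2, 3])})
ds.to_zarr("smoke-test.zarr", mode="w")
print(xr.open_zarr("smoke-test.zarr").load())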

Work Items

We require some changes in both zarr-python and xarray. I'm pushing the zarr ones to `TomAugspurger/zarr-python@xarray-compat` and the xarray ones to `TomAugspurger/xarray@fix/zarr-v3`.

zarr-python

xarray

Most of these are in my PR at #9552

Fixed issues

Things to investigate:

  • separate store / chunk_store
  • writing a subset of regions

dcherian (Contributor) commented
@TomAugspurger are you able to open a WIP PR with the in-progress work? It'd be nice to see what's needed.

TomAugspurger (Contributor) commented
Sure, #9552 has that.

TomAugspurger (Contributor) commented
Question for the group: does anyone object to xarray continuing to write Zarr V2 datasets by default? I hesitate to have xarray's default be different from zarr-python's, but that would relieve some pressure to address #5475 quickly, since v2 datasets should be round-trippable.
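
Whatever default we pick, callers can pin the on-disk format explicitly. A minimal sketch, assuming the zarr_format keyword used elsewhere in this thread (store paths are illustrative):

import xarray as xr

ds = xr.Dataset({"a": ("x", [1.0, 2.0, 3.0])})

# Pin the on-disk format rather than relying on whichever default we settle on.
ds.to_zarr("v2-store.zarr", zarr_format=2, mode="w")  # Zarr V2 layout
ds.to_zarr("v3-store.zarr", zarr_format=3, mode="w")  # Zarr V3 layout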

TomAugspurger (Contributor) commented
I think that support for reading Zarr V2 datasets with zarr-python v3 is close to being ready. I updated #9515 (comment) with some instructions on how to install the two branches, if anyone is able to test that out:

In [4]: xr.open_dataset("abfs://daymet-zarr/annual/hi.zarr", engine="zarr", storage_options={"account_name": "daymeteuwest"})
Out[4]:
<xarray.Dataset> Size: 137MB
Dimensions:                  (y: 584, x: 284, time: 41, nv: 2)
Coordinates:
    lat                      (y, x) float32 663kB ...
    lon                      (y, x) float32 663kB ...
  * time                     (time) datetime64[ns] 328B 1980-07-01T12:00:00 ....
  * x                        (x) float32 1kB -5.802e+06 ... -5.519e+06
  * y                        (y) float32 2kB -3.9e+04 -4e+04 ... -6.22e+05
...
    start_year:        1980

In [5]: xr.open_dataset("s3://cmip6-pds/CMIP6/ScenarioMIP/AS-RCEC/TaiESM1/ssp126/r1i1p1f1/Amon/clt/gn/v20201124", engine="zarr", storage_options={"anon": True})
Out[5]:
<xarray.Dataset> Size: 228MB
Dimensions:    (time: 1032, lat: 192, lon: 288, bnds: 2)
Coordinates:
  * lat        (lat) float64 2kB -90.0 -89.06 -88.12 -87.17 ... 88.12 89.06 90.0
...
    variant_label:             r1i1p1f1

maxrjones (Contributor) commented Oct 18, 2024

Thank you for all your work on Zarr V3 in Xarray! Super excited about the progress here 🎉

I've been using the https://github.com/pydata/xarray/tree/zarr-v3 branch to test out icechunk, and thought I'd share some odd behavior when writing data to S3, just in case it's not a known issue. I ran this code multiple times: the first time only the attributes were written, and the second time the col and row variables were also written, but the S variable never got written. All data variables were written when I used a local filesystem rather than object storage:

testv3_store = "s3://nasa-veda-scratch/test-weight-caching/scratch/test-v3.zarr"
s = np.ones(10, dtype='float64')
c = np.ones(10, dtype='int32')
r = np.ones(10, dtype='int32')
ds = xr.Dataset(
    data_vars=dict(
        S = (["n_s"], s),
        col = (["n_s"], c),
        row = (["n_s"], r)
    ),
    attrs=dict(n_in = 100, n_out = 200)
)
print(f"Original data vars: {ds.data_vars}")
ds.to_zarr(testv3_store, zarr_format=3, mode="w")
ds2 = xr.open_zarr(testv3_store, zarr_format=3).load()
print(f"Round tripped data vars: {ds2.data_vars}")
Original data vars: Data variables:
    S        (n_s) float64 80B 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    col      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
    row      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
Round tripped data vars: Data variables:
    col      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1
    row      (n_s) int32 40B 1 1 1 1 1 1 1 1 1 1

P.S. This issue didn't seem to happen when I was writing to / reading from an icechunk store.

TomAugspurger (Contributor) commented
Thanks for testing it out and reporting that issue. I'll see if I can reproduce it.

jhamman (Member, Author) commented Oct 19, 2024

This sounds like an issue with the RemoteStore in zarr 3. I wonder if something bad is happening because S is a single-character variable name.
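
A hedged way to test that theory, reusing ds and xr from the snippet in the report above (the renamed store path is illustrative only):

store = "s3://nasa-veda-scratch/test-weight-caching/scratch/test-v3-renamed.zarr"
ds.rename({"S": "S_long"}).to_zarr(store, zarr_format=3, mode="w")
# If the long name survives the round trip but "S" does not, the name length
# is implicated; otherwise the problem lies elsewhere (e.g. the store layer).
print(list(xr.open_zarr(store, zarr_format=3).data_vars))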

TomAugspurger (Contributor) commented
I'm able to reproduce with the moto test server used in the zarr-python tests. It seems like all the files were written:

(Pdb) pp store.fs.glob("test/**/*")
['test/test-weight-caching/scratch/test-v3.zarr',
 'test/test-weight-caching/scratch/test-v3.zarr/S',
 'test/test-weight-caching/scratch/test-v3.zarr/S/c',
 'test/test-weight-caching/scratch/test-v3.zarr/S/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/S/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/col',
 'test/test-weight-caching/scratch/test-v3.zarr/col/c',
 'test/test-weight-caching/scratch/test-v3.zarr/col/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/col/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/row',
 'test/test-weight-caching/scratch/test-v3.zarr/row/c',
 'test/test-weight-caching/scratch/test-v3.zarr/row/c/0',
 'test/test-weight-caching/scratch/test-v3.zarr/row/zarr.json',
 'test/test-weight-caching/scratch/test-v3.zarr/zarr.json']

I think the array metadata is correct:

(Pdb) pp json.loads(store.fs.cat("test/test-weight-caching/scratch/test-v3.zarr/S/zarr.json"))
{'attributes': {'_ARRAY_DIMENSIONS': ['n_s'], '_FillValue': 'AAAAAAAA+H8='},
 'chunk_grid': {'configuration': {'chunk_shape': [10]}, 'name': 'regular'},
 'chunk_key_encoding': {'configuration': {'separator': '/'}, 'name': 'default'},
 'codecs': [{'configuration': {'endian': 'little'}, 'name': 'bytes'}],
 'data_type': 'float64',
 'fill_value': 0.0,
 'node_type': 'array',
 'shape': [10],
 'storage_transformers': [],
 'zarr_format': 3}

Ah, the consolidated metadata is missing though?

(Pdb) pp json.loads(store.fs.cat("test/test-weight-caching/scratch/test-v3.zarr/zarr.json"))
{'attributes': {'n_in': 100, 'n_out': 200},
 'consolidated_metadata': {'kind': 'inline',
                           'metadata': {},
                           'must_understand': False},
 'node_type': 'group',
 'zarr_format': 3}

Some print statements show that consolidated_metadata is being called and we are writing the output to zarr.json, but it fails to discover the metadata under the group. I guess that's shown in the output above, since we have metadata: {} rather than metadata: None.

So right now it looks like an issue with Group.members using a remote store. I'll keep looking.
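
In the meantime, a hedged way to check whether a second consolidation pass picks up the arrays once they all exist, using zarr-python's public API (store path reused from the report above; assumes zarr.consolidate_metadata and Group.members behave as on the current 3.x branch):

import zarr

store = "s3://nasa-veda-scratch/test-weight-caching/scratch/test-v3.zarr"

# Re-consolidate after all arrays have been written, then list the group
# members; if S, col, and row show up now, the failure is in the discovery
# step during the original write rather than in the array metadata itself.
zarr.consolidate_metadata(store)
print(sorted(name for name, _ in zarr.open_group(store, mode="r").members()))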

TomAugspurger (Contributor) commented Oct 19, 2024

@maxrjones when you get a chance, can you try with storage_options={"use_listings_cache": False} in the to_zarr and open_zarr calls? That seems to fix it for me locally.
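
A sketch of that workaround applied to the snippet from the earlier report (reusing ds, xr, and the store path from above):

store = "s3://nasa-veda-scratch/test-weight-caching/scratch/test-v3.zarr"
opts = {"use_listings_cache": False}  # bypass fsspec's directory-listing cache

ds.to_zarr(store, zarr_format=3, mode="w", storage_options=opts)
ds2 = xr.open_zarr(store, zarr_format=3, storage_options=opts).load()
print(list(ds2.data_vars))  # expect S, col, and row to all come back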


Assuming that is the issue, the problem is fsspec's dircache, which maintains a cache of directory listings. We must have previously listed the store directory and found just the zarr.json. After we've written the arrays, when we go to consolidate the metadata and list the directory again, we get a stale read from the directory cache. We'll need to find a way to handle this in zarr-python. IMO, fsspec should at least be able to invalidate a cache when it's the one doing the mutating (by writing / deleting objects).

Edit: it looks like s3fs does invalidate its dircache at https://github.com/fsspec/s3fs/blob/dd75a1a076d176049ff1f3d9f616f1a724f794df/s3fs/core.py#L1173. zarr's RemoteStore.set uses pipe_file, but it seems like the early return here might break that. I'll investigate and open an issue on s3fs.

Reported at fsspec/s3fs#903.

So for now you'll need the cache workaround (disabling or invalidating the listings cache). I don't think that zarr should implement any workarounds itself...

jhamman (Member, Author) commented Oct 19, 2024

Thanks for digging into this @TomAugspurger - @rabernat and I hit something similar a week ago. See zarr-developers/zarr-python#2348 for a zarr-side attempt at working around the fsspec cache issues.

maxrjones (Contributor) commented
> @maxrjones when you get a chance, can you try with storage_options={"use_listings_cache": False} in the to_zarr and open_zarr calls? That seems to fix it for me locally.

Yes, adding those storage options to the to_zarr call fixes the issue; it doesn't seem to be necessary on the open_zarr call. That also explains why it wasn't an issue when using icechunk, since icechunk doesn't rely on fsspec AFAIK. Thanks @TomAugspurger!
