Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache pre-existing Zarr arrays in Zarr backend #9861

Open
wants to merge 47 commits into
base: main
Choose a base branch
from

Conversation

d-v-b
Copy link

@d-v-b d-v-b commented Dec 6, 2024

ZarrStore calls array_keys on the zarr.Group it wraps multiple times. Each of these array_keys calls performs IO, but if we model the names of the arrays as static, then they can be cached, then we can do less IO.

Accordingly this PR adds a new method to ZarrStore (get_array_keys), a new keyword argument for ZarrStore.__init__ (cache_array_keys), and a new attribute to ZarrStore (_cache). get_array_keys gets the array keys from the .zarr_group instance contained in the ZarrStore, either by reading the keys from the cache (if caching is enabled and the cache is populated), or by directly invoking ZarrStore.zarr_group.array_keys() (if caching is disabled, or the cache is empty).

Caching is controlled by the cache_array_keys keyword argument in ZarrStore.__init__.

  • If ZarrStore.__init__ is called with with cache_array_keys=True, the _cache attribute of the ZarrStore is set to {"array_keys": None}. On the first invocation of get_array_keys, this empty-but-active cache is filled with a call to zarr_group.array_keys, and that value is returned. On subsequent invocations of get_array_keys, the value from the cache is retrieved and returned.
  • If ZarrStore.__init__ is called with cache_array_keys=False, then the _cache attribute never gets a array_keys key inserted, and this indicates to get_array_keys that the cache is not being used. Calls to get_array_keys will return the result of zarr_group.array_keys.

Let me know if the _cache thing is a bit baroque -- my initial effort added a _cache_array_keys attribute as well as the cache, but I figured the state of the cache itself can convey whether the cache is meant to be used, which removes the need for a "use the cache or not" attribute on ZarrStore.

And I would like some feedback on the test -- is it in the right place?

closes #9853

Copy link

welcome bot commented Dec 6, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@dcherian
Copy link
Contributor

dcherian commented Dec 6, 2024

If you swtich the equality to != we should see a decrease in list_prefix:

assert summary[k] <= expected[k], (k, summary)

xarray/backends/zarr.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Author

d-v-b commented Dec 6, 2024

@dcherian I have implemented caching both array keys and arrays, but for that reason this test is failing:

def test_write_store(self) -> None:
expected = create_test_data()
with self.create_store() as store:
expected.dump_to_store(store)
# we need to cf decode the store because it has time and
# non-dimension coordinates
with xr.decode_cf(store) as actual:
assert_allclose(expected, actual)

This test assumes that the ZarrStore will track changes to the underlying Zarr data which makes caching tricky. Does re-using the ZarrStore here reflect real-world usage, or is it just for testing convenience? and thus should I modify this test to create a new ZarrStore, or explicitly turn off the cache for this test?

@dcherian
Copy link
Contributor

dcherian commented Dec 6, 2024

oh wow, I did not even know that there was a Dataset.dump_to_store method!

This code is from 10 years ago and I don't think anyone is using it like this. Today this test would be

with self.roundtrip(expected) as actual:
    assert_identical(actual, expected)

cc @shoyer

xarray/backends/zarr.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Author

d-v-b commented Dec 10, 2024

oh wow, I did not even know that there was a Dataset.dump_to_store method!

This code is from 10 years ago and I don't think anyone is using it like this. Today this test would be

with self.roundtrip(expected) as actual:
    assert_identical(actual, expected)

cc @shoyer

I altered the test as you recommended, let me know if that change should be in a separate PR

xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
@d-v-b d-v-b force-pushed the perf/cache-array-keys branch 2 times, most recently from a7602f6 to 9d147b9 Compare December 10, 2024 16:57
@dcherian dcherian changed the title perf/cache array keys Cache pre-existing Zarr arrays in Zarr backend Dec 12, 2024
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
xarray/backends/zarr.py Outdated Show resolved Hide resolved
yield group

def test_write_store(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be over-ridden still if the default is to turn the cache off

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, because we need to explicitly ensure that the cache is off, otherwise tests will fail. If we rely on the default behavior (which is defined very far away from this test), and if that default value changes, then tests will fail here and it will not be clear why.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way to remove this code would be to change the tests so that they no longer rely on the old ZarrStore behavior; maybe this could be a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push a commit. The create_store on this class seems pointless.

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Clearly its pretty effective. I left some minor suggestions.

@dcherian dcherian requested a review from jhamman December 12, 2024 03:37
d-v-b and others added 4 commits December 14, 2024 12:26
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Deepak Cherian <[email protected]>
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @d-v-b. Left a comment on the default value.

Also, please add a what's new note in the docs.

@@ -633,6 +635,7 @@ def open_store(
zarr_format=None,
use_zarr_fill_value_as_mask=None,
write_empty: bool | None = None,
cache_members: bool = False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about changing the default to True? I'd like to know if we think there is a downside (risk, performance, etc.) to defaulting to this behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some weird tests that use open_store and dump_to_store that can't handle that default.

For nearly all users, the effective default is the value set in open_zarr and to_zarr (which is True, or should be after that one suggestion is merge)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm up for it! the main risk is whether anyone is actively using ZarrStore instances explicitly, with the assumption that they dynamically track the state of the underlying zarr Group. if nobody is doing that, then there's no risk. ofc I have no way of know if this use case is common, but my impression was that ZarrStore is internal enough that users are not expected to be using it directly.

xarray/backends/zarr.py Outdated Show resolved Hide resolved
@d-v-b
Copy link
Author

d-v-b commented Dec 17, 2024

CI is failing on mypy due to the following:

 Found 5 errors in 1 file (checked 170 source files)
xarray/tests/test_plot.py:98: error: Unused "type: ignore" comment  [unused-ignore]
xarray/tests/test_plot.py:105: error: Unused "type: ignore" comment  [unused-ignore]
xarray/tests/test_plot.py:113: error: Unused "type: ignore" comment  [unused-ignore]
xarray/tests/test_plot.py:125: error: Unused "type: ignore" comment  [unused-ignore]

I don't think that's from this PR?

@d-v-b
Copy link
Author

d-v-b commented Dec 17, 2024

Looks great @d-v-b. Left a comment on the default value.

Also, please add a what's new note in the docs.

"what's new" has been added and the default value is now True. CI is still red due to some mypy issues that I think are unrelated to my pr (see #9861 (comment))

@dcherian dcherian added the plan to merge Final call for comments label Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reducing IO in ZarrStore
3 participants