Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly closes zarr groups in zarr store #8425

Merged
merged 9 commits into from
Dec 1, 2023

Conversation

CarlAndersson
Copy link
Contributor

@CarlAndersson CarlAndersson commented Nov 8, 2023

If the zarr backend is using a store which needs closing, e.g., a zip store, it should be closed with the xarray zarrstore backend.
This is occurred when i tried to save a dataset to a zarr file while giving the .zip file extension.

I have not done extensive tests on this small piece of code, so it might not cover all use cases.
Let me know if I need to add tests or more documentation!

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst

If the zarr backend is using a store which needs closing, e.g., a zip store, it should be closed with the xarray zarrstore backend.
Copy link

welcome bot commented Nov 8, 2023

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added topic-backends topic-zarr Related to zarr storage library io labels Nov 8, 2023
@max-sixty
Copy link
Collaborator

Thanks for the PR @CarlAndersson !

I notice that a bunch of tests are failing. What's the motivating case for where the existing code was failing?

@CarlAndersson
Copy link
Contributor Author

The code I had that was failing can be replicated as

original_da = xr.DataArray(np.arange(12).reshape((3, 4)))
original_da.to_zarr('tmp.zarr.zip', mode='w')
loaded_data = xr.load_dataarray('tmp.zarr.zip', engine='zarr')

which fails with a BadZipFile: File is not a zip file.

The fix was to add a .zarr_group.store.close() call on the ZarrStore object returned from to_zarr:

original_da = xr.DataArray(np.arange(12).reshape((3, 4)))
original_da.to_zarr('tmp.zarr.zip', mode='w').zarr_group.store.close()
loaded_data = xr.load_dataarray('tmp.zarr.zip', engine='zarr')

In my mind, calling close on the ZarrStore should close the underlying containers as well, if they need closing. This would mean that the quick way of saving data to a zipped zarr file would be

original_da = xr.DataArray(np.arange(12).reshape((3, 4)))
original_da.to_zarr('tmp.zarr.zip', mode='w').close()
loaded_data = xr.load_dataarray('tmp.zarr.zip', engine='zarr')

This seems to fail since that the default compute=True in xarray.backends.api.to_zarr calls _finalize_store which goes on to "close" the store. Perhaps my understanding of the purpose of the ZarrStore.close function does not line up with the intended one, but I find the existing behavior surprising, and somewhat surprising that there is a close called on an object just before returning it (implicitly here).

I haven't looked at the implementations of close on all the other backends, so I might just miss what the function is supposed to do.

@max-sixty
Copy link
Collaborator

Thanks for the explanation. I don't know this code that well, so I'm still a bit confused, sorry.

To the extent the existing code closes the store after writing, why is the zip file in a bad state? To the extent it doesn't close the store, why does the proposed code generate errors about closing things that are already closed?

Or are there two entities, and the existing code only closes one?

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @CarlAndersson for looking into this!

Would love to see some tests toward this change as well.

@@ -751,7 +751,7 @@ def set_variables(self, variables, check_encoding_set, writer, unlimited_dims=No
writer.add(v.data, zarr_array, region)

def close(self):
pass
self.zarr_group.store.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We likely want to do this here only when the Zarr backend was responsible for opening the ZipStore. You may need to add some logic/state in open_group to manage this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, yes. I added some logic for it and a test case that would have failed before this change.

Adds some logic to make sure to only close zarr stores when they were
created as part of the `open_group` method on `ZarrStore`.
@CarlAndersson
Copy link
Contributor Author

This should be working now. There's a test failing, but I'm fairly sure that it's got nothing to do with my changes - it's not using zarr at all as far as I can see.

@keewis
Copy link
Collaborator

keewis commented Nov 17, 2023

yeah, ignore that one, it's flaky

@CarlAndersson
Copy link
Contributor Author

@jhamman @max-sixty Is there anything additional that I should do here? I'm not so sure about the process...

@max-sixty max-sixty added the plan to merge Final call for comments label Nov 30, 2023
Copy link
Member

@andersy005 andersy005 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for this contribution, @CarlAndersson

@andersy005 andersy005 enabled auto-merge (squash) December 1, 2023 01:58
@andersy005 andersy005 merged commit b313ffc into pydata:main Dec 1, 2023
Copy link

welcome bot commented Dec 1, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@max-sixty
Copy link
Collaborator

Thanks a lot @CarlAndersson !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io plan to merge Final call for comments topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants