zarr.group creates unexpected chain of .zgroups in the store when using the path= kwarg #1257

Open
ChiliJohnson opened this issue Nov 10, 2022 · 4 comments
Labels
bug Potential issues with the zarr-python library

Comments

@ChiliJohnson

Zarr version

v2.13.3

Numcodecs version

N/A

Python Version

3.9.15

Operating System

macOS

Installation

Pip + venv

Description

When using zarr.group with the path kwarg, Zarr registers .zgroups at every level from the root of the store down to the specified path. I'm fairly new to Zarr (so I may be missing something), but this feels unexpected to me: intuitively, zarr.group(…, path=…) should only register a .zgroup at the path specified.

Steps to reproduce

from pprint import pprint
import zarr

store = {
    "myKey/test.zarr/.zgroup": b'existing content!',
}

try:
    root = zarr.group(store=store, path="myKey/test.zarr", overwrite=False)
except Exception as exc:
    print(exc, "\n")

pprint(store)

Output:

path 'myKey/test.zarr' contains a group 

{'.zgroup': b'{\n    "zarr_format": 2\n}',
 'myKey/.zgroup': b'{\n    "zarr_format": 2\n}',
 'myKey/test.zarr/.zgroup': b'existing content!'}

There are two things to note here I think:

  1. Zarr successfully detects that there is already a group registered at that path in the store
  2. Despite this check, it still registers extra .zgroups from the root of the store down to the specified path

Additional output

No response

@ChiliJohnson ChiliJohnson added the bug Potential issues with the zarr-python library label Nov 10, 2022
@joshmoore
Member

In my (or perhaps Zarr's) mind, both myKey and test.zarr in your path are also groups. So what's happening is the call to zarr.group(..., overwrite=False) is saying, "ok, @ChiliJohnson wants there to be a total of three groups, but I'm not to overwrite any of them". So .zgroup and myKey/.zgroup get created (since they didn't exist), but then it reaches the lowest level, realizes that the existing group can't be overwritten, and throws.

One could possibly argue that nothing should get created if the action isn't possible, but for the moment, it's not atomic.
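
The sequence described above can be modelled roughly as follows. This is a simplified sketch on a plain dict, not zarr's actual implementation; group_keys and init_group_model are hypothetical names used only for illustration.

```python
# Simplified model of the behaviour described above; NOT zarr's real code.
# `group_keys` and `init_group_model` are hypothetical helper names.

GROUP_META = b'{\n    "zarr_format": 2\n}'


def group_keys(path):
    """Return the .zgroup key for the store root and for every level of `path`."""
    parts = [p for p in path.split("/") if p]
    keys = [".zgroup"]
    for i in range(1, len(parts) + 1):
        keys.append("/".join(parts[:i]) + "/.zgroup")
    return keys


def init_group_model(store, path, overwrite=False):
    """Create parent groups first, then fail at the target if it already exists."""
    *parents, target = group_keys(path)
    for key in parents:
        # Missing parent groups are written before the target is checked.
        store.setdefault(key, GROUP_META)
    if target in store and not overwrite:
        raise ValueError(f"path {path!r} contains a group")
    store[target] = GROUP_META


store = {"myKey/test.zarr/.zgroup": b"existing content!"}
try:
    init_group_model(store, "myKey/test.zarr")
except ValueError as exc:
    print(exc)

# The parents were written even though the call failed (no rollback).
print(sorted(store))
```

In this model, an atomic variant would check the target key before writing any parent, which is the alternative hinted at above.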

Looking at your string, it seems like you intend for myKey/test.zarr to actually be the store. Does that sound more like what you are trying to achieve?
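
As an illustration of treating myKey/test.zarr as the store itself, here is a minimal sketch of a mapping that exposes one key prefix as its own root. PrefixedStore is a hypothetical helper, not part of zarr, and the dict stands in for whatever key-value store is actually in use.

```python
from collections.abc import MutableMapping


class PrefixedStore(MutableMapping):
    """Hypothetical helper (not part of zarr): expose one key prefix of a
    backing mapping as if that prefix were the root of its own store."""

    def __init__(self, backing, prefix):
        self._backing = backing
        self._prefix = prefix.strip("/") + "/"

    def _full(self, key):
        return self._prefix + key

    def __getitem__(self, key):
        return self._backing[self._full(key)]

    def __setitem__(self, key, value):
        self._backing[self._full(key)] = value

    def __delitem__(self, key):
        del self._backing[self._full(key)]

    def __iter__(self):
        # Note: this still walks every key of the backing mapping.
        n = len(self._prefix)
        return (k[n:] for k in list(self._backing) if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)


bucket = {"myKey/test.zarr/.zgroup": b"existing content!"}
store = PrefixedStore(bucket, "myKey/test.zarr")

print(".zgroup" in store)  # membership is a direct lookup of the prefixed key
store[".zattrs"] = b"{}"   # lands under the prefix only
print(sorted(bucket))
```

With a store rooted like this, nothing is written at the bucket root or at myKey/. For an s3fs-backed store, the analogous step would presumably be to create the s3fs.S3Map with the dataset path as its root, at the cost of one map object per dataset.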

@ChiliJohnson
Author

Thanks for the write-up and clarification. Yes, in this case myKey/test.zarr should definitely be the store, and using it that way does make everything work as expected!

I wonder if there might be a benefit to having a small section in the official tutorial about the semantics of a store and how to avoid a pitfall like this? As someone new to Zarr, but not new to Python or cloud object stores, my original process as a Zarr noob went like this:

  • I have hundreds of Zarr groups inside this S3 bucket that I want to write to / read from within a web service
  • As an optimization, maybe I don't want to create an s3fs.S3Map object for each group
  • Maybe I should just make one s3fs.S3Map object for the bucket and use Zarr's nice path= kwarg
  • Why do I have so many extra .zgroups at every key prefix in my bucket?
  • Oh wait why are there tons of slow, extra AWS API calls each time I call open?

After digging into it I realized that zarr.group was actually calling __iter__ on the store which, in the case of s3fs, ended up listing the entire contents of this bucket multiple times over on each call to open.

I can do some thinking and maybe draft a PR for a section on store semantics in the official tutorial, but I think initially having a little bit of guidance like this could've been helpful as someone new:

  • When writing data, even at a specific sub-path, Zarr stores are by definition groups themselves. When writing an array or group at a sub-path, Zarr will create groups at every level from the store's root down to whatever specific array/group you're writing to.
  • When reading data, even at a specific sub-path of a store, Zarr will iterate over every object in the store. Potential pitfall: using a Zarr store at the "root" or "bucket" level of a cloud object store will result in the entire content of the bucket being listed when opening / writing Zarr objects.
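
One way to observe the listing behaviour described in the second point is to wrap the store in a mapping that counts how it is accessed. The sketch below is only a diagnostic idea: CountingStore is a hypothetical wrapper, not part of zarr, and a plain dict stands in for the bucket.

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical diagnostic wrapper (not part of zarr): counts how often
    the wrapped store is listed versus accessed by key."""

    def __init__(self, inner):
        self._inner = inner
        self.counts = {"getitem": 0, "setitem": 0, "delitem": 0,
                       "contains": 0, "iter": 0}

    def __getitem__(self, key):
        self.counts["getitem"] += 1
        return self._inner[key]

    def __setitem__(self, key, value):
        self.counts["setitem"] += 1
        self._inner[key] = value

    def __delitem__(self, key):
        self.counts["delitem"] += 1
        del self._inner[key]

    def __contains__(self, key):
        self.counts["contains"] += 1
        return key in self._inner

    def __iter__(self):
        # On an object store, each full iteration is a listing of every key.
        self.counts["iter"] += 1
        return iter(self._inner)

    def __len__(self):
        return len(self._inner)


store = CountingStore({"a/.zgroup": b"{}", "a/b/.zarray": b"{}", "c/.zgroup": b"{}"})

# A direct key lookup does not list the store.
found = "a/.zgroup" in store

# Finding children by scanning keys does list it.
children = sorted(k for k in store if k.startswith("a/"))

print(store.counts)
print(children)
```

Wrapping the real store this way before handing it to Zarr would be one way to see how many full listings a given call triggers.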

Thanks for taking the time to explain!

@joshmoore
Member

❤️ for additions where they would have helped you out.

@rabernat
Contributor

Thanks @ChiliJohnson - your issue has surfaced some extreme inefficiencies in how data is created in Zarr. IMO there are many unnecessary / duplicative checks such as...

After digging into it I realized that zarr.group was actually calling __iter__ on the store which, in the case of s3fs, ended up listing the entire contents of this bucket multiple times over on each call to open.

  • Oh wait why are there tons of slow, extra AWS API calls each time I call open?

This is similar to some of the diagnosis I did over in pangeo-data/pangeo-eosc#39 (comment) trying to understand why writing was slow.

I would be quite keen to rewrite some of this code to be more performant with cloud storage.

Another thing you might want to consider going forward is to switch to Zarr V3 as your format. I think that's where we should be focusing most of our effort going forward. (But it won't solve the core problem.)

3 participants