`zarr.group` creates unexpected chain of `.zgroup`s in the store when using the `path=` kwarg #1257

ChiliJohnson · 2022-11-10T19:58:49Z

Zarr version

v2.13.3

Numcodecs version

N/A

Python Version

3.9.15

Operating System

macOS

Installation

Pip + venv

Description

When using zarr.group with the path kwarg, Zarr still registers .zgroups at the root of the store and at every level down to the specified path. I'm fairly new to Zarr (so I may be missing something), but this feels unexpected to me. Intuitively, zarr.group(…, path=…) should only register a .zgroup at the path specified.

Steps to reproduce

from pprint import pprint
import zarr

store = {
    "myKey/test.zarr/.zgroup": b'existing content!',
}

try:
    root = zarr.group(store=store, path="myKey/test.zarr", overwrite=False)
except Exception as exc:
    print(exc, "\n")

pprint(store)

Output:

path 'myKey/test.zarr' contains a group 

{'.zgroup': b'{\n    "zarr_format": 2\n}',
 'myKey/.zgroup': b'{\n    "zarr_format": 2\n}',
 'myKey/test.zarr/.zgroup': b'existing content!'}

There are two things to note here I think:

Zarr successfully detects that there is already a group registered at that path in the store
Despite this check, it still registers extra .zgroups from the root of the store down to the specified path

Additional output

No response


joshmoore · 2022-11-11T08:09:48Z

In my (or perhaps Zarr's) mind, both myKey and test.zarr in your path are also groups. So what's happening is the call to zarr.group(..., overwrite=False) is saying, "ok, @ChiliJohnson wants there to be a total of three groups, but I'm not to overwrite any of them". So .zgroup and myKey/.zgroup get created (since they didn't exist), but then it gets to the lowest level, realizes that the existing group can't be overwritten, and throws.

One could possibly argue that nothing should get created if the action isn't possible, but for the moment, it's not atomic.
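The behaviour described above can be modelled with a small sketch. This is a toy model written only for illustration, not Zarr's actual code: the helper names `group_keys_for_path` and `init_group_model` are hypothetical, and the order of operations (parents first, then the check on the target) is assumed from the output shown in the report.

```python
def group_keys_for_path(path):
    """Return the .zgroup keys implied by a group path, root first."""
    segments = [s for s in path.split("/") if s]
    keys = [".zgroup"]  # the store root is itself a group
    for i in range(1, len(segments) + 1):
        keys.append("/".join(segments[:i]) + "/.zgroup")
    return keys


def init_group_model(store, path, overwrite=False):
    """Toy model of the non-atomic creation described above (not Zarr's code)."""
    *parents, target = group_keys_for_path(path)
    # Parent groups are written first if they are missing...
    for key in parents:
        store.setdefault(key, b'{"zarr_format": 2}')
    # ...and only then is the target checked, so a failure here
    # leaves the parent keys behind.
    if target in store and not overwrite:
        raise ValueError(f"path {path!r} contains a group")
    store[target] = b'{"zarr_format": 2}'


store = {"myKey/test.zarr/.zgroup": b"existing content!"}
try:
    init_group_model(store, "myKey/test.zarr")
except ValueError as exc:
    print(exc)
print(sorted(store))
```

In this model the exception is raised only after the two parent keys have been added, which matches the store contents in the original report.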

Looking at your string, it seems like you intend for myKey/test.zarr to actually be the store. Does that sound more like what you are trying to achieve?

ChiliJohnson · 2022-11-11T19:32:45Z

Thanks for the write-up and clarification. Yes, in this case myKey/test.zarr should definitely be the store, and using it that way does make everything work as expected!
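For readers following along, here is a minimal sketch of what "making myKey/test.zarr the store" means for a dict-like store. `PrefixStore` is a hypothetical helper written for this example, not part of Zarr or s3fs; it relies only on the assumption that Zarr v2 accepts `MutableMapping` objects as stores.

```python
from collections.abc import MutableMapping


class PrefixStore(MutableMapping):
    """View of `base` restricted to keys under `prefix` (hypothetical helper)."""

    def __init__(self, base, prefix):
        self._base = base
        self._prefix = prefix.strip("/") + "/"

    def __getitem__(self, key):
        return self._base[self._prefix + key]

    def __setitem__(self, key, value):
        self._base[self._prefix + key] = value

    def __delitem__(self, key):
        del self._base[self._prefix + key]

    def __iter__(self):
        # Only keys under the prefix are visible, with the prefix stripped.
        n = len(self._prefix)
        return (k[n:] for k in list(self._base) if k.startswith(self._prefix))

    def __len__(self):
        return sum(1 for _ in self)


bucket = {"myKey/test.zarr/.zgroup": b"existing content!"}
scoped = PrefixStore(bucket, "myKey/test.zarr")
scoped[".zattrs"] = b"{}"  # lands at myKey/test.zarr/.zattrs in the bucket
print(sorted(scoped))
print(sorted(bucket))
```

Because the scoped view treats the prefix as its root, anything written through it stays under myKey/test.zarr, and no keys appear at the bucket root or at myKey/. With a real object store, the analogous step is to point the mapper at the group's prefix rather than at the bucket.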

I wonder if there might be a benefit to having a small section in the official tutorial about the semantics of a store and how to avoid a pitfall like this? As someone new to Zarr, but not new to Python or cloud object stores, my original thought process as a Zarr noob went like this:

I have hundreds of Zarr groups inside this S3 bucket that I want to write to / read from within a web service
As an optimization, maybe I don't want to create an s3fs.S3Map object for each group
Maybe I should just make one s3fs.S3Map object for the bucket and use Zarr's nice path= kwargs
Why do I have so many extra .zgroups at every key prefix in my bucket?
Oh wait why are there tons of slow, extra AWS API calls each time I call open?

After digging into it I realized that zarr.group was actually calling __iter__ on the store which, in the case of s3fs, ended up listing the entire contents of this bucket multiple times over on each call to open.
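One way to observe this kind of access pattern without touching S3 is to wrap a store in a mapping that counts calls. The sketch below is illustrative only: `CountingStore` is a hypothetical helper, and it assumes the store is used through the standard `MutableMapping` interface.

```python
from collections import Counter
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Wrap a dict-like store and count how often each operation is used."""

    def __init__(self, base):
        self._base = base
        self.calls = Counter()

    def __getitem__(self, key):
        self.calls["__getitem__"] += 1
        return self._base[key]

    def __setitem__(self, key, value):
        self.calls["__setitem__"] += 1
        self._base[key] = value

    def __delitem__(self, key):
        self.calls["__delitem__"] += 1
        del self._base[key]

    def __contains__(self, key):
        self.calls["__contains__"] += 1
        return key in self._base

    def __iter__(self):
        # On an object store, each full iteration is a full listing.
        self.calls["__iter__"] += 1
        return iter(list(self._base))

    def __len__(self):
        return len(self._base)


counting = CountingStore({})
counting["a/.zgroup"] = b"{}"
keys = list(counting)
present = "a/.zgroup" in counting
print(dict(counting.calls))
```

Passing such a wrapper as the `store=` argument and inspecting `calls` afterwards should show how many full listings a given call triggers, though the exact counts depend on the Zarr version in use.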

I can do some thinking and maybe draft a PR for a section on store semantics in the official tutorial, but I think initially having a little bit of guidance like this could've been helpful as someone new:

When writing data, even at a specific sub-path, Zarr stores are by definition groups themselves. When writing an array or group at a sub-path, Zarr will create groups at every level from the store's root down to whatever specific array/group you're writing to.
When reading data, even at a specific sub-path of a store, Zarr will iterate over every object in the store. Potential pitfall: using a Zarr store at the "root" or "bucket" level of a cloud object store will result in the entire content of the bucket being listed when opening / writing Zarr objects.

Thanks for taking the time to explain!

joshmoore · 2022-11-11T23:37:21Z

❤️ for additions where they would have helped you out.

rabernat · 2022-11-14T21:13:04Z

Thanks @ChiliJohnson - your issue has surfaced some extreme inefficiencies in how data is created in Zarr. IMO there are many unnecessary / duplicative checks such as...

After digging into it I realized that zarr.group was actually calling __iter__ on the store which, in the case of s3fs, ended up listing the entire contents of this bucket multiple times over on each call to open.

Oh wait why are there tons of slow, extra AWS API calls each time I call open?

This is similar to some of the diagnosis I did over in pangeo-data/pangeo-eosc#39 (comment) trying to understand why writing was slow.

I would be quite keen to rewrite some of this code to be more performant with cloud storage.

Another thing you might want to consider going forward is to switch to Zarr V3 as your format. I think that's where we should be focusing most of our effort going forward. (But it won't solve the core problem.)

ChiliJohnson added the bug Potential issues with the zarr-python library label Nov 10, 2022
