Nested storage detection in Zarr V2 #707

joshmoore · 2021-03-05T16:19:16Z

In order to better handle Zarr arrays created with NestedDirectoryStorage or FSStore(key_separator="/"), SabineEmbacher and I have been working on a "protocol heuristic" that can be used by V2 implementations to detect nested chunking rather than requiring the user to specify it correctly.

tl;dr: This proposes a new key for .zarray which it would be good to have feedback on.

Proposal

When creating a zarr array:

add to .zarray json: {"dimension_separator": "/"}
always write a 0-position chunk
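
As a rough illustration of the two creation steps above, here is a minimal sketch using only the standard library. The function name `create_array`, the dict-like `store`, and the fixed `<f8` dtype with no compressor are assumptions made for the example; this is not zarr-python's actual API.

```python
import json
import math


def create_array(store, shape, chunks, dimension_separator="/"):
    """Write a minimal .zarray document plus the 0-position chunk into `store`.

    `store` is any dict-like mapping of string keys to bytes (hypothetical).
    Assumes the array has at least one dimension.
    """
    meta = {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": "<f8",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "filters": None,
        # The proposed new key.
        "dimension_separator": dimension_separator,
    }
    store[".zarray"] = json.dumps(meta, indent=4).encode("ascii")

    # Always write the chunk at index (0, ..., 0) so that readers can probe for it.
    zero_key = dimension_separator.join("0" for _ in shape)
    # Uncompressed chunk of zeros: 8 bytes per float64 element.
    store[zero_key] = bytes(8 * math.prod(chunks))
    return zero_key
```

With a plain `dict` as the store, a 2-D array ends up with the keys `.zarray` and `0/0`.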

When opening an array:

try to read the separator character from the .zarray json
if not available, try to find a 0-position chunk
if not available, at every read action try to find chunks with both standard variants until the situation is clarified. (Standard separator list ["/", "."])
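
The three opening steps could be sketched roughly as follows. The helper names `detect_separator` and `read_chunk` and the dict-like `store` are hypothetical, and the sketch assumes the `.zarray` document is already present in the store.

```python
import json

# The standard separator list from the proposal.
STANDARD_SEPARATORS = ("/", ".")


def detect_separator(store):
    """Return the separator, or None if it cannot be determined yet."""
    meta = json.loads(store[".zarray"])

    # Step 1: read the separator from the .zarray JSON.
    sep = meta.get("dimension_separator")
    if sep is not None:
        return sep

    # Step 2: look for a 0-position chunk under each standard separator.
    # For 1-D arrays both candidates give the key "0", so the choice is irrelevant.
    ndim = len(meta["shape"])
    for candidate in STANDARD_SEPARATORS:
        if candidate.join(["0"] * ndim) in store:
            return candidate
    return None


def read_chunk(store, chunk_index, sep=None):
    """Step 3: while the separator is unknown, try both variants on every read.

    Returns (separator, chunk_bytes); chunk_bytes is None if no chunk was found.
    """
    candidates = (sep,) if sep is not None else STANDARD_SEPARATORS
    for candidate in candidates:
        key = candidate.join(str(i) for i in chunk_index)
        if key in store:
            return candidate, store[key]
    return sep, None
```

A reader would call `detect_separator` once on open and fall back to `read_chunk` with `sep=None` until some chunk clarifies the layout.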

Points for discussion:

The name dimension_separator differs from the code implementation key_separator to reduce confusion about whether every separator in the key name is affected.
There has been some discussion (community call, gitter) about whether or not "/" could become the default.
Is an addition to the .zarray metadata sufficiently low impact to be rolled into v2?

SabineEmbacher · 2021-03-07T19:31:51Z

Thoughts on the name of the separator:
What the separator separates is the respective chunk index per dimension.
See bcdev/jzarr#19 and bcdev/jzarr#17 (comment) and

Therefore the following suggestions:

chunk_index_separator
chunk_index_sep
chunk_idx_separator
chunk_idx_sep
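
To make concrete what the separator separates, here is a tiny sketch of how a chunk key is built from the per-dimension chunk indices. The function name `chunk_key` is made up for the example.

```python
def chunk_key(chunk_index, separator="."):
    """Join the chunk index of each dimension with the given separator."""
    return separator.join(str(i) for i in chunk_index)
```

For the chunk at index (2, 0, 1), the flat layout gives `2.0.1` and the nested layout gives `2/0/1`.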

jbms · 2021-03-15T18:49:42Z

One issue with the detection protocol is that it does not work when writing. While auto-detection is nice it might be better to just assume "." unless the metadata key is present --- any existing arrays could be fixed up, and hopefully all future arrays would have the metadata key. That avoids the added complexity (and inefficiency) of auto-detection.
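
A minimal sketch of this simpler rule, plus a one-off "fix up" for existing arrays, might look like the following. Both function names are hypothetical, and the metadata is handled as raw JSON bytes rather than through any particular library.

```python
import json


def separator_from_metadata(zarray_bytes):
    """Assume "." unless the metadata key is present; no probing of chunks."""
    meta = json.loads(zarray_bytes)
    return meta.get("dimension_separator", ".")


def fix_up_metadata(zarray_bytes, separator):
    """Return a .zarray document with the key added if it was missing."""
    meta = json.loads(zarray_bytes)
    meta.setdefault("dimension_separator", separator)
    return json.dumps(meta, indent=4).encode("ascii")
```

Because the rule depends only on the metadata, it behaves the same for reading and for writing into an existing array.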

joshmoore · 2021-03-16T10:45:16Z

Hi @jbms. You mean a mode="a"-style writing? If so, I see the problem. I had initially intended the auto-detection only as a means of not needing to "standardize" a new key. Then I got greedy and went for both. Anyone else have thoughts?

In the case of a mode="w" situation I would assume the writer simply asserts where it will write.

any existing arrays could be fixed up, and hopefully all future arrays would have the metadata key.

Guess there are some edge cases I worry about here. Maybe there are a number of strategies that can be enabled:

auto-detect
fail if missing metadata
set if missing metadata
...

The real question is likely to be what should be the default.
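
One way to picture these strategies is as a policy that applies only when the metadata key is missing. The enum, the function name, and the `assumed` parameter below are all invented for the sketch; nothing here reflects an existing implementation.

```python
import json
from enum import Enum


class MissingSeparatorPolicy(Enum):
    AUTO_DETECT = "auto-detect"
    FAIL = "fail if missing metadata"
    SET = "set if missing metadata"


def resolve_separator(store, policy, assumed="."):
    """Resolve the separator for the array in the dict-like `store` (hypothetical)."""
    meta = json.loads(store[".zarray"])
    sep = meta.get("dimension_separator")
    if sep is not None:
        return sep

    if policy is MissingSeparatorPolicy.FAIL:
        raise KeyError("dimension_separator missing from .zarray")

    if policy is MissingSeparatorPolicy.SET:
        # Record the assumed separator so later opens no longer need a policy.
        meta["dimension_separator"] = assumed
        store[".zarray"] = json.dumps(meta, indent=4).encode("ascii")
        return assumed

    # AUTO_DETECT: probe for a 0-position chunk; None means still undetermined.
    ndim = len(meta["shape"])
    for candidate in ("/", "."):
        if candidate.join(["0"] * ndim) in store:
            return candidate
    return None
```

Which member is passed when the caller says nothing is exactly the open question about the default.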

joshmoore · 2021-03-26T14:41:53Z

#707 (comment) What the separator separates is the respective chunk index per dimension.

@SabineEmbacher, @axtimwalde suggests "dimension separator".

axtimwalde · 2021-03-30T16:34:55Z

I used dimensionSeparator https://github.com/saalfeldlab/n5-zarr/blob/master/src/main/java/org/janelia/saalfeldlab/n5/zarr/N5ZarrReader.java#L89 or https://github.com/saalfeldlab/n5-universe/blob/main/src/main/java/org/janelia/saalfeldlab/n5/universe/N5Factory.java#L109 but the other choices sound reasonable too.

SabineEmbacher · 2021-04-06T09:53:46Z

I did not make my suggestions regarding the name in order to enforce them.
They were simply suggestions.
I agree with any decision. Even if a decision does not correspond to my personal preferences.
That is perfectly fine with me.

Please make a decision and I will put it in asap according to the specification.

joshmoore · 2021-04-08T08:08:36Z

I did not make my suggestions regarding the name in order to enforce them.

No worries, @SabineEmbacher. The suggestions were definitely useful. It's more just a matter of https://martinfowler.com/bliki/TwoHardThings.html ...

Since key_separator leads to confusion and dimensionSeparator/dimension_separator is at least used in some other implementation, I've updated the description with dimension_separator.

On the community call last night, there were no objections to moving forward with the .zarray addition, so I'll open a v2 spec PR now.

I'm a bit more hesitant about defining the heuristic as part of the specification (cf. @jbms's comment). I'll leave this open for a discussion of where first-chunk writing falls on the MAY/SHOULD/MUST spectrum.

Various implementations allow for defining the separator between the dimension indexes when writing chunks:

* n5-zarr defines a `dimensionSeparator` parameter;
* zarr-python's NestedDirectoryStore does so by default
* and FSStore provides a `key_separator` parameter;
* tensorstore has a `key_encoding` parameter; and
* jzarr is looking to add the same functionality.

When writing an array, it is straightforward to set this separator and have arrays properly configured. Consumers of such arrays, however, must either know *a priori* if their arrays use a non-default separator or must loop through all possible chunk keys searching for the right one.

By adding an optional metadata key to the .zarray, we:

* preserve the efficient configuration of arrays
* while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata value in the v2 spec, so we don't have a strong understanding of how that will play out, and that datasets which were previously written with non-default separators will need updating in order to enable the detection, though that is no worse than the current situation.

* v2 spec: add optional dimension_separator (see #707)
* Update dim. sep. description after feedback
* Remove `MUST NOT` restriction for other keys

joshmoore · 2021-09-22T13:06:11Z

I consider the dimension_separator saga complete! whew

joshmoore mentioned this issue Mar 26, 2021

Support nested chunk storage gzuidhof/zarr.js#88

Closed

joshmoore mentioned this issue Mar 26, 2021

Include nested/flat tests zarr-developers/zarr_implementations#26

Open

joshmoore closed this as completed Sep 22, 2021

joshmoore mentioned this issue Sep 22, 2021

Nested cloud storage #395

Closed
