Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested storage detection in Zarr V2 #707

Closed
joshmoore opened this issue Mar 5, 2021 · 8 comments
Closed

Nested storage detection in Zarr V2 #707

joshmoore opened this issue Mar 5, 2021 · 8 comments

Comments

@joshmoore
Copy link
Member

joshmoore commented Mar 5, 2021

see: ome/ngff#29 and bcdev/jzarr#17

In order to better handle Zarr arrays created with NestedDirectoryStorage or FSStore(key_separator="/"), SabineEmbacher and I have been working on a "protocol heuristic" that can be used by V2 implementations to detect nested chunking rather than requiring the user to specify it correctly.

tl;dr: This proposes a new key for .zarray which it would be good to have feedback on.


Proposal

When creating a zarr array:

  • add to .zarray json: {"dimension_separator": "/"}
  • always write a 0-position chunk

When opening an array:

  • try to read the separator character from the .zarray json
  • if not available, try to find a 0-position chunk
  • if not available, at every read action try to find chunks with both standard variants until the situation is clarified. (Standard separator list ["/", "."])

Points for discussion:

  • The name dimension_separator differs from the code implementation key_separator to reduce confusion about whether every separator in the key name is effected.
  • There has been some discussion (community call, gitter) about whether or not "/" could become the default.
  • Is an addition to the .zarray metadata sufficiently low impact to be rolled into v2?
@SabineEmbacher
Copy link

Thoughts on the name of the separator:
What the separator separates is the respective chunk index per dimension.
See bcdev/jzarr#19 and bcdev/jzarr#17 (comment) and

Therefore the following suggestions:

  • chunk_index_separator
  • chunk_index_sep
  • chunk_idx_separator
  • chunk_idx_sep

@jbms
Copy link

jbms commented Mar 15, 2021

One issue with the detection protocol is that it does not work when writing. While auto-detection is nice it might be better to just assume "." unless the metadata key is present --- any existing arrays could be fixed up, and hopefully all future arrays would have the metadata key. That avoids the added complexity (and inefficiency) of auto-detection.

@joshmoore
Copy link
Member Author

Hi @jbms. You mean a mode="a"-style writing? If so, I see the problem. I had initially intended the auto-detection only as a means of not needing to "standardize" a new key. Then I got greedy and went for both. Anyone else have thoughts?

In the case of a mode="w" situation I would assume the writer simply asserts where it will write.

any existing arrays could be fixed up, and hopefully all future arrays would have the metadata key.

Guess there are some edge cases I worry about here. Maybe there are a number of strategies that can be enabled:

  • auto-detect
  • fail if missing metadata
  • set if missing metadata
  • ...

The real question is likely to be what should be the default.

@joshmoore
Copy link
Member Author

#707 (comment) What the separator separates is the respective chunk index per dimension.

@SabineEmbacher, @axtimwalde suggests "dimension separator".

@SabineEmbacher
Copy link

I did not make my suggestions regarding the name in order to enforce them.
They were simply suggestions.
I agree with any decision. Even if a decision does not correspond to my personal preferences.
That is perfectly fine with me.

Please make a decision and I will put it in asap according to the specification.

@joshmoore
Copy link
Member Author

I did not make my suggestions regarding the name in order to enforce them.

No worries, @SabineEmbacher. The suggestions were definitely useful. It's more just a matter of https://martinfowler.com/bliki/TwoHardThings.html ...

Since key_separator leads to confusion and dimensionSeparator/dimension_separator is at least used in some other implementation, I've updated the description with dimension_separator.

On the community call last night, there were no objections to moving forward with the .zarray addition, so I'll open a v2 spec PR now.

I'm a bit more hesitant about defining the heuristic as part of the specification (cf. @jbms comment) I'll leave this open for a discussion of where first-chunk writing falls on the MAY/SHOULD/MUST spectrum.

joshmoore added a commit to joshmoore/zarr-python that referenced this issue Apr 8, 2021
Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

 * n5-zarr defines a `dimensionSeparator` parameter;
 * zarr-python's NestedDirectoryStore does so by default
 * and FSStore provides a `key_separator` parameter;
 * tensorstore has a `key_encoding` parameter; and
 * jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know *a priori* if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

 * preserve the efficient configuration of arrays
 * while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding
of how that will play out, and datasets which were previously written
with non-default separators will need updating in order to enable the
detection though that is no worse than the current situation.
joshmoore added a commit to joshmoore/zarr-python that referenced this issue Apr 8, 2021
Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

 * n5-zarr defines a `dimensionSeparator` parameter;
 * zarr-python's NestedDirectoryStore does so by default
 * and FSStore provides a `key_separator` parameter;
 * tensorstore has a `key_encoding` parameter; and
 * jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know *a priori* if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

 * preserve the efficient configuration of arrays
 * while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding
of how that will play out, and datasets which were previously written
with non-default separators will need updating in order to enable the
detection though that is no worse than the current situation.
joshmoore added a commit that referenced this issue Apr 21, 2021
* v2 spec: add optional dimension_separator (see #707)

Various implementations allow for defining the separator between the
dimension indexes when writing chunks:

 * n5-zarr defines a `dimensionSeparator` parameter;
 * zarr-python's NestedDirectoryStore does so by default
 * and FSStore provides a `key_separator` parameter;
 * tensorstore has a `key_encoding` parameter; and
 * jzarr is looking to add the same functionality.

When writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know *a priori* if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.

By defining adding an optional metadata key to the .zarray, we:

 * preserve the efficient configuration of arrays
 * while keeping the v2 spec backwards compatible.

The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding
of how that will play out, and datasets which were previously written
with non-default separators will need updating in order to enable the
detection though that is no worse than the current situation.

* Update dim. sep. description after feedback

* Remove `MUST NOT` restriction for other keys
@joshmoore
Copy link
Member Author

I consider the dimension_separator saga complete! whew

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants