-
-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v2 spec: add optional dimension_separator (see #707) #715
Conversation
Various implementations allow for defining the separator between the dimension indexes when writing chunks: * n5-zarr defines a `dimensionSeparator` parameter; * zarr-python's NestedDirectoryStore does so by default * and FSStore provides a `key_separator` parameter; * tensorstore has a `key_encoding` parameter; and * jzarr is looking to add the same functionality. When writing an array, it is straight-forward to set this separator and have arrays properly configured. Consumers of such arrays, however, must either know *a priori* if their arrays use a non-default separator or must loop through all possible chunks keys searching for the right one. By defining adding an optional metadata key to the .zarray, we: * preserve the efficient configuration of arrays * while keeping the v2 spec backwards compatible. The primary downsides are that this will be the first optional metadata value in the v2 spec and therefore we don't have a strong understanding of how that will play out, and datasets which were previously written with non-default separators will need updating in order to enable the detection though that is no worse than the current situation.
So presumably you would be able to still pass the separator argument OR look for the metadata key. The first would take priority and would remain the recommendation for the current and older versions of zarr and for datasets that were written by them. |
@martindurant: exactly. Wasn't sure if that was too much for this paragraph, but I can look for a spot to go into more details. I imagine if someone passes a separator that doesn't match the expected one we could warn in mode="r". For mode="a|w", I'd personally favor an exception but an alternative would be to update the metadata, effectively hiding the existing chunks. |
It would be reasonable to supply a helper function somewhere to convert an existing array from one style of metadata to another. |
I'm happy with this and can add support to tensorstore and neuroglancer. Is this the confirmed metadata format or are we waiting on additional discussion? I'd rather not add this only to then have to also support something else in the future. |
I think we are hoping for more +1 from the other maintainers. |
The change in the spec says this:
I think the last part "that in most implementations produces a directory structure" should be removed since it is inserting implementation claims into what should be an implementation free document. |
Would different wording help, like "in some implementation is a special character used to mark directories" or "the directory delimiter in posix-like storage", or "commonly used as a directory delimiter" ? |
Why is it needed at all? |
I like including an example of a non-default value because provides a succinct motivation for why this feature exists. The Zarr spec already has a number of such examples for explaining other features. |
Reasonable. Then ignore this suggested change. Thanks. |
Note: 5 days have passed. https://github.com/zarr-developers/governance/blob/master/GOVERNANCE.md states
Considering this is an optional addition, I'd propose that we leave this call for objections open until the end of the week and then start moving forward? (Any objections to that plan today please. 😉 ) |
Can the separator be a UTF-8 character => multi-byte? |
I think that's a valid question. I vote for UTF-8 arbitrary length string. ASCII encouraged. |
For simplicity it might be best to limit it to one of a small set of characters for now, e.g. {"." and "/"}, and possibly expand that later if there is a need for it. On the implementation side it is not terribly difficult to just concatenate whatever arbitrary string is specified, but there are a few ways things can go wrong:
While in general the user should be free to make a bad choice and suffer the consequences, in some cases software that processes zarr may be intended to handle an untrusted zarr volume and then care must be taken to handle these issues. Unless there is currently a use case for a separator other than {"." and "/"} it seems like it would be reasonable to restrict it to those options for now. |
Also by putting it into .zarray, it implies that the separator can differ for each |
Putting the separator into As I imagine implementing this, and generally in what seems like the most natural implementation, there wouldn't be any problem with different separators within the same hierarchy. However I see that having different separators within the same hierarchy would not work with the existing implementation of NestedDirectoryStore. However I would suppose that once this feature is implemented NestedDirectoryStore would no longer be needed except for API compatibility. |
I like that it can be different for each dataset, it's simpler to parse and may be useful. |
I would suggest for version 3 that you introduce a single "superblock" object, ".zarr" say, |
This is somewhat off-topic, but it sounds like the real unit here is the .zarray+.zattr+chunks. |
I used the term dataset for 'array' as in HDF5 and N5, sorry for the confusion. I like the term 'container' for the entirety of groups and arrays that belong to the same root. That would be the equivalent of an HDF5 file. |
The separator should be individually adjustable for each array. A zarr data set can thus also be compiled manually from already existing data. |
Having some experience dealing with arbitrary unicode both from an implementation and security perspective I would largely parrot @jbms. Many of us are using, or have the desire to use, Zarr in a trusted environments and in particular on large shared filesystems. Allowing arbitrary unicode dimension separators sets up a whole series of cross-language programming safety, path normalization, and subpath containment scenarios that must be dealt with to be safe in such an environment. Having a defined set of allowed ASCII separator characters is both easier for implementers and safer. |
…developers#715) * All top-level storage classes now take an optional `dimension_separator` parameter which defaults to `None`, but can also be `.` or `/`. * A ValueError is raised at normalization time if this is not the case. * `None`s are normalized to the default of `.` in all except the NestedDirectoryStore case. * The value is stored as `self._dimension_separator` on participating classes so that array creation can lookup the value. * This value deprecates the `key_separator` value from FSStore. * Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not* follow this pattern and instead rely on the value in the underlying store.
Currently tensorstore does fail if |
That restriction should probably be removed from the spec and replaced |
This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes #241.
Thanks for the suggestion, @DennisHeimbigner. I've pushed: -Other keys MUST NOT be present within the metadata object.
+Other keys SHOULD NOT be present within the metadata object and SHOULD be
+ignored by implementations. |
* feat(zarr): support dimension_separator in .zarray file This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes google#241. * feat(multiscale mesh): add additional debugging checks * fix(multiscale mesh): turn off multiscale mesh fragment debugging by default This was mistakenly left on by default. Co-authored-by: Jeremy Maitin-Shepard <[email protected]>
The zarr-python implementation (#716) is finally green. I'd propose we try to get these both merged this week, unless anyone else who is working on an implementation needs more time. |
Some additional transcribed notes from @alimanfoo:
|
…#716) * Implement `dimension_separator` for Python storage classes (See #715) * All top-level storage classes now take an optional `dimension_separator` parameter which defaults to `None`, but can also be `.` or `/`. * A ValueError is raised at normalization time if this is not the case. * `None`s are normalized to the default of `.` in all except the NestedDirectoryStore case. * The value is stored as `self._dimension_separator` on participating classes so that array creation can lookup the value. * This value deprecates the `key_separator` value from FSStore. * Wrapper classes like LRUCacheStore and ConsolidatedMetadataStore *do not* follow this pattern and instead rely on the value in the underlying store. * Only store `dimension_separator` if not None All hexdigest tests were failing due to updated array metadata. In the case of NestedDirectoryStore and N5Store, this is necessary. If the dimension_separator key is excluded from the .zarray JSON when None, then most standard tests continue to pass. * Fix doctests with optional key * Add separator to missed LDBMStore * Fix linting issue * De-deprecate key_separator as public, non-null API * Add test for normalize_dim_sep to appease codecov * More tests for codecov * Remove key from n5 array metadata * Fix minor typo * Cleanup DIGESTS in test_core.py * Fix cut-n-paste error in test_utils.py * And hopefully on last codecov fix * Apply review changes * Add 2.8.0 release notes
* feat: add state saving/sharing ui allowing users to send their state to a server that accepts a json encoded version of the neuroglancer state. the "state server" is expected to return a url that will respond that same json string * feat(zarr): support dimension_separator in .zarray file This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes google#241. * feat(multiscale mesh): add additional debugging checks * fix(multiscale mesh): turn off multiscale mesh fragment debugging by default This was mistakenly left on by default. * simplified state sharing code. replaced dialog with a state server config file and a select element if there is more than 1 server * moved state share code out of viewer into separate file, renamed state key for the selected state server * feat(ui): add support for configurable side panels Previously, only two side panels were supported, the layer side panel and the selection details side panel. Only a static configuration was supported, a single column on the right, with the layer side panel on top and the selection details below, and the vertical height divided equally. This commit adds support for an arbitrary number of side panels, to the left/right/top/bottom of the main data panels. Panels can be moved via drag and drop, and the size can also be adjusted. The side panels for multiple layers can now be shown at once, and individual tabs of side panels can also be dragged to separate panels. * feat: add layer list panel and support for "archived" layers Archived layers are shown only in the new layer list panel; they are not visible and are not shown in the layer bar. They may be useful in the case where there are a large number of layers that would otherwise clutter up the layer bar. This commit also improves the visual feedback during layer and drop operations. * feat(python): add ConfigState options for additional panel buttons * fix(python): fix ManagedLayer.visible property * fix(annotation): only bind properties actually referenced by the shader Annotation rendering maps properties to shader vertex attributes, and WebGL implementations limit the number of attributes to ~16. Furthermore, previously all properties were bound even if not referenced. With this commit, only properties actually referenced by the shader are bound. This allows annotation layers to support data sources with a large number of properties, provided that only a small number are actually referenced by the shader at once. * fix(annotation): fix rendering tab to better handle a large number of properties * fix(annotation): prevent collapse of annotation property list * fix(perspective_panel): eliminate separate transparentPickEnabled behavior Segmentation layers have a "pick" boolean option that previously controlled two things: 1. When off, double clicking does not select/deselect segments in the layer. 2. When on, picking of transparent meshes behaves exactly the same as if the mesh were opaque --- i.e. it is not possible to pick objects behind/inside the mesh. When off, picking of transparent meshes has lower precendence than opaque objects behind. In practice, the picking behavior for transparent objects with "picking on", which made it impossible to pick opaque objects behind, is not very useful. With this change, the second behavior change no longer applies: transparent objects now always have lower picking precendence than opaque objects, making it possible to pick opaque objects behind transparent objects. * fix(ui/side_panel): only allow dragging from title bar This allows slider controls in the body of side panels to work correctly. Also disables auto-select on focus behavior for the name input field of the layer side panels. The previous auto-select on focus behavior meant that attempting to start a drag from the text area (which takes up most of the title bar and is the most natural place to drag from) worked, but showed the text itself being dragged rather than the entire panel. Co-authored-by: Chris Jordan <[email protected]> Co-authored-by: Jeremy Maitin-Shepard <[email protected]> Co-authored-by: Hannah Gooden <[email protected]>
Historically zarr has named chunk files using dots by default, e.g. "1.2.3", but has also supported slashes, e.g. "1/2/3". However, the choice was not indicated in the .zarray metadata file, and instead needed to be specified out-of-band. In TensorStore, that was accomplished using the "key_encoding" spec member. Zarr recently added a dimension_separator field to the metadata format: zarr-developers/zarr-python#715 This commit adds support for that field. Note that previously TensorStore did not accept .zarray files that contain unknown members, meaning that reading arrays with the dimension_separator field would fail. The existing "key_encoding" spec member is still supported, but deprecated in favor of specifying "dimension_separator" as part of the "metadata" field. When creating a new array, TensorStore will always write the "dimension_separator" field, in order to allow the array to be re-opened without specifying the "dimension_separator". Additionally, this change also makes TensorStore accept (and preserve) extra .zarray members, as the zarr spec has now also changed to require that. PiperOrigin-RevId: 384998911 Change-Id: I485141b7786084d57c28005e3076f70462cd5b9e
This implements the spec change described here: zarr-developers/zarr-python#715 This also adds a query string parameter by the same name. Fixes #241.
Various implementations allow for defining the separator between the
dimension indexes when writing chunks:
dimensionSeparator
parameter;key_separator
parameter;key_encoding
parameter; andWhen writing an array, it is straight-forward to set this separator
and have arrays properly configured. Consumers of such arrays,
however, must either know a priori if their arrays use a
non-default separator or must loop through all possible chunks keys
searching for the right one.
By defining adding an optional metadata key to the .zarray, we:
The primary downsides are that this will be the first optional metadata
value in the v2 spec and therefore we don't have a strong understanding of how that will play out, and datasets which were previously written with non-default separators will need updating in order to enable the detection though that is no worse than the current situation.
cc: @alimanfoo @chris-allan @SabineEmbacher @jbms @axtimwalde @manzt @gzuidhof @haileyajohnson @DennisHeimbigner @zarr-developers/core-devs
(If I've missed any Zarr v2 implementors, please add them)