-
-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add wrappers for numcodecs in v3 #1839
Conversation
Thanks @normanrz, personally I think these changes are good, and thanks for working to bridge numcodecs into v3. zarr / numcodecs relationshipAs you note, there is a circular dependency issue if we wanted to put zarr v3 data structures in
The last option is the most work, but the cleanest result in my opinion. curious to hear if there are any other ideas. issues with the v3 codec specI think this PR exposes some issues with how codecs are defined in the v3 spec. @normanrz wrote:
Does the spec really only list a few official codecs? the v3 spec states (emphasis mine)
If the true intention of the spec is to list a few officially supported codecs, then we must change the quoted text from the spec, because right now it explicitly says that the spec does not define a list of codecs. Also, I'm quite confused about requiring that the codec identifier be a URI. In your example, "https://zarr.dev/numcodecs/fixedoffsetscale" does not resolve to a human readable specification of the codec, so it's not spec-compliant. Indeed If we want We should strive to make zarr hierarchy validation robust -- it should be possible to validate a zarr hierarchy without an internet connection. The language about URIs makes this impossible right now. And I think if the spec contains language that encourages people to make fake URIs that don't actually point to anything, then that's another sign of a correctable defect in the spec. I would be very interested in amending the spec to fix this problem. |
FWIW I've been confused about these issues also, and agree with @d-v-b From my perspective, an awful lot of what people do is covered by Blosc, so giving it special status in the spec might be pragmatic? |
I agree that the spec could use some clarification in that regard. @d-v-b I am not sure whether you are advocating for a closed list of codec (i.e. blosc, gzip etc.) or an entirely open set? I think a closed list that are expected to be implemented by all libraries is desirable. An open set would make validation and interoperability a lot more complicated. Providing flexibility for experimenting with other codecs through clearly namespaced codecs seems like a good compromise, though. Whether the URI needs to point to a human-readable spec or just needs to be a unique URI is debatable. I assume you would like to see that URI resolve to a JSON schema or similar for validation purposes.
That is correct. We could easily add a redirect at that URI that points to the numcodecs docs, though. I wanted to pick a URI prefix that is owned by the Zarr community (ie. not https://numcodecs.readthedocs.io). Regarding the numcodecs <> zarr relationship, I am leaning towards moving the |
Would it make sense for us to switch from ABCs to Protocols for the Codecs? I believe that would help us avoid making Zarr a dependency of Numcodecs. |
I made a prototype what the integration within |
There is quite a bit of implementation in the ABCs which makes it harder to switch them to Protocols. |
If the zarr v3 spec defines a closed set of codecs, then we can't develop new codecs without changing the spec. That seems very undesirable. So I think the alternative (the zarr v3 spec does not define a closed set of codecs) is our only option. That doesn't mean we shouldn't standardize codecs, simply that the list of zarr codecs should not be part of the zarr spec. I think a list of codecs is actually very useful for implementations, especially if they are specified in a language-agnostic way. But I'm not convinced that there's value in forcing implementations to implement N different codecs in order to be "real" zarr implementations. We should just have a list of codec specifications, and zarr implementations (or non-zarr applications that just want a nice JSON serialization of compression routines) can implement the codecs that are useful for them. |
A particular problem here is there is much less maintenance effort on Blosc1 compared to Blosc2. I recommend that we view the Blosc1 frame format as deprecated. |
We should probably move this discussion to zarr-specs because it is not really about the zarr-python impl. I feel quite strongly, that non-standard codecs need to be labeled as such (e.g. through URI-style naming instead of short names). Having multiple codecs (even if the encoded format is only slightly different) with the same name would be a desaster. Perhaps |
Agreed, I will write up my concerns in a zarr-specs issue. |
see zarr-developers/zarr-specs#293 for a discussion of codecs in the v3 spec |
Circling back to the original discussion of this PR.
I am actually quite pleased how the implementation on the numcodecs side looks like. I am inclined to close this PR in favor of zarr-developers/numcodecs#524. What do you all think? |
Closing this in favor of zarr-developers/numcodecs#524 |
The Zarr v3 specification only lists a few codecs that are officially supported. However, it is desirable to expose the codecs in numcodecs for use with v3 arrays as well. This PR adds wrapper classes for numcodecs support.
The name of the codecs is prefixed with
https://zarr.dev/numcodecs/
to avoid naming collisions in case some codecs of numcodecs get added to the Zarr spec. Also, there is a warning that numcodecs codecs are not officially supported and will likely not work in any other Zarr implementation.Most array-to-array ("filters") and bytes-to-bytes codecs are supported. Absent are the variable-length codecs as well as json, msgpack and pickle.
Here is an example of the persisted configuration:
This PR adds a fairly close coupling with the contents of numcodecs. It is required to accommodate the new codec API in zarr-python. However, this coupling is unfortunate. Ideally, this code could be moved to the numcodecs package. Due to the use of the Codec ABCs and helper functions, this would introduce a circular dependency. Not sure if it would be possible to make it a conditional module within numcodecs. Happy to hear any thoughts on this.
Use of numcodecs in v2 arrays is not affected.
TODO: