v3: use alternative data type syntax #131
Comments
Makes sense. Maybe we can follow something like Edit: On endianness, perhaps this approach from Boost would be worth looking at |
For the base data type names, For endianness there is less standardization. There is also the question of whether the endianness should be included in the zarr data type at all; arguably it is more a property of how the data is encoded in the chunks, which is something that zarr is intended to abstract over. For example, if you use a codec based on an image format, like png, jpeg, etc., then it isn't useful to be able to specify an additional endianness, since when encoding and decoding the image format you would always use native endianness.

If we go in that direction, the zarr

{
  "dtype": "float32",
  "codecs": [{"id": "endian", "endian": "big"}, {"id": "blosc", "cname": "lz4"}]
}

Then if using an image codec, we would instead have something like:

{
  "dtype": "uint16",
  "codecs": [{"id": "png"}]
}

Reading from such an array using zarr-python would always return data in native endianness, since the chunks would naturally be decoded when reading. I could imagine that in rare cases the user may want to keep the data in the same endianness in which it is encoded, e.g. if they are copying between two zarr arrays in the same non-native endianness. That is a case where the existing zarr v2 approach of encoding the endianness as part of the dtype is probably a better fit, but I do think it is a rare use case.

If we do still want to encode the endianness as part of the zarr dtype name, you raise a good point that in addition to the |
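To make the codec-based idea above concrete, here is a minimal sketch of how a reader might use an "endian" codec entry to decode a chunk into plain values. It is only an illustration under stated assumptions: `STRUCT_CODES`, `encoded_byte_order`, and `decode_chunk` are hypothetical names, the fallback to little endian when no "endian" codec is listed is an assumption, and compression codecs such as blosc are left out entirely.

```python
import json
import struct

# Hypothetical mapping from endianness-free dtype names to struct format codes.
STRUCT_CODES = {"float32": "f", "uint16": "H", "int32": "i"}


def encoded_byte_order(metadata):
    """Return the byte order in which the chunk bytes are stored.

    Assumption: without an "endian" codec, chunks are little endian.
    """
    for codec in metadata.get("codecs", []):
        if codec["id"] == "endian":
            return codec["endian"]
    return "little"


def decode_chunk(raw, metadata):
    """Decode raw chunk bytes into a list of Python numbers."""
    code = STRUCT_CODES[metadata["dtype"]]
    prefix = ">" if encoded_byte_order(metadata) == "big" else "<"
    count = len(raw) // struct.calcsize(prefix + code)
    return list(struct.unpack(f"{prefix}{count}{code}", raw))


# Metadata in the shape discussed above (blosc omitted for brevity).
metadata = json.loads(
    '{"dtype": "float32", "codecs": [{"id": "endian", "endian": "big"}]}'
)
raw = struct.pack(">3f", 1.0, 2.5, -4.0)  # big-endian bytes as stored on disk
values = decode_chunk(raw, metadata)
```

The point of the sketch is that the dtype name stays the same regardless of how the bytes are stored; only the codec list changes.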
The Apache Arrow spec came up in the meeting in the context of awkward arrays, and it may indeed be interesting to see to what extent integrating with its data model makes sense. That could also have implications for data type naming. Apache Arrow does not include endianness as part of the data type; instead, I believe a single endianness is specified at a higher level for an entire batch of serialized data. Apache Arrow uses the normal |
cc @QuLogic (who may have thoughts on endianness) |
Thinking about this more: I do think it makes sense to decouple endianness from the zarr data type itself for the reasons previously mentioned. However, the current v3 proposal does not support a chain of codecs, and it would be nice to decouple the data type naming from that issue. In my view, big endian is not widely used, and most of the cases where it is used are only due to choosing it as a default "network byte order" and would actually be better served by little endian. Therefore, we can simply say for now that all data types are encoded as little endian. In the future, when a chain of codecs is supported, there could be support for big endian encoding, if it is needed, via an "endian" codec. |
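As a rough illustration of the "always little endian" rule, the sketch below stores 16-bit unsigned values as little-endian bytes on any host and reads them back as native values. The function names are hypothetical, and it assumes the `array` typecode `"H"` is 2 bytes wide, which holds on common platforms.

```python
import sys
from array import array


def encode_little_endian(values, typecode="H"):
    """Serialize values as little-endian bytes, whatever the host byte order."""
    buf = array(typecode, values)
    if sys.byteorder == "big":
        buf.byteswap()  # native big endian -> little endian for storage
    return buf.tobytes()


def decode_little_endian(raw, typecode="H"):
    """Read little-endian bytes back into native Python integers."""
    buf = array(typecode)
    buf.frombytes(raw)
    if sys.byteorder == "big":
        buf.byteswap()  # stored little endian -> native big endian
    return buf.tolist()


raw = encode_little_endian([1, 258, 65535])
restored = decode_little_endian(raw)
```

On a little-endian host both byte swaps are skipped, so the common case costs nothing extra.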
I believe this is fixed by #155; feel free to re-open this issue if that is not the case. |
The current data type identifiers in zarr v3 are similar to, but not identical to, NumPy type strings.
However, the NumPy type string syntax has some unfortunate limitations:
Of course it is still possible to extend the current naming scheme with arbitrary additions, so the current scheme does not impose any real limitation, but a different naming scheme would allow for greater consistency:
etc.
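For illustration only, here is a small sketch of how a NumPy-style type string could be split into an endianness-free base name plus a separate byte order, in the spirit of the names used in the comments (`float32`, `uint16`). The function name and the exact output names are assumptions, not part of any zarr spec, and only fixed-size numeric and boolean kinds are handled.

```python
# NumPy type strings start with a byte-order character, then a kind
# character, then the item size in bytes (e.g. "<f4", ">u2", "|u1").
BYTE_ORDER = {"<": "little", ">": "big", "|": None}  # "|" means not applicable
KIND_NAMES = {"b": "bool", "i": "int", "u": "uint", "f": "float", "c": "complex"}


def split_numpy_type_string(type_string):
    """Return (base_name, byte_order) for a NumPy-style type string."""
    order_char, kind = type_string[0], type_string[1]
    size_in_bytes = int(type_string[2:])
    byte_order = BYTE_ORDER[order_char]
    if kind == "b":
        return "bool", byte_order  # booleans carry no bit-width suffix here
    return f"{KIND_NAMES[kind]}{size_in_bytes * 8}", byte_order


examples = {s: split_numpy_type_string(s) for s in ["<f4", ">u2", "|u1", "<c16"]}
```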