v3: use alternative data type syntax #131

Closed
jbms opened this issue Feb 8, 2022 · 6 comments

@jbms
Contributor

jbms commented Feb 8, 2022

The current data type identifiers in zarr v3 are similar to, but not identical to, NumPy type strings.

However, the NumPy type string syntax has some unfortunate limitations:

  • No natural way to specify bfloat16 or other floating point type variants.
  • No natural way to specify sizes that are not a multiple of 8 bits, e.g. a 4-bit or 12-bit type.

Of course it is still possible to extend the current naming scheme with arbitrary additions, so the current scheme does not impose any real limitation, but a different naming scheme would allow for greater consistency:

bool
int8
uint8
int16le
int16be
uint16be
float16le
float16be
float64le
float64be

etc.
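As a rough illustration of how names in this style could be handled, here is a minimal Python sketch that splits a name into base type, bit width, and endianness suffix, and maps it to a NumPy type string where one exists. The function names `parse_dtype_name` and `to_numpy_typestr` are hypothetical and not part of any zarr specification or library; the grammar is only an assumption based on the examples above.

```python
import re

# Assumed grammar for the proposed names: base, optional bit width,
# optional "le"/"be" suffix. This is a guess from the examples, not a spec.
_NAME_RE = re.compile(r"^(bool|uint|int|float|bfloat)(\d+)?(le|be)?$")
_NUMPY_KIND = {"int": "i", "uint": "u", "float": "f"}


def parse_dtype_name(name):
    """Split a name such as 'int16le' into (base, bits, endian)."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized data type name: {name!r}")
    base, bits, endian = match.groups()
    if base == "bool":
        if bits is not None or endian is not None:
            raise ValueError(f"bool takes no width or endianness: {name!r}")
        return ("bool", None, None)  # width left unspecified here
    if bits is None:
        raise ValueError(f"missing bit width: {name!r}")
    return (base, int(bits), endian)


def to_numpy_typestr(name):
    """Return a NumPy type string, or None when NumPy has no equivalent."""
    base, bits, endian = parse_dtype_name(name)
    if base == "bool":
        return "|b1"
    # bfloat16 and widths that are not whole bytes have no NumPy type string.
    if base not in _NUMPY_KIND or bits % 8 != 0:
        return None
    if base == "float" and bits not in (16, 32, 64):
        return None
    nbytes = bits // 8
    if nbytes == 1:
        return "|" + _NUMPY_KIND[base] + "1"  # byte order is irrelevant
    if endian is None:
        raise ValueError(f"multi-byte type needs an le/be suffix: {name!r}")
    return ("<" if endian == "le" else ">") + _NUMPY_KIND[base] + str(nbytes)
```

With this sketch, `bfloat16le` and a 4-bit name such as `uint4` parse cleanly but map to `None`, which is the gap in the NumPy syntax described above.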

@jakirkham
Member

jakirkham commented Mar 23, 2022

Makes sense

Maybe we can follow something like stdint.h? Though I don't think that handles endianness.

Edit: On Endianness, perhaps this approach from Boost would be worth looking at

@jbms
Contributor Author

jbms commented Mar 23, 2022

For the base data type names, bool, {uint,int}{8,16,32,64}, float{16,32,64} and bfloat16 are pretty obvious canonical choices. The integer names are consistent with stdint.h and also with the numpy "names" (as opposed to the numpy type strings).

For endianness there is less standardization. There is also the question of whether the endianness should be included in the zarr data type at all; arguably it is more a property of how the data is encoded in the chunks, which is something that zarr is intended to abstract over. For example, if you use a codec based on an image format, like png, jpeg, etc., then it isn't useful to be able to specify an additional endianness, since when encoding and decoding the image format you would always use native endianness.

If we go in that direction, the zarr dtype could avoid specifying endianness, and instead that would be specified in the chain of codecs, e.g.:

{
  "dtype": "float32",
  "codecs": [{"id": "endian", "endian": "big"}, {"id": "blosc", "cname": "lz4"}]
}

Then if using an image codec, we would instead have something like:

{
  "dtype": "uint16",
  "codecs": [{"id": "png"}]
}

Reading from such an array using zarr-python would always return data in native endianness, since the chunks are naturally decoded on read. I could imagine that in rare cases the user may want to keep the data in the endianness in which it is encoded, e.g. if they are copying between two zarr arrays that share the same non-native endianness. In that case the existing zarr v2 approach of encoding the endianness as part of the dtype is probably a better fit, but I do think that may be a rare use case.
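To make the codec-chain idea concrete, here is a minimal Python sketch, using only the standard library, of how an "endian" codec could sit in front of a compressor. It is not how zarr-python works: `encode_chunk`, `decode_chunk`, and the `_STRUCT_CHAR` table are hypothetical names, and zlib stands in for blosc purely so that the example has no third-party dependencies.

```python
import struct
import zlib

# Hypothetical mapping from dtype names to struct format characters.
_STRUCT_CHAR = {"uint16": "H", "int16": "h", "float32": "f", "float64": "d"}


def _endian_format(metadata, codec, count):
    # ">" and "<" select big and little endian with standard sizes.
    prefix = ">" if codec["endian"] == "big" else "<"
    return f"{prefix}{count}{_STRUCT_CHAR[metadata['dtype']]}"


def encode_chunk(values, metadata):
    """Apply the codecs in order: values -> bytes -> compressed bytes."""
    data = values
    for codec in metadata["codecs"]:
        if codec["id"] == "endian":
            data = struct.pack(_endian_format(metadata, codec, len(data)), *data)
        elif codec["id"] == "zlib":  # stand-in for blosc in this sketch
            data = zlib.compress(data)
        else:
            raise ValueError(f"unsupported codec: {codec['id']!r}")
    return data


def decode_chunk(data, metadata):
    """Apply the codecs in reverse order, returning native Python values."""
    itemsize = struct.calcsize("<" + _STRUCT_CHAR[metadata["dtype"]])
    for codec in reversed(metadata["codecs"]):
        if codec["id"] == "zlib":
            data = zlib.decompress(data)
        elif codec["id"] == "endian":
            count = len(data) // itemsize
            data = list(struct.unpack(_endian_format(metadata, codec, count), data))
        else:
            raise ValueError(f"unsupported codec: {codec['id']!r}")
    return data


metadata = {
    "dtype": "float32",
    "codecs": [{"id": "endian", "endian": "big"}, {"id": "zlib"}],
}
encoded = encode_chunk([1.0, 2.5, -3.0], metadata)
decoded = decode_chunk(encoded, metadata)
```

The point of the sketch is that the caller only ever sees `dtype` and decoded values; the byte order lives entirely inside the codec list.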

If we do still want to encode the endianness as part of the zarr dtype name, you raise a good point that in addition to the be/le suffix there is the Boost.Endian big_/little_ prefix as an option. I haven't seen big_/little_ used as often as the be/le suffix, though.

@jbms
Contributor Author

jbms commented Mar 23, 2022

The Apache Arrow spec came up in the meeting in the context of awkward arrays, and it may indeed be interesting to see to what extent integrating with its data model makes sense. That could also have implications for data type naming.

Apache Arrow does not include endianness as part of the data type; instead, I believe a single endianness is specified at a higher level for an entire batch of serialized data.

Apache Arrow uses the normal bool, {uint,int}{8,16,32,64}, float{32,64} naming scheme for the base numerical data types.

@jakirkham
Member

cc @QuLogic (who may have thoughts on endianness)

@jbms
Contributor Author

jbms commented Mar 24, 2022

Thinking about this more:

I do think it makes sense to decouple endianness from the zarr data type itself for the reasons previously mentioned.

However, the current v3 proposal does not support a chain of codecs, and it would be nice to decouple the data type naming from that issue.

In my view, big endian is not widely used, and in most of the cases where it is used, it was chosen only as a default "network byte order"; those cases would actually be better served by little endian.

Therefore, we can simply say for now that all data types are encoded as little endian. In the future, when a chain of codecs is supported, there could be support for big endian encoding, if it is needed, via an "endian" codec.
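For illustration, a fixed little-endian encoding only costs a byte swap on big-endian hosts. The following Python sketch shows this for uint16 using the standard `array` module; the helper names are hypothetical, and it assumes the platform's `"H"` array type is 2 bytes wide, which the code checks.

```python
import sys
from array import array

# array("H") is at least 2 bytes; this sketch assumes it is exactly 2.
assert array("H").itemsize == 2


def encode_uint16_le(values):
    """Encode values as little-endian uint16 bytes on any host."""
    buf = array("H", values)
    if sys.byteorder == "big":
        buf.byteswap()  # native big endian -> little endian
    return buf.tobytes()


def decode_uint16_le(raw):
    """Decode little-endian uint16 bytes into native-endian values."""
    buf = array("H")
    buf.frombytes(raw)
    if sys.byteorder == "big":
        buf.byteswap()  # little endian -> native big endian
    return buf.tolist()
```

On a little-endian host both functions reduce to a plain copy, so the common case pays nothing for the fixed byte order.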

@jstriebel
Member

I believe this is fixed by #155. Feel free to re-open this issue if that is not the case.
