v3: use alternative data type syntax #131

jbms · 2022-02-08T18:36:30Z

The current data type identifiers in zarr v3 are similar to, but not identical to, NumPy type strings.

However, the NumPy type string syntax has some unfortunate limitations:

No natural way to specify bfloat16 or other floating point type variants.
No natural way to specify sizes that are not a multiple of 8 bits, e.g. a 4-bit or 12-bit type.

Of course it is still possible to extend the current naming scheme with arbitrary additions, so the current scheme does not impose any real limitation, but a different naming scheme would allow for greater consistency:

bool
int8
uint8
int16le
int16be
uint16be
float16le
float16be
float64le
float64be

etc.
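
To make the proposed naming scheme concrete, here is a minimal sketch of how such names could be split into a kind, a bit width, and an optional endianness suffix. The `ParsedDType` and `parse_dtype` names are hypothetical and not part of any zarr implementation. The rule that types of 8 bits or fewer take no suffix is an assumption inferred from the list above, and reporting `bool` as 8 bits is a placeholder.

```python
import re
from typing import NamedTuple, Optional


class ParsedDType(NamedTuple):
    kind: str               # "bool", "int", "uint", "float", or "bfloat"
    bits: int               # width in bits (may be a non-multiple of 8)
    endian: Optional[str]   # "le", "be", or None when no suffix is given


# Hypothetical grammar: <kind>[<bits>][le|be], e.g. "int16le", "uint8", "bfloat16".
_PATTERN = re.compile(r"^(bool|uint|int|bfloat|float)(\d+)?(le|be)?$")


def parse_dtype(name: str) -> ParsedDType:
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized data type name: {name!r}")
    kind, bits, endian = match.groups()
    if kind == "bool":
        if bits or endian:
            raise ValueError(f"bool takes no width or endianness: {name!r}")
        # Assumption: the width of bool is reported as 8 bits here.
        return ParsedDType("bool", 8, None)
    if bits is None:
        raise ValueError(f"missing bit width: {name!r}")
    width = int(bits)
    # Assumption: byte order is only meaningful above 8 bits.
    if width <= 8 and endian is not None:
        raise ValueError(f"endianness is not meaningful for {name!r}")
    return ParsedDType(kind, width, endian)
```

With this sketch, `parse_dtype("int16le")` yields the kind `"int"`, width 16, and suffix `"le"`, and a 12-bit name such as `"uint12"` parses without any special casing.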

jakirkham · 2022-03-23T19:22:23Z

Makes sense

Maybe we can follow something like stdint.h? Though I don't think that handles endianness

Edit: On Endianness, perhaps this approach from Boost would be worth looking at

jbms · 2022-03-23T20:57:51Z

For the base data type names, bool, {uint,int}{8,16,32,64}, float{16,32,64} and bfloat16 are pretty obvious canonical choices and the integer names are consistent with stdint.h and also are consistent with the numpy "names" rather than type strings.

For endianness there is less standardization. There is also the question of whether the endianness should be included in the zarr data type at all; arguably it is more a property of how the data is encoded in the chunks, which is something that zarr is intended to abstract over. For example, if you use a codec based on an image format, like png, jpeg, etc., then it isn't useful to be able to specify an additional endianness, since when encoding and decoding the image format you would always use native endianness.

If we go in that direction, the zarr dtype could avoid specifying endianness, and instead that would be specified in the chain of codecs, e.g.:

{
  "dtype": "float32",
  "codecs": [{"id": "endian", "endian": "big"}, {"id": "blosc", "cname": "lz4"}]
}

Then if using an image codec, we would instead have something like:

{
  "dtype": "uint16",
  "codecs": [{"id": "png"}]
}

Reading from such an array using zarr-python would always return data in native endianness, since the chunks are naturally decoded when they are read. I could imagine that in rare cases the user may want to keep the data in the same endianness in which it is encoded, e.g. if they are copying between two zarr arrays in the same non-native endianness. That is a case where the existing zarr v2 approach of encoding the endianness as part of the dtype is probably a better fit. But I do think that may be a rare use case.
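
As a rough illustration of the codec-chain idea, the sketch below applies an "endian" codec followed by a compressor when writing a chunk, and reverses the chain when reading. It uses only the Python standard library. zlib stands in for blosc, and the `"zlib"` codec id, `encode_chunk`, and `decode_chunk` are hypothetical names, not zarr-python API. It handles only `float32`.

```python
import struct
import zlib

_PREFIX = {"big": ">", "little": "<"}


def endian_encode(values, endian):
    # Serialize float32 values in the byte order named by the codec config.
    return struct.pack(f"{_PREFIX[endian]}{len(values)}f", *values)


def endian_decode(data, endian):
    # Decode back to plain Python floats, i.e. byte-order-independent values.
    count = len(data) // 4
    return list(struct.unpack(f"{_PREFIX[endian]}{count}f", data))


# Hypothetical metadata; zlib is swapped in for blosc to stay stdlib-only.
metadata = {
    "dtype": "float32",
    "codecs": [{"id": "endian", "endian": "big"}, {"id": "zlib", "level": 1}],
}


def encode_chunk(values, metadata):
    data = values
    for codec in metadata["codecs"]:          # apply codecs in order
        if codec["id"] == "endian":
            data = endian_encode(data, codec["endian"])
        elif codec["id"] == "zlib":
            data = zlib.compress(data, codec.get("level", -1))
        else:
            raise ValueError(f"unknown codec: {codec['id']!r}")
    return data


def decode_chunk(data, metadata):
    for codec in reversed(metadata["codecs"]):  # undo codecs in reverse order
        if codec["id"] == "endian":
            data = endian_decode(data, codec["endian"])
        elif codec["id"] == "zlib":
            data = zlib.decompress(data)
        else:
            raise ValueError(f"unknown codec: {codec['id']!r}")
    return data
```

Calling `decode_chunk(encode_chunk([1.0, -2.5, 3.25], metadata), metadata)` returns the original values regardless of the stored byte order, which is the sense in which the dtype itself no longer needs to carry endianness.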

If we do still want to encode the endianness as part of the zarr dtype name, you raise a good point that in addition to the be/le suffix there is the Boost.Endian big_/little_ prefix as an option. I haven't seen big_/little_ used as often as the be/le suffix, though.

jbms · 2022-03-23T21:03:18Z

The Apache Arrow spec came up in the meeting in the context of awkward arrays, and it may indeed be interesting to see to what extent integrating with its data model makes sense. That could also have implications for data type naming.

Apache Arrow does not include endianness as part of the data type --- instead I believe a single endianness is specified at a higher level for an entire batch of serialized data.

Apache Arrow uses the normal bool, {uint,int}{8,16,32,64}, float{32,64} naming scheme for the base numerical data types.

jakirkham · 2022-03-23T21:43:19Z

cc @QuLogic (who may have thoughts on endianness)

jbms · 2022-03-24T16:33:00Z

Thinking about this more:

I do think it makes sense to decouple endianness from the zarr data type itself for the reasons previously mentioned.

However, the current v3 proposal does not support a chain of codecs, and it would be nice to decouple the data type naming from that issue.

In my view, big endian is not widely used, and most of the cases where it is used are only due to its being chosen as a default "network byte order"; those cases would actually be better served by little endian.

Therefore, we can simply say for now that all data types are encoded as little endian. In the future, when a chain of codecs is supported, there could be support for big endian encoding, if it is needed, via an "endian" codec.
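
A minimal sketch of the "always little endian" rule, assuming the base type names discussed earlier map onto Python `struct` format codes. The `encode_le` and `decode_le` helpers and the mapping table are hypothetical. bfloat16 is omitted because the `struct` module has no code for it.

```python
import struct

# Assumed mapping from base data type names to struct format codes.
_STRUCT_CODES = {
    "bool": "?",
    "int8": "b", "uint8": "B",
    "int16": "h", "uint16": "H",
    "int32": "i", "uint32": "I",
    "int64": "q", "uint64": "Q",
    "float16": "e", "float32": "f", "float64": "d",
}


def encode_le(dtype, values):
    # "<" forces little-endian byte order and standard sizes on any host.
    return struct.pack(f"<{len(values)}{_STRUCT_CODES[dtype]}", *values)


def decode_le(dtype, data):
    code = _STRUCT_CODES[dtype]
    count = len(data) // struct.calcsize("<" + code)
    return list(struct.unpack(f"<{count}{code}", data))
```

For example, `encode_le("uint16", [0x0102])` stores the low byte first, so the same bytes are produced on big-endian and little-endian hosts.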

jstriebel · 2022-11-16T15:43:00Z

I believe this is fixed by #155, feel free to re-open this issue if this was not the case.

jakirkham mentioned this issue May 5, 2022

RFC: add data type inspection utilities to the array API specification data-apis/array-api#425

Closed

joshmoore mentioned this issue May 6, 2022

Invitation to Zarr Implementation Council (jzarr) zarr-developers/governance#29

Closed

jbms mentioned this issue Jul 25, 2022

ZEP0001 - Core v3.0 spec for review #149

Closed

jstriebel closed this as completed Nov 16, 2022

jstriebel mentioned this issue Nov 16, 2022

Use Arrow C data interface format strings? #61

Open
