Can't create big endian dtypes in V3 array #2324

Open
rabernat opened this issue Oct 9, 2024 · 7 comments
Labels
bug Potential issues with the zarr-python library

@rabernat
Contributor

rabernat commented Oct 9, 2024

This works with V2 data:

zarr.create(shape=10, dtype=">i2", zarr_version=2)
# -> <Array memory://4413530368 shape=(10,) dtype=>i2>

But it raises for V3:

zarr.create(shape=10, dtype=">i2", zarr_version=3)
File ~/gh/zarr-developers/zarr-python/src/zarr/codecs/__init__.py:40, in _get_default_array_bytes_codec(np_dtype)
     37 def _get_default_array_bytes_codec(
     38     np_dtype: np.dtype[Any],
     39 ) -> BytesCodec | VLenUTF8Codec | VLenBytesCodec:
---> 40     dtype = DataType.from_numpy(np_dtype)
     41     if dtype == DataType.string:
     42         return VLenUTF8Codec()

File ~/gh/zarr-developers/zarr-python/src/zarr/core/metadata/v3.py:599, in DataType.from_numpy(cls, dtype)
    581     return DataType.bytes
    582 dtype_to_data_type = {
    583     "|b1": "bool",
    584     "bool": "bool",
   (...)
    597     "<c16": "complex128",
    598 }
--> 599 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '>i2'

In the V3 spec, endianness is now handled by a codec: https://zarr-specs.readthedocs.io/en/latest/v3/codecs/bytes/v1.0.html

Xarray tests create data with big endian dtypes, and Zarr needs to know how to handle them.
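
To make the split described in the spec concrete, here is a minimal sketch that separates a NumPy dtype into a byte-order-free data type name and the endianness that a bytes codec would carry. It uses only NumPy; `split_endianness` is a hypothetical helper for illustration and not part of zarr-python, and it assumes a fixed-size numeric or boolean dtype.

```python
import sys

import numpy as np


def split_endianness(dtype):
    """Return (data_type_name, endian) for a fixed-size NumPy dtype.

    Hypothetical helper for illustration, not a zarr-python API.
    """
    dtype = np.dtype(dtype)
    # dtype.byteorder is '<', '>', '=' (native) or '|' (not applicable).
    byteorder = dtype.byteorder
    if byteorder == "=":
        byteorder = "<" if sys.byteorder == "little" else ">"
    endian = {"<": "little", ">": "big", "|": None}[byteorder]
    # dtype.name carries no byte-order information, e.g. "int16".
    return dtype.name, endian


print(split_endianness(">i2"))  # ('int16', 'big')
print(split_endianness("<i2"))  # ('int16', 'little')
print(split_endianness("|b1"))  # ('bool', None)
```

The lookup in the traceback above is keyed on `dtype.str`, which includes the byte order, so a key like `'>i2'` is missing; a name without byte order avoids that, at the cost of having to carry the endianness somewhere else.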

@d-v-b
Contributor

d-v-b commented Oct 9, 2024

If the codecs are unspecified, then I think we could automatically parametrize the BytesCodec based on the dtype. If the codecs are specified and the BytesCodec endianness doesn't match the endianness of the data, then we raise an exception.
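
A rough sketch of that rule, written against plain NumPy rather than the real codec classes: `resolve_bytes_endian` is a hypothetical function, and the `"little"` / `"big"` strings are assumed to mirror the `endian` setting of the bytes codec in the V3 spec.

```python
import sys

import numpy as np


def resolve_bytes_endian(dtype, codec_endian=None):
    """Pick or validate the bytes-codec endianness for `dtype`.

    Hypothetical helper for illustration, not a zarr-python API.
    """
    dtype = np.dtype(dtype)
    byteorder = dtype.byteorder
    if byteorder == "=":  # native order: resolve to an explicit one
        byteorder = "<" if sys.byteorder == "little" else ">"
    dtype_endian = {"<": "little", ">": "big", "|": None}[byteorder]

    if codec_endian is None:
        # Codecs unspecified: derive the endianness from the dtype.
        return dtype_endian
    if dtype_endian is not None and codec_endian != dtype_endian:
        # Codecs specified but inconsistent with the data: refuse.
        raise ValueError(
            f"bytes codec endian {codec_endian!r} does not match "
            f"dtype {dtype.str!r} ({dtype_endian})"
        )
    return codec_endian


print(resolve_bytes_endian(">i2"))         # big
print(resolve_bytes_endian(">i2", "big"))  # big
```

Calling `resolve_bytes_endian(">i2", "little")` raises `ValueError` in this sketch.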

But a bigger problem is that, by making endianness a serialization detail, the zarr dtype model has diverged from the numpy dtype model. If our array object uses zarr v3 data type semantics, then zarr.create(..., dtype=">i2") will return an array with dtype <i2 plus a special bytes codec. From the POV of functions like np.array_like, this zarr array will not have its "real" dtype; users might be surprised to see that zarr.create(..., dtype=">i2") and zarr.create(..., dtype="<i2") return arrays with the same dtype. I don't see an easy solution to this.
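
The NumPy side of this divergence can be seen without zarr at all. The snippet below is only an illustration of NumPy's own dtype semantics (it makes no claim about what zarr-python returns): the two dtypes share a name and hold equal values, but NumPy treats them as different dtypes and preserves the byte order through functions such as `np.zeros_like`.

```python
import numpy as np

big = np.dtype(">i2")
little = np.dtype("<i2")

# Same logical type name, but NumPy considers the dtypes different.
print(big.name, little.name)  # int16 int16
print(big == little)          # False

a = np.arange(3, dtype=big)
b = a.astype(little)

# The values are equal; only the in-memory byte layout differs.
print(np.array_equal(a, b))        # True
print(a.tobytes() == b.tobytes())  # False

# NumPy functions that derive a dtype from an array keep the byte order.
print(np.zeros_like(a).dtype == big)  # True
```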

@rabernat
Contributor Author

One solution could be to always translate the endianness of the on-disk data to the endianness of the in-memory data. This could be done within BytesCodec. However, it would be hard, since endianness is not part of ArraySpec.
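
As a sketch of what such a translation step might look like in isolation, the function below reads raw bytes with the stored byte order and converts them to the machine's native order. `decode_to_native` is a hypothetical name, and this deliberately ignores how the step would be wired into BytesCodec or ArraySpec.

```python
import numpy as np


def decode_to_native(raw: bytes, stored_dtype) -> np.ndarray:
    """Interpret `raw` using the stored byte order, return a native-order copy.

    Hypothetical helper for illustration, not a zarr-python API.
    """
    # View the buffer with the byte order it was written in.
    stored = np.frombuffer(raw, dtype=np.dtype(stored_dtype))
    # Same type and item size, but in the machine's native byte order.
    native = stored.dtype.newbyteorder("=")
    # astype converts the values, swapping bytes where needed.
    return stored.astype(native)


raw = np.arange(4, dtype=">i2").tobytes()
out = decode_to_native(raw, ">i2")
print(out.tolist())        # [0, 1, 2, 3]
print(out.dtype.isnative)  # True
```

With this approach the in-memory dtype is always native, so the original byte order would have to be recorded elsewhere if it needs to round-trip.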

@dstansby
Contributor

Looks like this needs either resolving or documenting as a breaking change at #2596 for zarr 3.

@normanrz
Member

normanrz commented Jan 7, 2025

Should we put endianness in the new runtime ArrayConfig? We could parse the dtype to set it.

@jhamman changed the milestone from 3.0.0 to After 3.0.0 on Jan 8, 2025

@jhamman
Member

jhamman commented Jan 8, 2025

I've moved this to "After 3.0.0" and will be adding this to the work in progress section of the v3 migration docs.

@astrofrog

I'm running into this too. Just to check: is this going to be fixed in the 3.0.x series of releases, or is it a permanent breaking change that we should adapt existing code to?

@d-v-b
Contributor

d-v-b commented Feb 1, 2025

I think we intend to fix this, but it will force us to revise the semantics of the Array.dtype attribute. The alternative to handling endianness the way users expect is unacceptable IMO.
