[v3] Fixed-width unicode string support in zarr v3 #2347

Open
TomAugspurger opened this issue Oct 12, 2024 · 1 comment
Labels
bug Potential issues with the zarr-python library

TomAugspurger commented Oct 12, 2024

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

As mentioned in #2323 (comment), we currently can't create an array with a fixed-width string dtype in zarr v3.

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())

We would want the NumPy dtype of that array to be U3, a fixed-width Unicode string dtype, and we'd want to support it in addition to the variable-width strings currently in use. Some initial questions I don't know the answer to:

  1. What data_type shows up in the metadata?
  2. What codecs are needed?
  3. How are the actual bytes stored? In parquet, fixed_len_byte_array is one of the primitive types.
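
For question 3, one possible layout is to store each element as a fixed number of 4-byte code points, null-padded to the declared width. The following is only a sketch under that assumption, using the standard library alone. `encode_fixed_width` and `decode_fixed_width` are hypothetical helper names, not zarr-python APIs, and nothing here is a decided codec design.

```python
# Hypothetical helpers (not part of zarr-python): one possible fixed-width layout.
WIDTH = 3           # characters per element, as in "U3"
BYTES_PER_CHAR = 4  # one UTF-32 code unit per character


def encode_fixed_width(values, width=WIDTH):
    """Encode strings as little-endian UTF-32, null-padded to `width` characters."""
    chunks = []
    for value in values:
        if len(value) > width:
            raise ValueError(f"{value!r} is longer than {width} characters")
        raw = value.encode("utf-32-le")
        chunks.append(raw.ljust(width * BYTES_PER_CHAR, b"\x00"))
    return b"".join(chunks)


def decode_fixed_width(buffer, width=WIDTH):
    """Split the buffer into fixed-size items and strip trailing null padding."""
    item_size = width * BYTES_PER_CHAR
    if len(buffer) % item_size:
        raise ValueError("buffer length is not a multiple of the item size")
    items = []
    for start in range(0, len(buffer), item_size):
        text = buffer[start:start + item_size].decode("utf-32-le")
        items.append(text.rstrip("\x00"))
    return items


encoded = encode_fixed_width(["a", "bb", "ccc"])
decoded = decode_fixed_width(encoded)
```

Because every element occupies the same number of bytes, a chunk can be sliced by offset without any length prefix or index, which is the main difference from a variable-width string encoding.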

Steps to reproduce

.

Additional output

No response

@TomAugspurger TomAugspurger added the bug Potential issues with the zarr-python library label Oct 12, 2024
@jhamman jhamman added this to the 3.0.0 milestone Oct 18, 2024
@jhamman jhamman changed the title Fixed-width unicode string support in zarr v3 [v3] Fixed-width unicode string support in zarr v3 Oct 18, 2024
@jhamman jhamman moved this to Todo in Zarr-Python - 3.0 Oct 18, 2024
@jhamman jhamman added the V3 label Oct 18, 2024
TomAugspurger commented

I'm not sure I'll be able to work on this, but here are some notes on the V2 behavior:

>>> import numpy as np
>>> import zarr
>>> import json

>>> b = np.array([b'a', b'bb', b'ccc'])
>>> u = np.array(['a', 'bb', 'ccc'])
>>> store = {}

>>> zarr.array(b, store=store, path="bytes", compressor=None)
>>> zarr.array(u, store=store, path="unicode", compressor=None)

>>> print(json.loads(store['bytes/.zarray'])['dtype'])
|S3

>>> print(json.loads(store['unicode/.zarray'])['dtype'])
<U3

>>> assert store['bytes/0'] == b.tobytes()
>>> assert store['unicode/0'] == u.tobytes()

NumPy stores Unicode data as 32-bit UCS-4 code points, so each U3 element takes 4 bytes per character * 3 characters = 12 bytes, and len(u.tobytes()) is 36 for the three elements. For bytes data, it stores the raw byte values (ASCII here), padded with null bytes to the fixed width.
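
The byte layout described here can be checked with NumPy alone. This is a small sketch, not zarr code: it pins the byte order with `<U3` so the result does not depend on the machine, and the variable names are only illustrative.

```python
import numpy as np

# Explicit dtypes so the byte order does not depend on the host machine.
u = np.array(["a", "bb", "ccc"], dtype="<U3")
b = np.array([b"a", b"bb", b"ccc"], dtype="S3")

unicode_itemsize = u.dtype.itemsize  # 4 bytes per character * 3 characters
bytes_itemsize = b.dtype.itemsize    # 1 byte per character * 3 characters

unicode_raw = u.tobytes()
bytes_raw = b.tobytes()

# The first element of each buffer, including its null padding.
first_unicode_item = unicode_raw[:unicode_itemsize]
first_bytes_item = bytes_raw[:bytes_itemsize]

# The raw buffer can be reinterpreted with the same dtype to recover the strings.
roundtrip = np.frombuffer(unicode_raw, dtype=u.dtype)
```

Both dtypes pad short values with null bytes up to the fixed item size, so the only difference between them at the byte level is the width of each character.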

@jhamman jhamman modified the milestones: 3.0.0, After 3.0.0 Dec 2, 2024
@dstansby dstansby removed the V3 label Dec 12, 2024