[v3] Fixed-width unicode string support in zarr v3 #2347

TomAugspurger · 2024-10-12T19:32:27Z

Zarr version

v3

Numcodecs version

na

Python Version

na

Operating System

na

Installation

na

Description

As mentioned in #2323 (comment), we currently can't create a fixed-width string dtype in zarr v3.

In [1]: import zarr

In [2]: arr = zarr.create(shape=(3,), dtype="U3")

In [3]: arr[:] = ['a', 'bb', 'ccc']

In [4]: arr[:]
Out[4]: array(['a', 'bb', 'ccc'], dtype=StringDType())

We would want the NumPy dtype of that array to be U3, a fixed-width unicode string dtype. We'd want to support this in addition to the variable-width strings currently in use. Some initial questions I don't know the answer to:

What data_type shows up in the metadata?
What codecs are needed?
How are the actual bytes stored? In parquet, fixed_len_byte_array is one of the primitive types.
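
For the third question, here is a minimal NumPy-only sketch of one possible layout: a flat buffer of fixed-length records, one per element, which is how NumPy itself lays out a U3 array in memory. This only illustrates the round trip; it makes no claim about what metadata or codecs zarr v3 would use.

```python
import numpy as np

# Fixed-width unicode array; NumPy infers the width (3) from the longest element.
u = np.array(["a", "bb", "ccc"])
assert u.dtype == np.dtype("U3")

# Serialize to a flat buffer of fixed-length records (one record per element).
buf = u.tobytes()

# Reading the buffer back only requires the dtype string and the shape.
roundtrip = np.frombuffer(buf, dtype=u.dtype.str).reshape(u.shape)
assert roundtrip.dtype == u.dtype
assert roundtrip.tolist() == ["a", "bb", "ccc"]
```

The dtype string (`u.dtype.str`) carries both the width and the byte order, so a stored array would need to record both to be decoded correctly.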

Steps to reproduce

.

Additional output

No response


TomAugspurger · 2024-11-15T15:42:31Z

I'm not sure if I'll be able to work on this, but here are some notes on the V2 behavior:

>>> import numpy as np
>>> import zarr
>>> import json

>>> b = np.array([b'a', b'bb', b'ccc'])
>>> u = np.array(['a', 'bb', 'ccc'])
>>> store = {}

>>> zarr.array(b, store=store, path="bytes", compressor=None)
>>> zarr.array(u, store=store, path="unicode", compressor=None)

>>> print(json.loads(store['bytes/.zarray'])['dtype'])
|S3

>>> print(json.loads(store['unicode/.zarray'])['dtype'])
<U3

>>> assert store['bytes/0'] == b.tobytes()
>>> assert store['unicode/0'] == u.tobytes()

NumPy uses 32-bit UCS-4 code points for Unicode data (ref), so each element occupies 4 bytes per character * 3 characters = 12 bytes, since that's the fixed width; len(u.tobytes()) is therefore 36 for this three-element array. Shorter strings are padded with null code points. For bytes data, it uses the ASCII values padded with null bytes to the fixed width.
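
To make the byte layout concrete, here is a small NumPy-only check of the sizes and padding described above. The explicit `<U3` dtype is chosen here so the byte order is little-endian regardless of platform; nothing in it depends on zarr.

```python
import numpy as np

# Force little-endian so the byte-level comparison below is platform independent.
u = np.array(["a", "bb", "ccc"], dtype="<U3")
b = np.array([b"a", b"bb", b"ccc"])  # inferred dtype is |S3

# Unicode: 4 bytes per code point * 3 code points per element.
assert u.dtype.itemsize == 12
assert len(u.tobytes()) == 36  # 3 elements * 12 bytes

# Bytes: 1 byte per character * 3 characters per element, null padded.
assert b.dtype.itemsize == 3
assert b.tobytes() == b"a\x00\x00bb\x00ccc"

# The first unicode element is 'a' followed by two null code points,
# which matches a UTF-32 little-endian encoding without a byte order mark.
first = u.tobytes()[:12]
assert first == "a\x00\x00".encode("utf-32-le")
```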

dstansby · 2024-12-30T17:41:05Z

Given it doesn't look like this functionality will get into 3.0, it looks like this breaking change is something else to add to #2596.

h-mayorquin · 2025-01-29T20:22:27Z

Leaving these links here, as I did not find them on this page:

https://hackmd.io/@ivirshup/SkdO2szas
zarr-developers/zeps#47

TomAugspurger added the bug Potential issues with the zarr-python library label Oct 12, 2024

jhamman mentioned this issue Oct 12, 2024

[v3] String support for v3 array #2268

Closed

jhamman added this to Zarr-Python - 3.0 Oct 18, 2024

jhamman added this to the 3.0.0 milestone Oct 18, 2024

jhamman mentioned this issue Oct 18, 2024

Tracking 3.0 Release Blockers #2412

Closed

33 tasks

jhamman changed the title ~~Fixed-width unicode string support in zarr v3~~ [v3] Fixed-width unicode string support in zarr v3 Oct 18, 2024

jhamman moved this to Todo in Zarr-Python - 3.0 Oct 18, 2024

jhamman added the V3 label Oct 18, 2024

This was referenced Nov 1, 2024

Monthly issue metrics report #2455

Closed

Monthly issue metrics report sanketverma1704/zarr-python#3

Open

jhamman modified the milestones: 3.0.0, After 3.0.0 Dec 2, 2024

dstansby removed the V3 label Dec 12, 2024

bendichter mentioned this issue Jan 9, 2025

[Feature]: Support zarr-python v3 hdmf-dev/hdmf-zarr#202

Open

3 tasks
