Add array storage helpers #2065

d-v-b · 2024-08-03T10:37:50Z

This PR adds nchunks, nbytes, and nchunks_initialized functionality from 2.x.

closes #2027
depends on #2064

details

Adds the following to array.py:

(AsyncArray / Array).nchunks : deprecated, the total number of chunks in the array. exists for 2.xx compatibility.
(AsyncArray / Array).cdata_shape : deprecated, the shape of the chunk grid. exists for 2.xx compatibility.
(AsyncArray / Array).nbytes : the total number of bytes that the array can store
(AsyncArray / Array)._iter_chunk_coords : an iterator over tuples of ints which represent positions in the chunk grid
(AsyncArray / Array)._iter_chunk_regions : an iterator over slices which represent the contiguous array region spanned by each chunk
(AsyncArray / Array)._iter_chunk_keys : an iterator over strings which represent the paths in storage for all the chunks
chunks_initialized(array): a function that takes an array and returns a tuple of the chunk keys for that array that exist in storage. this also has tests.
nchunks_initialized(array): deprecated, a function that calls len(chunks_initialized(array)). this exists for 2.xx compatibility.
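
For readers unfamiliar with the 2.x attributes, here is a minimal sketch of how the three size-related quantities relate to each other. It uses plain Python rather than the code in this PR; the helper names and the `itemsize` parameter are illustrative assumptions, and it assumes a regular chunk grid where edge chunks may be partial.

```python
import math

def cdata_shape(shape, chunks):
    """Shape of the chunk grid: number of chunks along each axis (ceiling division)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))

def nchunks(shape, chunks):
    """Total number of chunks: the product of the chunk grid shape."""
    return math.prod(cdata_shape(shape, chunks))

def nbytes(shape, itemsize):
    """Bytes the array can hold uncompressed: element count times bytes per element."""
    return math.prod(shape) * itemsize

# Hypothetical array: 10 x 7 elements, 4 x 3 chunks, 8-byte elements.
shape, chunks = (10, 7), (4, 3)
grid = cdata_shape(shape, chunks)
total_chunks = nchunks(shape, chunks)
total_bytes = nbytes(shape, itemsize=8)
```

Note that `nbytes` here depends only on the shape and element size, not on how many chunks have actually been written.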

All of the above _iter_chunk_* methods should be considered private and provisional. I added them because their functionality is valuable, but eventually I think we will have a better array API that renders these methods obsolete. If we think these are cluttering the array API, I'd be happy splitting them off into stand-alone functions.

Adds a function iter_grid to indexing.py. It provides lexicographic iteration over the elements of a bounded, positive N-dimensional grid (e.g., a grid of chunks).
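
A rough sketch of the idea, not the implementation in this PR: lexicographic grid iteration can be expressed with `itertools.product`, and chunk regions and storage keys can be derived from each grid position. The helper names `chunk_region` and `chunk_key`, the clipping of edge chunks to the array bounds, and the `c/`-prefixed, `/`-separated key layout are assumptions made for illustration.

```python
from itertools import product

def iter_grid(grid_shape):
    """Yield every position of a bounded grid in lexicographic order (last axis varies fastest)."""
    yield from product(*(range(n) for n in grid_shape))

def chunk_region(coords, chunks, shape):
    """Slices covering the array region of one chunk, clipped to the array bounds (assumption)."""
    return tuple(
        slice(i * c, min((i + 1) * c, s)) for i, c, s in zip(coords, chunks, shape)
    )

def chunk_key(coords, prefix="c", sep="/"):
    """Storage key for one chunk, assuming a prefix plus separator-joined coordinates."""
    return sep.join((prefix, *map(str, coords)))

# Hypothetical array: shape (5, 4) with chunks (2, 4) gives a 3 x 1 chunk grid.
shape, chunks = (5, 4), (2, 4)
grid_shape = tuple(-(-s // c) for s, c in zip(shape, chunks))
coords = list(iter_grid(grid_shape))
regions = [chunk_region(c, chunks, shape) for c in coords]
keys = [chunk_key(c) for c in coords]
```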

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

d-v-b · 2024-08-03T14:56:57Z

@tomwhite let me know if this looks workable for you

tomwhite · 2024-08-05T09:03:20Z

Thanks @d-v-b this looks great!

I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

d-v-b · 2024-08-05T09:13:21Z

I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

my thinking for this is twofold:

with the new chunks_initialized function that gives the names of the initialized chunks, one can easily do len(chunks_initialized(...)), i.e. we don't need a separate function to express the composition of chunks_initialized and len. similarly, nchunks is merely len(array._iter_chunk_keys). If this logic is unsound, or these deprecation warnings are a problem, then we can remove them, but see the second point:
we haven't yet figured out how we are going to express sharded arrays in the top-level array API, and I think those decisions might require rethinking how we express chunking more broadly. see conversations happening in this discussion. Until we solve that problem, I don't feel comfortable committing to any "here's what your chunks are like" APIs, especially if they are APIs that developed pre-sharding. hence adding private methods in this PR, and the deprecation warnings.
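
To make the first point concrete, here is a hedged sketch of the composition being described. It models the store as a plain set of keys, which is an assumption for illustration; the real function takes an array and queries its store.

```python
def chunks_initialized(all_chunk_keys, store_keys):
    """Return the chunk keys of an array that actually exist in storage."""
    present = set(store_keys)
    return tuple(key for key in all_chunk_keys if key in present)

def nchunks_initialized(all_chunk_keys, store_keys):
    """The 2.x-style count is just the length of the tuple above."""
    return len(chunks_initialized(all_chunk_keys, store_keys))

# Hypothetical array with four chunks, two of which have been written.
all_keys = ["c/0/0", "c/0/1", "c/1/0", "c/1/1"]
stored = {"zarr.json", "c/0/0", "c/1/1"}
initialized = chunks_initialized(all_keys, stored)
count = nchunks_initialized(all_keys, stored)
```

One caveat: if the chunk keys come from an iterator rather than a sequence, counting them needs something like `sum(1 for _ in keys)`, since `len` does not work on iterators.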

does this check out? I'm sorry if the warnings are inconvenient, but I really would like to find a proper expression of v3 semantics on the Array class and I worry that a blanket policy of forward-propagating v2-isms could be a hindrance to that effort.

d-v-b · 2024-08-05T09:17:03Z

The number of chunks in an array is something that should always be well-defined.

to expand on this: v3 introduces two kinds of chunks, read-chunks and write chunks. the number of read chunks may not equal the number of write chunks. so where we had 1 nchunks quantity in v2, v3 has two possible answers to nchunks. that's why it is not straightforward to commit to this aspect of the array API.
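
A small numeric sketch of why there are two possible answers, assuming a sharded array where shards are the write chunks and the inner chunks of each shard are the read chunks. The shapes are made up for illustration.

```python
import math

def count_blocks(shape, block_shape):
    """Number of blocks of a given shape needed to tile an array (ceiling division per axis)."""
    return math.prod(-(-s // b) for s, b in zip(shape, block_shape))

# Hypothetical sharded array.
array_shape = (100, 100)
shard_shape = (50, 50)        # unit of writing: one stored object per shard
inner_chunk_shape = (10, 10)  # unit of reading: many inner chunks per shard

n_write_chunks = count_blocks(array_shape, shard_shape)
n_read_chunks = count_blocks(array_shape, inner_chunk_shape)
```

Here `n_write_chunks` is 4 and `n_read_chunks` is 100, so a single `nchunks` attribute would have to pick one of the two.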


jhamman

@d-v-b - First of all, thanks for working on this! I must say I've soured a bit on deprecating some of these interfaces for the 3.0 release unless we have something to replace them. There's nothing to keep us from deprecating these in 3.1 (for example) if we come up with a new interface. What do you think about removing the warnings for now and coming back to this with a new interface down the road?

The number of chunks in an array is something that should always be well-defined.

to expand on this: v3 introduces two kinds of chunks, read-chunks and write chunks. the number of read chunks may not equal the number of write chunks. so where we had 1 nchunks quantity in v2, v3 has two possible answers to nchunks. that's why it is not straightforward to commit to this aspect of the array API.

I understand that we're trying to incorporate sharding here. At the risk of opening up a big can of worms, I think we may be taking this too far. To me it's much easier to think of chunks as the minimal block of data. Beyond that, sharding may allow you to store many chunks in a single object.

jhamman · 2024-09-24T02:51:26Z

src/zarr/core/array.py

@@ -443,6 +449,55 @@ def basename(self) -> str | None:
            return self.name.split("/")[-1]
        return None

+    @property
+    @deprecated("AsyncArray.cdata_shape may be removed in an early zarr-python v3 release.")
+    def cdata_shape(self) -> ChunkCoords:


I'm curious which of these helpers could migrate to the purview of the chunk grid.

d-v-b · 2024-09-24T07:15:49Z

What do you think about removing the warnings for now and coming back to this with a new interface down the road?

That works for me!

I understand that we're trying to incorporate sharding here. At the risk of opening up a big can of worms, I think we may be taking this too far. To me it's much easier to think of chunks as the minimal block of data. Beyond that, sharding may allow you to store many chunks in a single object.

Noted, I will pull back from the brink and make things more v2-ish again :)


d-v-b · 2024-09-24T13:14:49Z

@jhamman take a look when you have time, I think I addressed your concerns.

d-v-b · 2024-09-24T13:23:35Z

I should point out that this PR also contains some changes unrelated to the array API, but I think they are useful improvements:

adds a _get_many method to all stores (the default implementation calls get in a loop). Stores that support any kind of batching / streaming can override this method as needed.
defines a ByteRangeRequest type and ensures that all functions / methods that work with byte ranges are typed with ByteRangeRequest. Previously, we had inconsistent typing for that parameter across store methods and implementations.
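
As an illustration of the "call get in a loop" default, here is a simplified sketch using a hypothetical in-memory `DictStore`. The exact signatures in the PR may differ; in particular, treating `ByteRangeRequest` as an optional `(start, length)` pair and passing raw `bytes` instead of buffer objects are assumptions made to keep the example self-contained.

```python
import asyncio
from collections.abc import AsyncIterator, Iterable
from typing import Optional

# Assumed shape of a byte range request: (start, length), either of which may be None.
ByteRangeRequest = tuple[Optional[int], Optional[int]]

class DictStore:
    """Hypothetical store backed by a dict, for illustration only."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    async def get(
        self, key: str, byte_range: Optional[ByteRangeRequest] = None
    ) -> Optional[bytes]:
        value = self._data.get(key)
        if value is None or byte_range is None:
            return value
        start, length = byte_range
        start = 0 if start is None else start
        stop = None if length is None else start + length
        return value[start:stop]

    async def _get_many(
        self, requests: Iterable[tuple[str, Optional[ByteRangeRequest]]]
    ) -> AsyncIterator[tuple[str, Optional[bytes]]]:
        # Default behaviour: one get() call per request, yielded in request order.
        # A store with native batching or streaming could override this method.
        for key, byte_range in requests:
            yield key, await self.get(key, byte_range)

async def collect(store, requests):
    return [item async for item in store._get_many(requests)]

store = DictStore({"c/0": b"abcdef"})
results = asyncio.run(collect(store, [("c/0", None), ("c/0", (1, 3)), ("c/1", None)]))
```

Missing keys come back as `None` rather than raising, so callers can tell which requested keys were absent.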


d-v-b added 6 commits August 3, 2024 11:36

implement store.list_prefix and store._set_dict

ebbfbe0

simplify string handling

da6083e

add nchunks_initialized, and necessary additions for it

dc5fe47

rename _iter_chunks to _iter_chunk_coords

b694b6e

fix test name

6a27ca8

bring in correct store list_dir implementations

d15be9a

d-v-b requested review from jhamman and normanrz August 3, 2024 14:56

jhamman reviewed Aug 6, 2024


jhamman added the V3 label Aug 9, 2024

d-v-b added 13 commits August 12, 2024 22:21

Merge branch 'v3' of https://github.com/zarr-developers/zarr-python into add-array-storage-helpers

ef34f25

bump numcodecs to dodge zstd exception

962ffed

remove store._set_dict, and add _set_many and get_many instead

5c98ab4

update deprecation warning template

9e64fa8

add a type annotation

a4b4696

refactor chunk iterators. they are not properties any more, just methods, and they can take an origin kwarg

04b1d6a

Merge branch 'v3' of https://github.com/zarr-developers/zarr-python into add-array-storage-helpers

12b3bc1

Merge branch 'v3' of github.com:zarr-developers/zarr-python into add-array-storage-helpers

3e2c656

_get_many returns tuple[str, buffer]

b7c1a56

stricter store types

44bed5c

Merge branch 'v3' of github.com:zarr-developers/zarr-python into add-array-storage-helpers

021d41e

fix types

2db860b

Merge branch 'v3' of github.com:zarr-developers/zarr-python into add-array-storage-helpers

45f27b1

jhamman requested changes Sep 24, 2024


Merge branch 'v3' of github.com:zarr-developers/zarr-python into add-array-storage-helpers

78f22b9

d-v-b added 5 commits September 24, 2024 12:10

lint

b5e08e8

remove deprecation warnings

43743e1

fix zip list_prefix

f65a6e8

tests for nchunks_initialized, chunks_initialized; add selection_shape kwarg to grid iteration; make chunk grid iterators consistent for array and async array

df6f9a7

add nchunks test

e60cbe0

fix docstrings

5c54449

d-v-b added 4 commits September 24, 2024 15:25

fix docstring

e8598c6

Merge branch 'v3' into add-array-storage-helpers

ae216e1

revert unnecessary changes to project config

768ab43

Merge branch 'add-array-storage-helpers' of github.com:d-v-b/zarr-python into add-array-storage-helpers

c953f21

jhamman approved these changes Sep 25, 2024


Merge branch 'v3' of https://github.com/zarr-developers/zarr-python into add-array-storage-helpers

f0d61b2

d-v-b merged commit f0443db into zarr-developers:v3 Sep 26, 2024
26 checks passed

d-v-b deleted the add-array-storage-helpers branch September 26, 2024 18:56

jhamman added this to the 3.0.0.beta milestone Oct 17, 2024

kylebarron mentioned this pull request Oct 23, 2024

object-store-based Store implementation #1661
