Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add array storage helpers #2065

Merged
merged 31 commits into from
Sep 26, 2024
Merged

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Aug 3, 2024

This PR adds nchunks, nbytes, and nchunks_initialized functionality from 2.x.

closes #2027
depends on #2064

details

Adds the following to array.py:

  • (AsyncArray / Array).nchunks : deprecated, the total number of chunks in the array. exists for 2.xx compatibility.
  • (AsyncArray / Array).cdata_shape : deprecated, the shape of the chunk grid. exists for 2.xx compatibility.
  • (AsyncArray / Array).nbytes : the total number of bytes that the array can store
  • (AsyncArray / Array)._iter_chunk_coords : an iterator over tuples of ints which represent positions in the chunk grid
  • (AsyncArray / Array)._iter_chunk_regions : an iterator over slices which represent the contiguous array region spanned by each chunk
  • (AsyncArray / Array)._iter_chunk_keys : an iterator over strings which represent the paths in storage for all the chunks
  • chunks_initialized(array): a function that takes an array and returns a tuple of the chunk keys for that array that exist in storage. this also has tests.
  • nchunks_initialized(array): deprecated, a function that calls len(chunks_initialized(array)). this exists for 2.xx compatibility.

All of the above _iter_chunk_* methods should be considered private and provisional. I added them because their functionality is valuable, but eventually I think we will have a better array API that renders these methods obsolete. If we think these are cluttering the array API, I'd be happy splitting them off into stand-alone functions.

  • adds a function iter_grid to indexing.py, this just provides lexicographic iteration over the elements of a bounded N-dimensional, positive grid (e.g., a grid of chunks).

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from jhamman and normanrz August 3, 2024 14:56
@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 3, 2024

@tomwhite let me know if this looks workable for you

@tomwhite
Copy link
Contributor

tomwhite commented Aug 5, 2024

Thanks @d-v-b this looks great!

I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 5, 2024

I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

my thinking for this is twofold:

  • with the new chunks_initialized function that gives the names of the initialized chunks, one can easily do len(chunks_initialized(...)), i.e. we don't need a separate function to express the composition of chunks_initialized and len. similarly, nchunks is merely len(array._iter_chunk_keys). If this logic is unsound, or these deprecation warnings are a problem, then we can remove them, but see the second point:
  • we haven't yet figured out how we are going to express sharded arrays in the top-level array API, and I think those decisions might require rethinking how we express chunking more broadly. see conversations happening in this discussion. Until we solve that problem, I don't feel comfortable committing to any "here's what your chunks are like" APIs, especially if they are APIs that developed pre-sharding. hence adding private methods in this PR, and the deprecation warnings.

does this check out? I'm sorry if the warnings are inconvenient, but I really would like to find a proper expression of v3 semantics on the Array class and I worry that a blanket policy of forward-propagating v2-isms could be a hindrance to that effort.

@d-v-b
Copy link
Contributor Author

d-v-b commented Aug 5, 2024

The number of chunks in an array is something that should always be well-defined.

to expand on this: v3 introduces two kinds of chunks, read-chunks and write chunks. the number of read chunks may not equal the number of write chunks. so where we had 1 nchunks quantity in v2, v3 has two possible answers to nchunks. that's why it is not straightforward to commit to this aspect of the array API.

src/zarr/abc/store.py Outdated Show resolved Hide resolved
@jhamman jhamman added the V3 label Aug 9, 2024
Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-v-b - First of all, thanks for working on this! I must say I've soured a bit on deprecating some of these interfaces for the 3.0 release unless we have something to replace them. There's nothing to keep us from deprecating these in 3.1 (for example) if we come up with a new interface. What do you think about removing the warnings for now and coming back to this with a new interface down the road?


The number of chunks in an array is something that should always be well-defined.

to expand on this: v3 introduces two kinds of chunks, read-chunks and write chunks. the number of read chunks may not equal the number of write chunks. so where we had 1 nchunks quantity in v2, v3 has two possible answers to nchunks. that's why it is not straightforward to commit to this aspect of the array API.

I understand that we're trying to incorporate sharding here. At the risk of opening up a big can of worms, I think a we may be taking this too far. To me its much easier to think of chunks as the minimal block of data. Beyond that, sharding may allow you to store many chunks in a single object.

@@ -443,6 +449,55 @@ def basename(self) -> str | None:
return self.name.split("/")[-1]
return None

@property
@deprecated("AsyncArray.cdata_shape may be removed in an early zarr-python v3 release.")
def cdata_shape(self) -> ChunkCoords:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious which of these helpers could migrate to the purview of the chunk grid.

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 24, 2024

What do you think about removing the warnings for now and coming back to this with a new interface down the road?

That works for me!

I understand that we're trying to incorporate sharding here. At the risk of opening up a big can of worms, I think a we may be taking this too far. To me its much easier to think of chunks as the minimal block of data. Beyond that, sharding may allow you to store many chunks in a single object.

Noted, I will pull back from the brink and make things more v2-ish again :)

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 24, 2024

@jhamman take a look when you have time, I think I addressed your concerns.

@d-v-b
Copy link
Contributor Author

d-v-b commented Sep 24, 2024

I should point out that this PR also contains some changes unrelated the the array API, but I think they are useful improvements:

  • adds a _get_many method to all stores (the default implementation calls get in a loop). Stores that support any kind of batching / streaming can override this method as needed.
  • defines a ByteRangeRequest type and ensures that all functions / methods that work with byte ranges are typed with ByteRangeRequest. Previously, we had inconsistent typing for that parameter across store methods and implementations.

@d-v-b d-v-b merged commit f0443db into zarr-developers:v3 Sep 26, 2024
26 checks passed
@d-v-b d-v-b deleted the add-array-storage-helpers branch September 26, 2024 18:56
dcherian added a commit to dcherian/zarr-python that referenced this pull request Sep 27, 2024
* v3: (21 commits)
  Default zarr.open to open_group if shape is not provided (zarr-developers#2158)
  feat: metadata-only support for storage transformers metadata (zarr-developers#2180)
  fix(async): set default concurrency to 10 tasks (zarr-developers#2256)
  chore(deps): drop support for python 3.10 and numpy 1.24 (zarr-developers#2217)
  feature(store): add LoggingStore wrapper (zarr-developers#2231)
  Apply assorted ruff/flake8-simplify rules (SIM) (zarr-developers#2259)
  Add array storage helpers (zarr-developers#2065)
  Apply ruff/flake8-annotations rule ANN204 (zarr-developers#2258)
  No need to run DeepSource any more - we use ruff (zarr-developers#2261)
  Remove unnecessary lambda expression (zarr-developers#2260)
  Enforce ruff/flake8-comprehensions rules (C4) (zarr-developers#2239)
  Use `map(str, *)` in `test_accessed_chunks` (zarr-developers#2229)
  Replace Gitter with Zulip (zarr-developers#2254)
  Enforce ruff/flake8-pytest-style rules (PT) (zarr-developers#2236)
  Fix multiple identical imports (zarr-developers#2241)
  Enforce ruff/flake8-return rules (RET) (zarr-developers#2237)
  Enforce ruff/flynt rules (FLY) (zarr-developers#2240)
  Fix fill_value handling for complex dtypes (zarr-developers#2200)
  Update V2 codec pipeline to use concrete classes (zarr-developers#2244)
  Apply and enforce more ruff rules (zarr-developers#2053)
  ...
@jhamman jhamman added this to the 3.0.0.beta milestone Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[v3] Missing array attributes: nbytes, nchunks, nchunks_initialized
3 participants