Data view / slice of Zarr array without loading entire array #980

Open
aliaksei-chareshneu opened this issue Mar 5, 2022 · 5 comments
Labels
documentation Improvements to the documentation

Comments

@aliaksei-chareshneu

Dear all,

Could you please tell me how to get a data view of a Zarr array? Performance is the key concern.

From the docs, it looks like there are two options:

  • Use __getitem__ via ":" slice notation
    (store is an existing DirectoryStore containing one group 'sgroup' with one 3D array 'sarr')
root = zarr.group(store=store)
arr = root.sgroup.sarr
subarr = arr[1:3, 1:3, 1:3]
  • Use get_basic_selection
root = zarr.group(store=store)
arr = root.sgroup.sarr
subarr = arr.get_basic_selection((slice(1, 3), slice(1, 3), slice(1, 3)))

In general, what is the difference between them? Would both options really return the slice without loading the entire array? Are there better alternatives in terms of performance?

Best regards,
Aliaksei

  • Value of zarr.__version__: 2.10.3
  • Value of numcodecs.__version__: 0.9.1
  • Version of Python interpreter: 3.8.2
  • Operating system (Linux/Windows/Mac): Windows 7
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): using pip into virtual environment
@joshmoore
Member

zarr-python should work hard not to load the entire array, but it will eagerly load the individual chunks that a selection touches. If you want to defer even that, you might want to look into combining it with dask.
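To make the chunk-level behaviour concrete, here is a minimal sketch in plain Python of which chunks a basic slice overlaps. It is an illustration of the arithmetic only, not zarr-python's actual implementation; the helper name `touched_chunks` and the shape and chunk sizes are assumptions made up for the example.

```python
from itertools import product
from math import ceil


def touched_chunks(selection, shape, chunks):
    """Return grid coordinates of every chunk a basic slice selection overlaps.

    Hypothetical helper for illustration; handles only slices with step 1.
    """
    per_axis = []
    for sl, dim_len, chunk_len in zip(selection, shape, chunks):
        start, stop, step = sl.indices(dim_len)
        if step != 1:
            raise ValueError("this sketch only handles step == 1")
        if stop <= start:
            return []  # empty selection overlaps no chunk
        first = start // chunk_len
        last = (stop - 1) // chunk_len
        per_axis.append(range(first, last + 1))
    return list(product(*per_axis))


# Assumed layout: a 100^3 array stored as 10^3 chunks.
shape, chunks = (100, 100, 100), (10, 10, 10)
selection = (slice(1, 3), slice(1, 3), slice(1, 3))

hit = touched_chunks(selection, shape, chunks)
total = 1
for dim_len, chunk_len in zip(shape, chunks):
    total *= ceil(dim_len / chunk_len)

print(len(hit), "of", total, "chunks overlap the selection")
```

Under these assumptions the selection from the question falls entirely inside one chunk, so only that chunk would need to be read and decompressed. A selection that straddles chunk boundaries on every axis, such as 8:12, would overlap eight chunks.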

The recent release of 2.11 should also allow some slightly fancier indexing: https://zarr.dev/blog/release-2-11/

@shoyer
Contributor

shoyer commented Mar 7, 2022

You might be interested in TensorStore, which can do lazy indexing of Zarr arrays: https://github.com/google/tensorstore

Xarray also has its own lazy indexing that works on top of Zarr (with or without Dask).

@aliaksei-chareshneu
Author

aliaksei-chareshneu commented Mar 8, 2022

> zarr-python should work hard not to load the entire array, but will actively load the individual chunks. If you want to defer even that, you might want to look into combining it with dask.
>
> The recent release of 2.11 should also allow some slightly fancier indexing: https://zarr.dev/blog/release-2-11/

@joshmoore, thank you! I had a look. It seems to be just syntactic sugar (such as dropping 'vindex'), or are there performance benefits too?

@rabernat
Contributor

rabernat commented Mar 8, 2022

This is related to #843.

I would also note that it has been proposed to factor Xarray's lazy indexing classes into a standalone package (pydata/xarray#5081).

joshmoore added the documentation label Dec 2, 2022
@joshmoore
Member

joshmoore commented Dec 2, 2022

Adding the documentation label in case we want to close this by adding pointers to the documentation (and/or tutorial items) on how this can be done with other libraries. If anyone feels there is a feature request looming, please say the word.
