[REVIEW] Support of alternative array classes #934
Conversation
Codecov Report

```
@@            Coverage Diff            @@
##              main     #934   +/-   ##
=========================================
  Coverage    99.94%   99.95%
=========================================
  Files           35       36     +1
  Lines        13965    14084   +119
=========================================
+ Hits         13958    14077   +119
  Misses           7        7
```
This is based on <https://github.com/jakirkham/zarr-python/tree/use_array_like>. Co-authored-by: John Kirkham <[email protected]>
Checking against MutableMapping categorizes all BaseStores as in-memory stores.
Thanks Mads! 😄
Have a few initial comments below.
Still thinking through whether there are ways we can improve the `meta_array` interface from the end-user perspective. Curious to get your thoughts on this as well 🙂
```python
@@ -1772,8 +1796,9 @@ def _process_chunk(
                self._dtype != object):

            dest = out[out_selection]
            dest_is_writable = getattr(dest, "writeable", True)
```
Wouldn't we need to check `flags` for `writeable`?
CuPy arrays don't have a `writeable` flag: cupy/cupy#2616

Added:

```python
# Assume that array-like objects that don't have a
# `writeable` flag are writable.
dest_is_writable = getattr(dest, "writeable", True)
```
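A minimal illustration of this fallback (NumPy-only; `ReadOnlyLike` is a made-up stand-in for an array-like that carries an explicit `writeable` attribute):

```python
import numpy as np

# NumPy exposes writeability via `flags`, not a top-level `writeable`
# attribute, so getattr falls back to the default (True) for ndarrays.
a = np.zeros(3)
assert a.flags.writeable
assert getattr(a, "writeable", True) is True  # attribute missing -> default

# An array-like that does carry a `writeable` attribute is respected.
class ReadOnlyLike:
    writeable = False

assert getattr(ReadOnlyLike(), "writeable", True) is False
```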
zarr/core.py (Outdated)

```python
        if not is_scalar(value, self._dtype):
            value = np.asanyarray(value)
            try:
                value = ensure_ndarray(value)
```
What happens if `value` is a `bytes` object both before and after this change?
Before, `bytearray`s are converted to `uint8` arrays and `bytes` are converted to `|S` arrays. Whereas now, both `bytes` and `bytearray`s are converted to a NumPy array of type `uint8`. I don't think this makes any difference in practice.
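For reference, a NumPy-only sketch of the two behaviours described above (`np.asanyarray` for the old path; a buffer-protocol view stands in for the `ensure_ndarray`-style conversion):

```python
import numpy as np

# Old path: bytes become a 0-d fixed-width byte-string ("|S") array.
s = np.asanyarray(b"abc")
assert s.dtype == np.dtype("S3")

# New path (buffer-protocol view): both bytes and bytearray come out
# as uint8 arrays over the same raw memory.
u = np.frombuffer(b"abc", dtype=np.uint8)
ba = np.frombuffer(bytearray(b"abc"), dtype=np.uint8)
assert u.tolist() == ba.tolist() == [97, 98, 99]
```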
zarr/util.py (Outdated)

```python
# TODO: move the function to numcodecs
def ensure_contiguous_ndarray(buf, max_buffer_size=None):
```
Just flagging this so we remember to move it to Numcodecs.
Curious to know what changes were needed here 🙂
It just needs to use the modified `ensure_ndarray`.
```python
@@ -155,6 +161,7 @@ def __init__(
         cache_attrs=True,
         partial_decompress=False,
         write_empty_chunks=True,
+        meta_array=None,
```
I wonder if this could be picked up from the `Store` itself. That way a `Store` could specify its return type (NumPy or otherwise). This would save the user from getting involved as much in the process and hopefully make it easier for them to get started. WDYT?

@grlee77 would be good to get your thoughts as well in light of the `BaseStore` work 🙂
I think it would be a good idea to have `Store` define a default `meta_array`, but in the general case, I think a `Store` should be able to handle both NumPy and other types like CuPy arrays simultaneously.
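A rough sketch of that idea (all names here are hypothetical; neither `_meta_array` nor `resolve_meta_array` exists in zarr-python): the store advertises a default, and an explicit user-supplied `meta_array` overrides it.

```python
import numpy as np

class MemoryStoreSketch(dict):
    """A dict-backed store that advertises NumPy as its default array type."""
    _meta_array = np.empty(())

def resolve_meta_array(store, user_meta_array=None):
    """User-supplied meta_array wins; otherwise use the store's default."""
    if user_meta_array is not None:
        return user_meta_array
    return getattr(store, "_meta_array", np.empty(()))
```

A GPU-backed store could then set `_meta_array = cupy.empty(())`, and users of that store would get CuPy arrays without passing anything themselves.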
cc @d-v-b @joshmoore (as we discussed this last week, it would be good to get your takes as well 🙂)
cc @d70-t (in case you have thoughts here as well 🙂)
Knowing little about the requirements here, just looking at the example, I wonder a couple of things:
Co-authored-by: jakirkham <[email protected]>
I'm with @joshmoore that this example:

z = zarr.array(a, chunks=10, meta_array=cupy.empty(()), compressor=CuPyCPUCompressor(Zlib()))

looks a little odd.

I'm wondering if we should make more of a distinction between

- the compressor specification (the goal of what the bytes should look like in the store) and
- the compressor implementation (the path of how to make and read those bytes specifically).

When thinking of what I'd want to express in this line, I'd say: I want to put my array `a` onto a store (which is implicitly a `dict` in this case), in chunks of 10, and it should be zlib-compressed. I don't really care about how the data moves from `a` to the store (if `a` is a `numpy` array, I'd probably like to use `numcodecs` directly; if `a` is a `cupy` array, I'd like to use GDS and a GPU compressor if available, and fall back to `numcodecs` if not). In that sense, the line should maybe read like `z = zarr.array(a, chunks=10, compressor=Zlib())`.
Afterwards, I'd probably want to get a selection back from the zarr array `z`, and I might want that to live on the GPU (which might be a different choice from what `a` was...). This creates a bit of a problem. I might want to write something like:

selection: cupy.ndarray = z[3:7]

... but Python doesn't (yet?) work this way, so we need to pass the type information differently into `z`. It might be

selection = z.get(slice(3, 7), meta_array=cupy.empty(()))

which is far less beautiful than

selection = z[3:7]
For the last option to work, there must be a way to change the default return type of the `[]` operator, so passing the `meta_array` into `zarr.array` additionally might be a viable option. Still, something like the `get` might be useful for some, if the array might be used on both CPU and GPU.

In any case, up to now, the `CuPyCPUCompressor` (which I consider a description of the path between array and store) would not be visible at any place, and it might also be different for different set/get calls of the same zarr array `z`. It might be given as an optional hint on creation of the array, if a specific path is desired, probably even with a way of overriding it on set or get. In any case, I believe that having any implementation (path) specific information stored in zarr metadata would be a bad thing. In particular, I'd argue that `cupy_cpu_compressor` should maybe not appear in `.zarray` metadata.
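To make the "path" idea concrete, here is a minimal host-side wrapper sketch (illustrative only: the real `CuPyCPUCompressor` in this PR is a numcodecs codec, and the device-to-host copy would be `cupy.asnumpy` rather than `np.asanyarray`; stdlib `zlib` stands in for the wrapped codec so the sketch runs without CuPy or numcodecs):

```python
import zlib
import numpy as np

class CPUWrapperSketch:
    """Moves chunk data to host memory, then delegates to a CPU codec."""

    def compress(self, buf):
        # In the real wrapper this would be cupy.asnumpy(buf).
        host = np.ascontiguousarray(np.asanyarray(buf))
        return zlib.compress(host.tobytes())

    def decompress(self, data):
        # Return raw uint8 bytes; the caller views/copies them into
        # the target array type (NumPy, CuPy, ...).
        return np.frombuffer(zlib.decompress(data), dtype=np.uint8)
```

A round trip gives back the original bytes unchanged, which is all a store needs from the "path" while the "specification" (zlib-compressed chunks) stays the same.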
Agree, it would be a good idea to have
Sorry, I should have been more clear, the
Agree, when writing arrays it should be easy enough to infer the array-like type, but when creating a Zarr array from scratch, we don't have that information.
True, a compressor-wrapper seems to be a relatively lightweight way to implement such GPU-aware stores. But I'm wondering if that's really the API we'll want to end up with. We'll probably also want such compressor-wrappers as a simple way to make more compressor implementations, but maybe we don't want to write them out at this function call. That's about what I meant by separating compressor implementations and compressor specifications. We likely want to create zlib-compressed data on the store (the specification), but might want to use the CuPyCPU(Zlib) implementation for the transformation at that very moment of writing the data.
Do we need that information when creating the array from scratch? The array will probably only contain the metadata keys, and those should maybe be independent of the compressor implementation.
Yes, that's about what I had in mind. It would be great to have arrays of any kind on one side, stores and compression specifications on the other side, and some flexible, auto-optimizing routing in between. I believe that the specific algorithm of how to compress/decompress and move the data in between should belong neither to the store nor to the array type. In particular, it seems to be a totally valid approach to use e.g. a CPU to write data to some (e.g. S3) store and read the data back from the same store later on via GPU.
I just stumbled again across the mentioned entrypoints... It's probably worth a thought whether something roughly like:

```python
compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": Zlib, "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyCPUCompressor(Zlib), "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": CuPyGDSZlib, "priority": 1},
    ...
]
```

could be established via entrypoints? That list would depend on which libraries are present on the system and would help find the right compressor implementation for a given combination of array and codec.... Still, I haven't thought yet about how that might interfere with
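A sketch of how such a registry lookup could work (hypothetical: the compressor values are stand-in strings, and the selection rule — lower priority number wins — is an assumption read off the example above):

```python
# Entries as in the example above; "compressor" values are placeholders.
compressors = [
    {"array": "numpy", "algorithm": "zlib", "compressor": "Zlib", "priority": 1},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyCPUCompressor(Zlib)", "priority": 2},
    {"array": "cupy", "algorithm": "zlib", "compressor": "CuPyGDSZlib", "priority": 1},
]

def pick_compressor(array_kind, algorithm):
    """Pick the most preferred (lowest priority number) matching entry."""
    candidates = [
        c for c in compressors
        if c["array"] == array_kind and c["algorithm"] == algorithm
    ]
    if not candidates:
        raise LookupError(f"no compressor for {array_kind}/{algorithm}")
    return min(candidates, key=lambda c: c["priority"])["compressor"]
```

With CuPy and GDS installed, the GDS entry (priority 1) would shadow the host-roundtrip wrapper (priority 2) for `cupy`/`zlib` lookups.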
@d70-t I agree with your vision, but I think we should move in small steps. I think we can all agree that we need some way to specify the desired array-like type of a selection, e.g. `selection = z.get(slice(3, 7), meta_array=cupy.empty(()))`. And as @joshmoore suggests, we might even make it configurable at a higher level (e.g. through a config file and/or context manager). However, I think this is outside the scope of this PR. The goal here is to avoid force-casting all data to NumPy arrays.
Again, I agree. I can move the implementation of
👍 for moving in small steps. And I also agree with the rest of the post, @madsbk. The only thing vision-wise which I believe should be accounted for right now would be that indications of the (in-memory) array type should not be part of the zarr metadata (and I'm still uncertain whether it would be ok to do so as part of a test; probably that's ok). But that's more a personal feeling, should maybe be up for discussion, and others might have different opinions.
This is ready for another round of reviews cc. @jakirkham, @d70-t

@madsbk: I'm slowly working my way towards a re-review here, but in the meantime, any thoughts on the codecov front?

The codecov errors can be ignored. As far as I can see, all of them are false negatives.

Or is the issue that the cupy tests aren't being run and therefore the

You are absolutely right! No idea how I could miss that :) Adding tests of

Hi @joshmoore, anything you need from my side to make progress here?

Hi @madsbk. No, sorry. This just landed on the vacation ⛰️. I'll start reviewing for a 2.13 release ASAP.
Rolling into a 2.13 pre-release. If anyone has any lingering concerns about the API, there would still be a (small) window to address them in. |
which include zarr-developers/zarr-python#934 Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Benjamin Zaitlen (https://github.com/quasiben) URL: #129
Thanks Mads for working on this and Josh for reviewing! 🙏
This PR implements support of alternative array types (other than NumPy arrays) by introducing a new argument, `meta_array`, that specifies the type/class of the underlying array. The `meta_array` argument can be any class instance that can be used as the `like` argument in NumPy (see NEP 35). Also, this PR implements support of CuPy arrays by introducing a CPU compressor.
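For readers unfamiliar with NEP 35, a small NumPy-only illustration of the mechanism that `meta_array` relies on (requires NumPy >= 1.20; with CuPy installed, passing `like=cupy.empty(())` would make the same calls return CuPy arrays):

```python
import numpy as np

# The "meta" array only carries type information; its shape is irrelevant.
meta_array = np.empty(())

# NEP 35: array-creation functions dispatch on the type of `like`.
chunk = np.empty_like(meta_array, shape=(10,))
filled = np.asarray([1, 2, 3], like=meta_array)

assert type(chunk) is np.ndarray and chunk.shape == (10,)
assert type(filled) is np.ndarray
```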
Example

A common limitation in GPU programming is available GPU memory. In the following, we create a Zarr array of CuPy arrays that are saved to host memory using the `Zlib` codec for compression/decompression.

If we don't want compression but still want to copy to/from host memory, we can do the following:

If the store handles CuPy arrays directly, we can disable compression:
Notice

GPUDirect Storage (GDS) would be able to bypass host memory when copying GPU memory to/from disk.

`ensure_ndarray()` and `ensure_contiguous_ndarray()` now accept CuPy arrays as well as NumPy arrays. Instead, we should properly implement more generic versions of them in numcodecs (see WIP: Allow CuPy numcodecs#212).

TODO
cc. @jakirkham