Avoid coercing to numpy in as_shared_dtypes
#8714
else:
    arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
Previously this `asarray` call would coerce to numpy unnecessarily, when all we really wanted was an array type whose `.dtype` attribute we could examine.
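A minimal sketch of the point made in this comment, using a toy stand-in for a duck array: if an object already exposes `.dtype`, that attribute can be read directly without ever triggering conversion to NumPy. `NoCoerceArray`, `looks_like_array`, and `collect_dtypes` are hypothetical names for illustration only and are not part of xarray.

```python
class NoCoerceArray:
    """Hypothetical duck array that refuses conversion to NumPy."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

    def __array__(self, dtype=None, copy=None):
        # Any accidental coercion (e.g. via np.asarray) would end up here.
        raise TypeError("this array must not be coerced to NumPy")


def looks_like_array(obj):
    """Duck-typed check: anything with .dtype and .shape counts as array-like."""
    return hasattr(obj, "dtype") and hasattr(obj, "shape")


def collect_dtypes(objs):
    """Return the dtype of each array-like and the Python type of each scalar.

    Array-likes are only inspected, never converted.
    """
    return [obj.dtype if looks_like_array(obj) else type(obj) for obj in objs]


dtypes_seen = collect_dtypes([NoCoerceArray((2, 3), "int64"), 1.5])
```

Because `collect_dtypes` only reads attributes, the `__array__` method of `NoCoerceArray` is never called.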
xarray/core/duck_array_ops.py
Outdated
    return data
elif hasattr(data, "get_duck_array"):
    # must be a lazy indexing class wrapping a duck array
    return data.get_duck_array()
Will this idea always work? What if it steps down through a lazy decoder class that changes the dtype...
Those should be going through
xarray/xarray/coding/variables.py
Lines 52 to 64 in c9ba2be

    class _ElementwiseFunctionArray(indexing.ExplicitlyIndexedNDArrayMixin):
        """Lazily computed array holding values of elemwise-function.

        Do not construct this object directly: call lazy_elemwise_func instead.

        Values are computed upon indexing or coercion to a NumPy array.
        """

        def __init__(self, array, func: Callable, dtype: np.typing.DTypeLike):
            assert not is_chunked_array(array)
            self.array = indexing.as_indexable(array)
            self.func = func
            self._dtype = dtype

so you should be fine.
I think I'm getting confused as to how this all works now... Don't I want to be computing `as_shared_dtype` using the dtype of the outermost wrapped class? Whereas this will step through all the way to the innermost duckarray, which may have a different dtype?
As of now, `as_shared_dtype` is expected to return pure duck arrays for `stack`, `concatenate`, and `where`.
So that means we need to read from disk, which you do with `to_duck_array`, and all these wrapper layers will be resolved.
It will get more complicated when we do lazy concatenation in Xarray; then we'd need to lazily infer dtypes and apply a lazy `astype`.
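To make the wrapper-layer discussion concrete, here is a rough sketch under stated assumptions. Every class below is a hypothetical toy, not xarray's actual implementation. It shows an outer decoding layer that declares its own dtype and an inner layer that pretends to read stored values; calling `get_duck_array` on the outermost layer resolves every layer, so the result carries the outer (decoded) dtype rather than the stored one.

```python
class ListArray:
    """Toy stand-in for a computational (duck) array."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype


class StoredLayer:
    """Innermost wrapper: pretends to read raw values from disk."""

    def __init__(self, raw, dtype):
        self._raw = raw
        self.dtype = dtype

    def get_duck_array(self):
        return ListArray(self._raw, self.dtype)


class DecodedLayer:
    """Outer wrapper: applies an elementwise function and declares a new dtype."""

    def __init__(self, inner, func, dtype):
        self.inner = inner
        self.func = func
        self.dtype = dtype

    def get_duck_array(self):
        # Resolve the inner layer first, then apply the decoding on top of it.
        resolved = self.inner.get_duck_array()
        return ListArray([self.func(v) for v in resolved.values], self.dtype)


stored = StoredLayer([1, 2, 3], "int16")
decoded = DecodedLayer(stored, lambda v: v * 0.5, "float64")
resolved = decoded.get_duck_array()
```

In this toy setup the dtype reported by the outermost wrapper and the dtype of the resolved array agree, even though the stored values use a different dtype.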
Testing this is confusing me - I want to add an |
xarray/core/pycompat.py
Outdated
from xarray.core.indexing import ExplicitlyIndexed

if isinstance(data, ExplicitlyIndexed):
    return data.get_duck_array()
elif is_duck_array(data):
    return data
else:
    return np.asarray(data)
from xarray.core.duck_array_ops import asarray
Can we use the `to_numpy` in this file instead?
xarray/core/duck_array_ops.py
Outdated
out_type = dtypes.result_type(*arrays)
return [astype(x, out_type, copy=False) for x in arrays]
"""Cast arrays to a shared dtype using xarray's type promotion rules."""
duckarrays = [to_duck_array(obj, xp=xp) for obj in scalars_or_arrays]
This is fine but will force a read from disk. We could add a `dtype` property that forwards to the underlying `array.dtype`.
EDIT: I don't think my comment is right. Since we expect to return duck arrays here, it's OK to just read from disk and create that duck array.
It will get more complicated when we do lazy concatenation in Xarray; then we'd need to lazily infer dtypes and apply a lazy `astype`.
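The `dtype`-forwarding idea floated in this comment (and then set aside, because a duck array has to be returned here anyway) might look roughly like the sketch below. `CountingStore` and `LazyWrapper` are hypothetical toys rather than xarray classes; the read counter exists only to show that asking for the dtype does not load any data.

```python
class CountingStore:
    """Toy backing store that counts how often its data is actually loaded."""

    def __init__(self, values, dtype):
        self._values = values
        self.dtype = dtype
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


class LazyWrapper:
    """Toy lazy wrapper whose dtype forwards to the wrapped object."""

    def __init__(self, store):
        self.store = store

    @property
    def dtype(self):
        # Metadata only: no data is read here.
        return self.store.dtype

    def get_duck_array(self):
        # Resolving the wrapper is what triggers the (simulated) read.
        return self.store.load()


store = CountingStore([1, 2, 3], "int32")
wrapper = LazyWrapper(store)
dtype_before_read = wrapper.dtype
reads_after_dtype = store.reads
data = wrapper.get_duck_array()
reads_after_resolve = store.reads
```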
When you say "read from disk", do you mean calling the `__array__` attribute of the innermost duckarray? Because that's what I'm trying to avoid.
EDIT: Or do you mean resolving all these wrapper layers (either by calling `__array__` or `get_duck_array()`)?
> do you mean calling the `__array__` attribute of the innermost duckarray?

I think our naming convention is that a "duck array" is a "computational array", e.g. numpy or dask, but NOT our explicitly-indexed array classes. The latter wrap duck arrays.
Reading from disk should happen by calling `get_duck_array` on the outermost `ExplicitlyIndexed` class, which should propagate down to `BackendArray`, which reads bytes using either indexing or `np.asarray` (I think).
(related: zarr-developers/zarr-python#1603 (comment))
PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs
> Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

Yes, but my `KerchunkArray` case is interesting because I don't want to use `BackendArray` (I have no use for CopyOnWrite because I'm never loading bytes, nor for lazy indexing, since I can't index into the `KerchunkArray` at all).
> PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs

Yeah that could be helpful actually :) I'm learning a lot right now about a part of xarray I have never had a reason to look at before!
> but my KerchunkArray case is interesting because I don't want to use BackendArray

Well then don't use the backend infrastructure? :P
Haha yes yes
No but seriously I did think about that and I do think that it does make sense to use the backend infrastructure here. I could make my full case, but after all we are still reading from files here, we just aren't reading the bytes inside the chunks.
It's faking to get past the checks in |
Yep, but doing it this way (instead of e.g. defining |
whats-new.rst
New functions/methods are listed in api.rst