Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid coercing to numpy in as_shared_dtypes #8714

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Feb 6, 2024

else:
arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this asarray call would coerce to numpy unnecessarily, when all we really wanted was an array type that we could examine the .dtype attribute of.

return data
elif hasattr(data, "get_duck_array"):
# must be a lazy indexing class wrapping a duck array
return data.get_duck_array()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this idea always work? What if it steps down through a lazy decoder class that changes the dtype...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those should be going through

class _ElementwiseFunctionArray(indexing.ExplicitlyIndexedNDArrayMixin):
"""Lazily computed array holding values of elemwise-function.
Do not construct this object directly: call lazy_elemwise_func instead.
Values are computed upon indexing or coercion to a NumPy array.
"""
def __init__(self, array, func: Callable, dtype: np.typing.DTypeLike):
assert not is_chunked_array(array)
self.array = indexing.as_indexable(array)
self.func = func
self._dtype = dtype

so you should be fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm getting confused as to how this all works now... Don't I want to be computing as_shared_dtype using the dtype of the outermost wrapped class? Whereas this will step through all the way to the innermost duckarray, which may have a different dtype?

Copy link
Contributor

@dcherian dcherian Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, as_shared_dtype is expected to return pure duck arrays for stack, concatenate, and where.

So that means we need to read from disk, which you do with to_duck_array and all these wrapper layers will be resolved.

It will get more complicated when we do lazy concatenation in Xarray, then we'd need to lazily infer dtypes and apply a lazy astype.

@TomNicholas TomNicholas added the topic-arrays related to flexible array support label Feb 6, 2024
@TomNicholas
Copy link
Member Author

Testing this is confusing me - I want to add an xr.Variable.concat to test_dataset.py::TestDataset:test_lazy_load_duckarray but the existing DuckArrayWrapper class does a funny trick where it defines __array_namespace__ but doesn't actually implement the xp namespace...

from xarray.core.indexing import ExplicitlyIndexed

if isinstance(data, ExplicitlyIndexed):
return data.get_duck_array()
elif is_duck_array(data):
return data
else:
return np.asarray(data)
from xarray.core.duck_array_ops import asarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the to_numpy in this file instead?

out_type = dtypes.result_type(*arrays)
return [astype(x, out_type, copy=False) for x in arrays]
"""Cast arrays to a shared dtype using xarray's type promotion rules."""
duckarrays = [to_duck_array(obj, xp=xp) for obj in scalars_or_arrays]
Copy link
Contributor

@dcherian dcherian Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine but will force a read from disk. We could add a dtype property that forwards to the underlying array.dtype

EDIT: I don't think my comment is right, since we expect to return duck arrays here, it's ok to just read from disk and create that duck array.

It will get more complicated when we do lazy concatenation in Xarray, then we'd need to lazily infer dtypes and apply a lazy astype.

Copy link
Member Author

@TomNicholas TomNicholas Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say "read from disk" do you meaning calling the __array__ attribute of the innermost duckarray? Because that's what I'm trying to avoid.

EDIT: Or you mean resolving all these wrapper layers (either by calling __array__ or get_duck_array())?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you meaning calling the array attribute of the innermost duckarray?

I think our naming convention is that "duck array" is a "computational array" e.g.numpy, dask but NOT our explicitly-indexed array classes. The latter wrap duck arrays.

Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

(related : zarr-developers/zarr-python#1603 (comment))

PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

Yes, but my KerchunkArray case is interesting because I don't want to use BackendArray (I have no use for CopyOnWrite because I'm never loading bytes, nor for Lazy indexing (I can't index into the KerchunkArray at all).

PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs

Yeah that could be helpful actually :) I'm learning a lot right now about a part of xarray I have never had a reason to look at before!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but my KerchunkArray case is interesting because I don't want to use BackendArray

Well then don't use the backend infrastructure? :P

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha yes yes

No but seriously I did think about that and I do think that it does make sense to use the backend infrastructure here. I could make my full case, but after all we are still reading from files here, we just aren't reading the bytes inside the chunks.

@dcherian
Copy link
Contributor

dcherian commented Feb 6, 2024

class does a funny trick where it defines array_namespace but doesn't actually implement the xp namespace...

It's faking to get past the checks in as_compatible_data.

@TomNicholas
Copy link
Member Author

It's faking to get past the checks in as_compatible_data.

Yep, but doing it this way (instead of e.g. defining __array_function__) is violating the contract the array API represents, so I'm finding that get_array_namespace gets called on it (and causes errors after not finding the namespace). I haven't gotten to the bottom of this yet though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants