Avoid coercing to numpy in `as_shared_dtypes` #8714

TomNicholas · 2024-02-06T09:35:22Z

- Solves the problem in "Only use CopyOnWriteArray wrapper on BackendArrays" #8712 (comment)
- Tests added
- User visible changes (including notable bug fixes) are documented in whats-new.rst
- ~~New functions/methods are listed in api.rst~~

TomNicholas · 2024-02-06T09:37:08Z

xarray/core/duck_array_ops.py

    else:
-        arrays = [asarray(x, xp=xp) for x in scalars_or_arrays]


Previously this asarray call would unnecessarily coerce to numpy, when all we really wanted was an array type whose .dtype attribute we could examine.
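A minimal sketch of the point, not xarray's actual code: `NoCoerceArray` and `shared_dtype` are hypothetical names, and `np.result_type` stands in for xarray's own `dtypes.result_type` promotion rules. It shows that a common dtype can be worked out from `.dtype` alone, without ever triggering `__array__`.

```python
import numpy as np


class NoCoerceArray:
    """Hypothetical duck array that must never be converted to numpy."""

    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

    def __array__(self, dtype=None, copy=None):
        raise RuntimeError("coercion to numpy is not allowed")


def shared_dtype(arrays):
    # Only the .dtype attribute is inspected, so __array__ is never called.
    return np.result_type(*(a.dtype for a in arrays))


arrays = [NoCoerceArray((3,), "int32"), NoCoerceArray((3,), "float32")]
result = shared_dtype(arrays)
```

Under NumPy's promotion rules, `int32` and `float32` promote to `float64`; xarray's rules may differ for some dtypes.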

TomNicholas · 2024-02-06T09:41:35Z

xarray/core/duck_array_ops.py

+        return data
+    elif hasattr(data, "get_duck_array"):
+        # must be a lazy indexing class wrapping a duck array
+        return data.get_duck_array()


Will this idea always work? What if it steps down through a lazy decoder class that changes the dtype...

Those should be going through

xarray/xarray/coding/variables.py

Lines 52 to 64 in c9ba2be

    class _ElementwiseFunctionArray(indexing.ExplicitlyIndexedNDArrayMixin):
        """Lazily computed array holding values of elemwise-function.

        Do not construct this object directly: call lazy_elemwise_func instead.

        Values are computed upon indexing or coercion to a NumPy array.
        """

        def __init__(self, array, func: Callable, dtype: np.typing.DTypeLike):
            assert not is_chunked_array(array)
            self.array = indexing.as_indexable(array)
            self.func = func
            self._dtype = dtype

so you should be fine.

I think I'm getting confused as to how this all works now... Don't I want to be computing as_shared_dtype using the dtype of the outermost wrapped class? Whereas this will step through all the way to the innermost duckarray, which may have a different dtype?

As of now, as_shared_dtype is expected to return pure duck arrays for stack, concatenate, and where.

So that means we need to read from disk, which you do with to_duck_array and all these wrapper layers will be resolved.

It will get more complicated when we do lazy concatenation in Xarray, then we'd need to lazily infer dtypes and apply a lazy astype.
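To make the dtype question concrete, here is a simplified sketch. `LazyElementwise` is a hypothetical stand-in loosely modelled on `_ElementwiseFunctionArray`, not the real class. It assumes the wrapper's declared dtype matches what its function produces, in which case resolving the wrapper yields a duck array with the outer (decoded) dtype rather than the inner one.

```python
import numpy as np


class LazyElementwise:
    """Simplified, hypothetical stand-in for a lazy decoding wrapper."""

    def __init__(self, array, func, dtype):
        self.array = array
        self.func = func
        self._dtype = np.dtype(dtype)

    @property
    def dtype(self):
        # The dtype after decoding, which may differ from self.array.dtype.
        return self._dtype

    def get_duck_array(self):
        # Resolving the wrapper applies the function, so the result carries
        # the decoded dtype rather than the stored one.
        return self.func(self.array)


raw = np.array([1, 2, 3], dtype="int16")
decoded = LazyElementwise(raw, lambda a: a.astype("float64") * 0.1, "float64")
resolved = decoded.get_duck_array()
```

Here `raw.dtype` is `int16`, while both `decoded.dtype` and `resolved.dtype` are `float64`, so a dtype computed after resolving the layers agrees with the outermost wrapper.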

TomNicholas · 2024-02-06T17:30:54Z

Testing this is confusing me - I want to add an xr.Variable.concat to test_dataset.py::TestDataset::test_lazy_load_duckarray but the existing DuckArrayWrapper class does a funny trick where it defines __array_namespace__ but doesn't actually implement the xp namespace...

dcherian · 2024-02-06T17:37:05Z

xarray/core/pycompat.py

    from xarray.core.indexing import ExplicitlyIndexed

    if isinstance(data, ExplicitlyIndexed):
        return data.get_duck_array()
    elif is_duck_array(data):
        return data
    else:
-        return np.asarray(data)
+        from xarray.core.duck_array_ops import asarray


Can we use the to_numpy in this file instead?

dcherian · 2024-02-06T17:38:16Z

xarray/core/duck_array_ops.py

-    out_type = dtypes.result_type(*arrays)
-    return [astype(x, out_type, copy=False) for x in arrays]
+    """Cast arrays to a shared dtype using xarray's type promotion rules."""
+    duckarrays = [to_duck_array(obj, xp=xp) for obj in scalars_or_arrays]


This is fine but will force a read from disk. We could add a dtype property that forwards to the underlying array.dtype

EDIT: I don't think my comment is right, since we expect to return duck arrays here, it's ok to just read from disk and create that duck array.

It will get more complicated when we do lazy concatenation in Xarray, then we'd need to lazily infer dtypes and apply a lazy astype.
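As a rough illustration of the "resolve, then cast" behaviour described here, the sketch below uses plain NumPy. `as_shared_dtype_sketch` is a hypothetical name, and `np.result_type` is assumed as a stand-in for xarray's own promotion rules; the inputs are taken to be already-resolved duck arrays.

```python
import numpy as np


def as_shared_dtype_sketch(arrays):
    """Cast already-resolved duck arrays to one common dtype."""
    out_type = np.result_type(*arrays)
    # copy=False avoids copying arrays that already have the target dtype.
    return [a.astype(out_type, copy=False) for a in arrays]


a = np.arange(3, dtype="int32")
b = np.ones(3, dtype="float64")
cast_a, cast_b = as_shared_dtype_sketch([a, b])
```

With `copy=False`, `b` is returned unchanged because it already has the shared dtype, while `a` is cast to `float64`.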

When you say "read from disk" do you mean calling the __array__ attribute of the innermost duckarray? Because that's what I'm trying to avoid.

EDIT: Or do you mean resolving all these wrapper layers (either by calling __array__ or get_duck_array())?

> do you mean calling the __array__ attribute of the innermost duckarray?

I think our naming convention is that a "duck array" is a "computational array", e.g. numpy or dask, but NOT one of our explicitly-indexed array classes. The latter wrap duck arrays.

Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

(related : zarr-developers/zarr-python#1603 (comment))

PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs
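A small sketch of the delegation chain described above may help. Both classes are hypothetical stand-ins (not xarray's `ExplicitlyIndexed` or `BackendArray`), and the `reads` counter is only there to show where the "read from disk" would happen.

```python
import numpy as np


class FakeBackendArray:
    """Hypothetical innermost layer; the only place bytes are read."""

    def __init__(self, values):
        self._values = values
        self.reads = 0

    def get_duck_array(self):
        self.reads += 1  # stands in for the actual read from disk
        return np.asarray(self._values)


class PassThroughWrapper:
    """Hypothetical stand-in for an explicitly indexed wrapper class."""

    def __init__(self, array):
        self.array = array

    def get_duck_array(self):
        # Delegate to the wrapped layer until a real duck array comes back.
        return self.array.get_duck_array()


backend = FakeBackendArray([1.0, 2.0, 3.0])
outermost = PassThroughWrapper(PassThroughWrapper(backend))
duck = outermost.get_duck_array()
```

Calling `get_duck_array()` once on the outermost wrapper triggers exactly one read at the innermost layer and returns a plain duck array.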

> Read from disk should be happening by calling get_duck_array on the outermost ExplicitlyIndexed class, which should propagate down to BackendArray which reads bytes using either indexing or np.asarray (I think).

Yes, but my KerchunkArray case is interesting because I don't want to use BackendArray (I have no use for CopyOnWrite because I'm never loading bytes, nor for lazy indexing, since I can't index into the KerchunkArray at all).

> PS: We could chat on a call some time if you want. It's all quite confusing :) This is a good opportunity to add some comments/docs for devs

Yeah that could be helpful actually :) I'm learning a lot right now about a part of xarray I have never had a reason to look at before!

> but my KerchunkArray case is interesting because I don't want to use BackendArray

Well then don't use the backend infrastructure? :P

Haha yes yes

No but seriously I did think about that and I do think that it does make sense to use the backend infrastructure here. I could make my full case, but after all we are still reading from files here, we just aren't reading the bytes inside the chunks.

dcherian · 2024-02-06T18:53:03Z

> class does a funny trick where it defines __array_namespace__ but doesn't actually implement the xp namespace...

It's faking to get past the checks in as_compatible_data.

TomNicholas · 2024-02-06T18:58:52Z

> It's faking to get past the checks in as_compatible_data.

Yep, but doing it this way (instead of e.g. defining __array_function__) is violating the contract the array API represents, so I'm finding that get_array_namespace gets called on it (and causes errors after not finding the namespace). I haven't gotten to the bottom of this yet though.
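The mismatch can be sketched as follows. Everything here is hypothetical: `FakeNamespaceArray` is not the real `DuckArrayWrapper`, and the exact failure mode in the test suite is not reproduced. The point is only that an attribute check passes while any consumer that actually asks for the namespace fails.

```python
import numpy as np


class FakeNamespaceArray:
    """Hypothetical wrapper that advertises a namespace it cannot supply."""

    def __init__(self, array):
        self.array = array
        self.dtype = array.dtype

    def __array_namespace__(self, api_version=None):
        # Present only so that attribute checks pass; nothing to return.
        raise NotImplementedError("no array API namespace implemented")


def looks_like_array_api(obj):
    # An attribute check alone cannot tell whether the namespace is usable.
    return hasattr(obj, "__array_namespace__")


def get_namespace(obj):
    # Consumers that trust the attribute will call it and hit the error.
    return obj.__array_namespace__()


wrapped = FakeNamespaceArray(np.arange(3))
passes_check = looks_like_array_api(wrapped)
try:
    get_namespace(wrapped)
    namespace_ok = True
except NotImplementedError:
    namespace_ok = False
```

In this sketch `passes_check` is `True` while `namespace_ok` is `False`, which is the kind of inconsistency that surfaces once code paths start calling the namespace.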

TomNicholas added 2 commits February 6, 2024 04:33

extract dtypes from underlying duck arrays without coercing to numpy

c6f4e3a

remove print statements

1467c4c

TomNicholas commented Feb 6, 2024

use array namespace again

d9931ef

TomNicholas commented Feb 6, 2024

TomNicholas added the topic-arrays (related to flexible array support) label Feb 6, 2024

TomNicholas added 4 commits February 6, 2024 11:49

use pycompat.to_duck_array instead

c067f7d

a sprinkle of type hints

5092aaa

whatsnew

a884ba8

Merge branch 'main' into dont-coerce-to-numpy-in-as-shared-dtype

86e6bf8

dcherian reviewed Feb 6, 2024

This was referenced Feb 6, 2024

Only use CopyOnWriteArray wrapper on BackendArrays #8712

Open

Opt out of auto creating index variables #8711

Merged

ilan-gold mentioned this pull request Feb 12, 2024

(feat): Support for pandas ExtensionArray #8723

Merged

TomNicholas added 2 commits March 24, 2024 20:27

Merge branch 'main' into dont-coerce-to-numpy-in-as-shared-dtype

630629c

don't compute in as_shared_dtype

45808d8

TomNicholas added 3 commits March 28, 2024 13:07

Merge branch 'main' into dont-coerce-to-numpy-in-as-shared-dtype

e625c67

ConcatenatableArray class

6833d66

test that no coercion occurs using ConcatenatableArray

bcd02bf
