Pass indexes to the Dataset and DataArray constructors #6392

benbovy · 2022-03-21T12:41:51Z

Is your feature request related to a problem?

This is part of #6293 (explicit indexes next steps).

Describe the solution you'd like

A Mapping[Hashable, Index] would probably be the most obvious (optional) value type accepted for the indexes argument of the Dataset and DataArray constructors.

pros:

consistent with the xindexes property

cons:

need to be careful with what is passed as coords and indexes
multi-indexes: redundancy and order matters (e.g., pandas multi-index levels)

An example with a pandas multi-index

Currently a pandas multi-index may be passed directly as one (dimension) coordinate ; it is then "unpacked" into one dimension (tuple values) coordinate and one or more level coordinates. I would suggest depreciating this behavior in favor of a more explicit (although more verbose) way to pass an existing pandas multi-index:

import pandas as pd
import xarray as xr

pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = xr.PandasMultiIndex(pd_idx, "x")

indexes = {"x": idx, "foo": idx, "bar": idx}
coords = idx.create_variables()

ds = xr.Dataset(coords=coords, indexes=indexes)

The cases below should raise an error:

ds = xr.Dataset(indexes=indexes)
# ValueError: missing coordinate(s) for index(es): 'x', 'foo', 'bar'

ds = xr.Dataset(
    coords=coords,
    indexes={"x": idx, "foo": idx},
)
# ValueError: missing index(es) for coordinate(s): 'bar'

ds = xr.Dataset(
    coords={"x": coords["x"], "foo": [0, 1, 2, 3], "bar": coords["bar"]},
    indexes=indexes,
)
# ValueError: conflict between coordinate(s) and index(es): 'foo'

ds = xr.Dataset(
    coords=coords,
    indexes={"x": idx, "foo": idx, "bar": xr.PandasIndex([0, 1, 2], "y")},
)
# ValueError: conflict between coordinate(s) and index(es): 'bar'

Should we raise an error or simply ignore the index in the case below?

ds = xr.Dataset(coords=coords)
# ValueError: missing index(es) for coordinate(s): 'x', 'foo', 'bar'
# or
# create unindexed coordinates 'foo' and 'bar' and a 'x' coordinate with a single pandas index

Should we silently reorder the coordinates and/or indexes when the levels are not passed in the right order? It seems odd requiring mapping elements be passed in a given order.

ds = xr.Dataset(coords=coords, indexes={"bar": idx, "x": idx, "foo": idx})
list(ds.xindexes.keys())
# ["x", "foo", "bar"]

How to generalize to any (custom) index?

With the case of multi-index, it is pretty easy to check whether the coordinates and indexes are consistent because we ensure consistent pd_idx.names vs. coordinate names and because idx.get_variables() returns Xarray IndexVariable objects where variable data wraps the pandas multi-index.

However, this may not be easy for other indexes. Some Xarray custom indexes (like a KD-Tree index) likely won't return anything from .get_variables() as they don't support wrapping internal data as coordinate data. Right now there's nothing in the Xarray Index base class that could help checking consistency between indexes vs. coordinates for any kind of index.

How could we solve this?

A. add a .coords property to the Xarray Index base class, that returns a dict[Hashable, IndexVariable].
- Ambiguous when an Index is created directly, i.e., like above xr.PandasMultiIndex(pd_idx, "x"). Should .coords return None and return the coordinates returned by the last .get_variables() call?
- What if different sets of coordinates refer to a common index (e.g., after copying the coordinate variables, etc.)?
B. add a .coord_names property to the Xarray Index base class that returns tuple[Hashable, ...], and add a private attribute to IndexVariable that returns the index object (or return it via a very lightweight IndexAdapter base class used to wrap variable data).
- Index.get_variables(variables) would by default return shallow copies of the input variables with a reference to the index object.
- If that's necessary, we could also store the coordinate dimensions in coord_names, i.e., using tuple[tuple[Hashable, tuple[Hashable, ...]], ...].

I think I prefer the second option.

Describe alternatives you've considered

Also allow passing index types (and build options) via `indexes`

I.e., Mapping[Hashable, Index | Type[Index] | tuple[TypeIndex, Mapping[Any, Any]]], so that new indexes can be created from the passed coordinates at DataArray or Dataset creation.

pros:

Flexible.

cons:

This is complicated. Constructing the Dataset / DataArray (with default indexes) first then calling .set_index is probably better.
Hard to deal with multi-index (redundancy of build option, etc.)

Pass multi-indexes once, grouped by coordinate names

I.e., indexes keys accept tuples: Mapping[Hashable | tuple[Hashable, ...], Index]

pros:

No redundancy and easier to check consistency between indexes vs. coordinates

cons:

Not consistent with the .xindexes property
Complicated when eventually using tuples for coordinate names?

Additional context

No response

The text was updated successfully, but these errors were encountered:

keewis · 2022-03-27T17:39:59Z

I wonder if it would help to have a custom type that unlike tuple is invalid for coordinates / data variables, but allows to reduce the redundancy? E.g.

indexes = {xr.combined("lat", "lon"): idx, xr.combined("z", "x", "y"): multi_index})

This would be immediately normalized to:

indexes = {"lat": idx, "lon": idx, "z": multi_index, "x": multi_index, "y": multi_index}

max-sixty · 2022-03-27T19:54:44Z

I realize there's a lot here and I've been out of this thread for a bit, so please forgive any naive questions!

I would suggest depreciating this behavior in favor of a more explicit (although more verbose) way to pass an existing pandas multi-index:

What's the rationale for deprecating this? I think my experience with users of xarray is mostly those coming from pandas; for them interop is quite important. If there's a canonical way of transforming the index, it would be friendlier to do that automatically.

import pandas as pd
import xarray as xr

pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = pd_idx

ds = xr.Dataset(coords={"x": idx})

i.e.

ds = xr.Dataset(coords=coords)
# ValueError: missing index(es) for coordinate(s): 'x', 'foo', 'bar'
# or
# create unindexed coordinates 'foo' and 'bar' and a 'x' coordinate with a single pandas index

I would have expected the later, both for coords=coords and for coords=pd_idx (again, with the disclaimer that I may be missing crucial parts of the puzzle here).

Should we silently reorder the coordinates and/or indexes when the levels are not passed in the right order? It seems odd requiring mapping elements be passed in a given order.

👍

benbovy · 2022-03-28T14:38:13Z

What's the rationale for deprecating this? I think my experience with users of xarray is mostly those coming from pandas; for them interop is quite important.

Yes I agree that interoperability with pandas is important. Providing pandas (multi-)indexes via coords is convenient and worked pretty well so far because (1) indexes and dimension coordinates were not clearly distinct concepts and (2) multi-index levels were not "real" coordinates. However, this is not the case anymore.

Now that indexes are really distinct from coordinates, I'd rather expect the following behavior for the case of pandas multi-index:

pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))

# convert a pandas multi-index to a numpy array returns level values as tuples
np.array(pd_idx)
# array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object)

# simply pass the index as a coordinate would treat it as an array-like, i.e., like numpy does
xr.Dataset(coords={"x": pd_idx})
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2)
# Data variables:
#     *empty*

In this specific case, I'd favor consistency with how Numpy handles Pandas indexes over more convenient interoperability with Pandas. The array of tuple elements is not very useful, though. There should be ways to create Xarray objects with Pandas indexes, but I think it's better if we eventually pass them via indexes instead of via coords, or via both indexes and coords even if that's slightly less convenient.

More generally, I don't know how will evolve the ecosystem in the future (how many custom Xarray indexes?). I wonder to which point in Xarray's API we should support special cases for Pandas (multi-)indexes compared to other kinds of indexes.

max-sixty · 2022-03-30T00:32:48Z

Thanks for the thoughtful reply @benbovy

(This is a level down and you can make a decision later, so fine if you prefer to push the discussion.)

How would we handle creating xarray objects from pandas objects where they have a multiindex?

To what extent do you think this is this the "standard case" and we could default to it?

idx = xr.PandasMultiIndex(pd_idx, "x")
indexes = {"x": idx, "foo": idx, "bar": idx}

benbovy · 2022-09-28T09:13:00Z

How would we handle creating xarray objects from pandas objects where they have a multiindex?

For pandas.Series / pandas.DataFrame objects, DataArray.from_series() / Dataset.from_dataframe() already expand multi-index levels as dimensions.

For a pandas.MultiIndex, we could do like below but it is a bit tedious:

import pandas as pd
import xarray as xr
from xarray.indexes import PandasMultiIndex

pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = PandasMultiIndex(pd_idx, "x")

indexes = {"x": idx, "foo": idx, "bar": idx}
coords = idx.create_variables()

ds = xr.Dataset(coords=coords, indexes=indexes)

For more convenience, we could add a class method to PandasMultiIndex, e.g.,

# this calls PandasMultiIndex.__init__() and PandasMultiIndex.create_variables() internally
indexes, coords = PandasMultiIndex.from_pandas_index(pd_idx, "x")

ds = xr.Dataset(coords=coords, indexes=indexes)

Instead of indexes, coords raw dictionaries, we could return an instance of the Indexes class (also returned by Dataset.xindexes), which encapsulates the coordinate variables:

xmidx = PandasMultiIndex.from_pandas_index(pd_idx, "x")

ds = xr.Dataset(coords=xmidx.variables, indexes=xmidx)

For even more convenience, I think it might be reasonable to support special handling of Indexes instances given in Dataset / DataArray constructors and in .update(), i.e.,

# both cases below will implicitly add the coordinates found in `xmidx`
# (if there's no conflict with other coordinates)

ds = xr.Dataset(indexes=xmidx)

ds2 = xr.Dataset()
ds2.update(xmidx)

The same approach could be used for pandas.IntervalIndex (as discussed in #4579).

benbovy · 2022-10-25T12:19:52Z

I'm thinking of only accepting one or more instances of Indexes as indexes argument in the Dataset and DataArray constructors. The only exception is when fastpath=True a mapping can be given directly.

It is much easier to handle: just check that keys returned by Indexes.variables do no conflict with the coordinate names in the coords argument
It is slightly safer: it requires the user to explicitly create an Indexes object, thus with less chance to accidentally provide coordinate variables and index objects that do not relate to each other (we could probably add some safe guards in the Indexes class itself)
It is more convenient: an Xarray Index may provide a factory method that returns an instance of Indexes that we just need to pass as indexes

benbovy added enhancement topic-indexing labels Mar 21, 2022

forman mentioned this issue May 5, 2022

32- vs 64-bit coordinates coordinates in where() #6573

Open

4 tasks

benbovy mentioned this issue May 25, 2022

Opening dataset without loading any indexes? #6633

Open

keewis mentioned this issue Jul 16, 2022

display the indexes in the string reprs #6795

Merged

2 tasks

benbovy mentioned this issue Oct 3, 2022

Fix to_index(): return multiindex level as single index #7105

Merged

3 tasks

This was referenced Oct 25, 2022

Explicit indexes: next steps #6293

Open

Pass indexes directly to the DataArray and Dataset constructors #7214

Closed

benbovy mentioned this issue Dec 8, 2022

Expose "Coordinates" as part of Xarray's public API #7368

Merged

8 tasks

dcherian closed this as completed in #7368 Jul 21, 2023

benbovy mentioned this issue Sep 7, 2023

Deprecate passing pd.MultiIndex implicitly #8140

Open

2 tasks

jklenzing mentioned this issue Sep 8, 2023

ENH: allow custom concat_data methods pysat/pysat#1131

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pass indexes to the Dataset and DataArray constructors #6392

Pass indexes to the Dataset and DataArray constructors #6392

benbovy commented Mar 21, 2022 •

edited

Loading

keewis commented Mar 27, 2022

max-sixty commented Mar 27, 2022

benbovy commented Mar 28, 2022

max-sixty commented Mar 30, 2022

benbovy commented Sep 28, 2022 •

edited

Loading

benbovy commented Oct 25, 2022

Pass indexes to the Dataset and DataArray constructors #6392

Pass indexes to the Dataset and DataArray constructors #6392

Comments

benbovy commented Mar 21, 2022 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

An example with a pandas multi-index

How to generalize to any (custom) index?

Describe alternatives you've considered

Also allow passing index types (and build options) via indexes

Pass multi-indexes once, grouped by coordinate names

Additional context

keewis commented Mar 27, 2022

max-sixty commented Mar 27, 2022

benbovy commented Mar 28, 2022

max-sixty commented Mar 30, 2022

benbovy commented Sep 28, 2022 • edited Loading

benbovy commented Oct 25, 2022

benbovy commented Mar 21, 2022 •

edited

Loading

Also allow passing index types (and build options) via `indexes`

benbovy commented Sep 28, 2022 •

edited

Loading