Pass indexes to the Dataset and DataArray constructors #6392
Comments
I wonder if it would help to have a custom type that unlike
indexes = {xr.combined("lat", "lon"): idx, xr.combined("z", "x", "y"): multi_index}
This would be immediately normalized to:
indexes = {"lat": idx, "lon": idx, "z": multi_index, "x": multi_index, "y": multi_index} |
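The normalization described in this comment could look roughly like the sketch below. The `combined` class and the `normalize_indexes` helper are hypothetical stand-ins written for illustration; neither exists in xarray. A dedicated key type is used here because a plain tuple is itself hashable and so could be mistaken for a single coordinate name.

```python
from typing import Dict, Hashable, Mapping, TypeVar

T = TypeVar("T")


class combined(tuple):
    """Hypothetical key type: a group of coordinate names that share one index."""

    def __new__(cls, *names: Hashable) -> "combined":
        return super().__new__(cls, names)


def normalize_indexes(indexes: Mapping[Hashable, T]) -> Dict[Hashable, T]:
    """Expand every `combined` key into one entry per coordinate name."""
    normalized: Dict[Hashable, T] = {}
    for key, index in indexes.items():
        # A plain (non-combined) key, including a plain tuple, is one name.
        names = key if isinstance(key, combined) else (key,)
        for name in names:
            if name in normalized:
                raise ValueError(f"coordinate {name!r} is assigned more than one index")
            normalized[name] = index
    return normalized


# Placeholder objects stand in for real index instances.
idx, multi_index = object(), object()
indexes = normalize_indexes(
    {combined("lat", "lon"): idx, combined("z", "x", "y"): multi_index}
)
```

After this call every coordinate name maps to its index object, and the names that were grouped together share the same object.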
I realize there's a lot here and I've been out of this thread for a bit, so please forgive any naive questions!
What's the rationale for deprecating this? I think my experience with users of xarray is mostly those coming from pandas; for them interop is quite important. If there's a canonical way of transforming the index, it would be friendlier to do that automatically.
import pandas as pd
import xarray as xr
pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = pd_idx
ds = xr.Dataset(coords={"x": idx})
i.e.
I would have expected the latter, both for
|
Yes, I agree that interoperability with pandas is important. Providing pandas (multi-)indexes via Now that indexes are really distinct from coordinates, I'd rather expect the following behavior for the case of a pandas multi-index:
pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
# converting a pandas multi-index to a numpy array returns the level values as tuples
np.array(pd_idx)
# array([('a', 1), ('a', 2), ('b', 1), ('b', 2)], dtype=object)
# simply passing the index as a coordinate would treat it as array-like, i.e., as numpy does
xr.Dataset(coords={"x": pd_idx})
# <xarray.Dataset>
# Dimensions: (x: 4)
# Coordinates:
# * x (x) object ('a', 1) ('a', 2) ('b', 1) ('b', 2)
# Data variables:
# *empty*
In this specific case, I'd favor consistency with how Numpy handles Pandas indexes over more convenient interoperability with Pandas. The array of tuple elements is not very useful, though. There should be ways to create Xarray objects with Pandas indexes, but I think it's better if we eventually pass them via More generally, I don't know how the ecosystem will evolve in the future (how many custom Xarray indexes?). I wonder to what extent Xarray's API should support special cases for Pandas (multi-)indexes compared to other kinds of indexes. |
Thanks for the thoughtful reply @benbovy (This is a level down and you can make a decision later, so fine if you prefer to push the discussion.) How would we handle creating xarray objects from pandas objects where they have a multi-index? To what extent do you think this is the "standard case" and we could default to it?
idx = xr.PandasMultiIndex(pd_idx, "x")
indexes = {"x": idx, "foo": idx, "bar": idx} |
For For a
import pandas as pd
import xarray as xr
from xarray.indexes import PandasMultiIndex
pd_idx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("foo", "bar"))
idx = PandasMultiIndex(pd_idx, "x")
indexes = {"x": idx, "foo": idx, "bar": idx}
coords = idx.create_variables()
ds = xr.Dataset(coords=coords, indexes=indexes)
For more convenience, we could add a class method to
# this calls PandasMultiIndex.__init__() and PandasMultiIndex.create_variables() internally
indexes, coords = PandasMultiIndex.from_pandas_index(pd_idx, "x")
ds = xr.Dataset(coords=coords, indexes=indexes)
Instead of
xmidx = PandasMultiIndex.from_pandas_index(pd_idx, "x")
ds = xr.Dataset(coords=xmidx.variables, indexes=xmidx)
For even more convenience, I think it might be reasonable to support special handling of
# both cases below will implicitly add the coordinates found in `xmidx`
# (if there's no conflict with other coordinates)
ds = xr.Dataset(indexes=xmidx)
ds2 = xr.Dataset()
ds2.update(xmidx)
The same approach could be used for |
I'm thinking of only accepting one or more instances of Indexes as
|
Is your feature request related to a problem?
This is part of #6293 (explicit indexes next steps).
Describe the solution you'd like
A Mapping[Hashable, Index] would probably be the most obvious (optional) value type accepted for the indexes argument of the Dataset and DataArray constructors.
pros:
xindexes property
cons:
coords and indexes
An example with a pandas multi-index
Currently a pandas multi-index may be passed directly as one (dimension) coordinate; it is then "unpacked" into one dimension (tuple values) coordinate and one or more level coordinates. I would suggest deprecating this behavior in favor of a more explicit (although more verbose) way to pass an existing pandas multi-index:
The cases below should raise an error:
Should we raise an error or simply ignore the index in the case below?
Should we silently reorder the coordinates and/or indexes when the levels are not passed in the right order? It seems odd requiring mapping elements be passed in a given order.
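To make the two open questions above concrete, here is a minimal sketch of one possible answer: raise when a level is missing, and silently reorder otherwise. The helper `reorder_to_levels` is hypothetical and works on plain mappings; it is not how xarray's constructors behave, only an illustration of the choice being discussed.

```python
from typing import Any, Dict, Hashable, Mapping, Sequence


def reorder_to_levels(
    dim: Hashable,
    level_names: Sequence[Hashable],
    mapping: Mapping[Hashable, Any],
) -> Dict[Hashable, Any]:
    """Return `mapping` with the dimension first, then the levels in index order.

    Raises ValueError if the dimension or any level is missing. Entries that
    do not belong to the multi-index keep their relative order at the end.
    """
    expected = [dim, *level_names]
    missing = [name for name in expected if name not in mapping]
    if missing:
        raise ValueError(f"missing entries for multi-index {dim!r}: {missing}")

    ordered: Dict[Hashable, Any] = {name: mapping[name] for name in expected}
    for name, value in mapping.items():
        if name not in ordered:
            ordered[name] = value
    return ordered


# Levels passed out of order are silently put back in index order.
idx = object()  # placeholder for a multi-index object
indexes = reorder_to_levels("x", ["foo", "bar"], {"bar": idx, "x": idx, "foo": idx})
```

With this choice the caller never has to care about the order of mapping elements, which addresses the concern that requiring a given order seems odd.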
How to generalize to any (custom) index?
With the case of multi-index, it is pretty easy to check whether the coordinates and indexes are consistent because we ensure consistent pd_idx.names vs. coordinate names and because idx.get_variables() returns Xarray IndexVariable objects where variable data wraps the pandas multi-index.
However, this may not be easy for other indexes. Some Xarray custom indexes (like a KD-Tree index) likely won't return anything from .get_variables() as they don't support wrapping internal data as coordinate data. Right now there's nothing in the Xarray Index base class that could help checking consistency between indexes vs. coordinates for any kind of index.
How could we solve this?
A. add a .coords property to the Xarray Index base class, that returns a dict[Hashable, IndexVariable].
xr.PandasMultiIndex(pd_idx, "x"). Should .coords return None and return the coordinates returned by the last .get_variables() call?
B. add a .coord_names property to the Xarray Index base class that returns tuple[Hashable, ...], and add a private attribute to IndexVariable that returns the index object (or return it via a very lightweight IndexAdapter base class used to wrap variable data).
Index.get_variables(variables) would by default return shallow copies of the input variables with a reference to the index object.
coord_names, i.e., using tuple[tuple[Hashable, tuple[Hashable, ...]], ...].
I think I prefer the second option.
Describe alternatives you've considered
Also allow passing index types (and build options) via indexes
I.e., Mapping[Hashable, Index | Type[Index] | tuple[Type[Index], Mapping[Any, Any]]], so that new indexes can be created from the passed coordinates at DataArray or Dataset creation.
pros:
cons:
.set_index is probably better.
Pass multi-indexes once, grouped by coordinate names
I.e., indexes keys accept tuples: Mapping[Hashable | tuple[Hashable, ...], Index]
pros:
cons:
.xindexes property
Additional context
No response