-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor swap dims #8911
base: main
Are you sure you want to change the base?
Refactor swap dims #8911
Conversation
I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables. Would we want to recreate a user's custom index with this swap_dims op? Probably not right? |
Does this fix #8914 |
Yes this is a good model. That's what causes a lot of trouble since the index refactor as we transitioned from a dimension coordinate - with virtual level coordinates - to the model that you describe here.
No I don't think we want this. In both import pandas as pd
import xarray as xr
midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["foo", "bar"])
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
coords["y"] = ("x", [1, 2, 3, 4])
ds = xr.Dataset(coords=coords)
ds.swap_dims(x="y")
# main: multi-index "x", "foo", "bar" is dropped (although the repr still shows "MultiIndex" for "x" values)
# this PR: ValueError, cannot remove index "x" which would corrupt index built from coordinates "x", "foo", "bar" The error raised in this PR might make sense, or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index. However, like the Dataset constructor and a few other methods, ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})
ds2
# <xarray.Dataset> Size: 64B
# Dimensions: (x: 4)
# Coordinates:
# y (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
# * x (x) int64 32B 1 2 3 4
# Data variables:
# *empty*
ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions: (y: 4)
# Coordinates:
# * y (y) object 32B MultiIndex
# * foo (y) object 32B 'a' 'a' 'b' 'b'
# * bar (y) int64 32B 0 1 0 1
# x (y) int64 32B 1 2 3 4
# Data variables:
# *empty* This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR. I'd be in favor of treating any multi-index passed directly as a single coordinate with tuples (#8140), i.e., ds2.swap_dims(x="y")
# Dimensions: (x: 4)
# Coordinates:
# * y (y) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
# x (y) int64 32B 1 2 3 4
# Data variables:
# *empty* with a warning raised. |
which looks like
I think this behaviour is good ^: we should preserve all variables, just destroy any fancy structure
I support deprecating and deleting this behaviour |
I've been a fan of MultiIndexes and simple pandas compatibility, but definitely agree that the "a 1D array of tuples magically becomes a MultiIndex" behavior would be fine / good to remove... |
rename_vars
followed byswap_dims
andmerge
causes swapped dim to reappear #8646whats-new.rst
api.rst
I've tried here re-implementing
swap_dims
usingrename_dims
,drop_indexes
andset_xindex
. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping apd.MultiIndex
that is promoted to a dimension coordinate inswap-dims
auto-magically results in aPandasMultiIndex
with both dimension and level coordinates).