Refactor swap dims #8911

benbovy · 2024-04-05T08:45:49Z

Attempt at fixing rename_vars followed by swap_dims and merge causes swapped dim to reappear #8646
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

I've tried here re-implementing swap_dims using rename_dims, drop_indexes and set_xindex. This fixes the example in #8646 but unfortunately this fails at handling the pandas multi-index special case (i.e., a single non-dimension coordinate wrapping a pd.MultiIndex that is promoted to a dimension coordinate in swap-dims auto-magically results in a PandasMultiIndex with both dimension and level coordinates).

dcherian · 2024-04-05T15:13:56Z

the pandas multi-index special case

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

dcherian · 2024-04-05T15:34:40Z

Does this fix #8914
(found by hypothesis)

benbovy · 2024-04-05T16:07:49Z

I've been starting to think of multiindex as a model for a user defined custom index wrapping multiple variables.

Yes this is a good model. That's what causes a lot of trouble since the index refactor as we transitioned from a dimension coordinate - with virtual level coordinates - to the model that you describe here.

Would we want to recreate a user's custom index with this swap_dims op? Probably not right?

No I don't think we want this. In both main and this PR no multi-index is recreated:

import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [0, 1]], names=["foo", "bar"])
coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
coords["y"] = ("x", [1, 2, 3, 4])
ds = xr.Dataset(coords=coords)

ds.swap_dims(x="y")
# main: multi-index "x", "foo", "bar" is dropped (although the repr still shows "MultiIndex" for "x" values)
# this PR: ValueError, cannot remove index "x" which would corrupt index built from coordinates "x", "foo", "bar"

The error raised in this PR might make sense, or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*

This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I'd be in favor of treating any multi-index passed directly as a single coordinate with tuples (#8140), i.e.,

ds2.swap_dims(x="y")
# Dimensions:  (x: 4)
# Coordinates:
#   * y        (y) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#     *empty*

with a warning raised.

dcherian · 2024-04-05T16:50:15Z

On main i see no warning:

which looks like

or alternatively we can easily get all coordinates ("x", "foo", "bar") and drop the index.

I think this behaviour is good ^: we should preserve all variables, just destroy any fancy structure

swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:

I support deprecating and deleting this behaviour

max-sixty · 2024-04-17T16:45:29Z

However, like the Dataset constructor and a few other methods, swap_dims supports auto-detecting / auto-promoting pd.MultiIndex objects:
ds2 = xr.Dataset(coords={"y": ("x", midx), "x": [1, 2, 3, 4]})

ds2
# <xarray.Dataset> Size: 64B
# Dimensions:  (x: 4)
# Coordinates:
#     y        (x) object 32B ('a', 0) ('a', 1) ('b', 0) ('b', 1)
#   * x        (x) int64 32B 1 2 3 4
# Data variables:
#     *empty*


ds2.swap_dims(x="y")
# (in main branch)
# <xarray.Dataset> Size: 128B
# Dimensions:  (y: 4)
# Coordinates:
#   * y        (y) object 32B MultiIndex
#   * foo      (y) object 32B 'a' 'a' 'b' 'b'
#   * bar      (y) int64 32B 0 1 0 1
#     x        (y) int64 32B 1 2 3 4
# Data variables:
#    *empty*
This feels too magical to me, and makes the internal logic unnecessarily complicated. I'm not sure if there is an easy way to support that with the refactoring in this PR.

I've been a fan of MultiIndexes and simple pandas compatibility, but definitely agree that the "a 1D array of tuples magically becomes a MultiIndex" behavior would be fine / good to remove...

benbovy added 2 commits April 5, 2024 09:44

add regression test

653fa40

refactor swap_dims using atomic ops

4102b9f

benbovy mentioned this pull request Apr 5, 2024

rename_vars followed by swap_dims and merge causes swapped dim to reappear #8646

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor swap dims #8911

Refactor swap dims #8911

benbovy commented Apr 5, 2024

dcherian commented Apr 5, 2024

dcherian commented Apr 5, 2024 •

edited

Loading

benbovy commented Apr 5, 2024 •

edited

Loading

dcherian commented Apr 5, 2024

max-sixty commented Apr 17, 2024 •

edited

Loading

Refactor swap dims #8911

Are you sure you want to change the base?

Refactor swap dims #8911

Conversation

benbovy commented Apr 5, 2024

dcherian commented Apr 5, 2024

dcherian commented Apr 5, 2024 • edited Loading

benbovy commented Apr 5, 2024 • edited Loading

dcherian commented Apr 5, 2024

max-sixty commented Apr 17, 2024 • edited Loading

dcherian commented Apr 5, 2024 •

edited

Loading

benbovy commented Apr 5, 2024 •

edited

Loading

max-sixty commented Apr 17, 2024 •

edited

Loading