Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiIndex listed multiple times in Dataset.indexes property #6752

Closed
4 tasks done
lukasbindreiter opened this issue Jul 4, 2022 · 8 comments
Closed
4 tasks done

Comments

@lukasbindreiter
Copy link
Contributor

lukasbindreiter commented Jul 4, 2022

What happened?

When upgrading to 2022.6.0.rc0 from 2022.3.0 I noticed a possible unexpected breaking change in the Dataset.indexes property. MultiIndices are now listed for each dimension they apply for as well as once for the multi index itself when accessing dataset.indexes.

What did you expect to happen?

Same behaviour as before, see example below.

Minimal Complete Verifiable Example

# execute with 2022.3.0 and 2022.6.0.rc0 to see the differences
import pandas
import xarray as xr

def _create_multiindex(**kwargs):
    return pandas.MultiIndex.from_arrays(list(kwargs.values()), names=kwargs.keys())


ds = xr.Dataset()
ds.coords["measurement"] = _create_multiindex(
    observation=["A", "A", "B", "B"],
    wavelength=[0.4, 0.5, 0.6, 0.7],
    stokes=["I", "Q", "I", "I"],
)

for name, idx in ds.indexes.items():
    print(name, idx)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

Output with version 2022.3.0:

measurement MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           names=['observation', 'wavelength', 'stokes'])


Output with version 2022.6.0.rc0:

measurement MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
observation MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
wavelength MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')
stokes MultiIndex([('A', 0.4, 'I'),
            ('A', 0.5, 'Q'),
            ('B', 0.6, 'I'),
            ('B', 0.7, 'I')],
           name='measurement')

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, Jan 28 2022, 09:41:12)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.102.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 2022.3.0
pandas: 1.4.3
numpy: 1.23.0
scipy: 1.9.0rc1
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3b3
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
setuptools: 56.0.0
pip: 21.3.1
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: 4.5.0

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, Jan 28 2022, 09:41:12)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.102.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.5
libnetcdf: 4.6.3

xarray: 2022.6.0rc0
pandas: 1.4.3
numpy: 1.23.0
scipy: 1.9.0rc1
netCDF4: 1.5.4
pydap: None
h5netcdf: None
h5py: 3.7.0
Nio: None
zarr: None
cftime: 1.6.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3b3
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.5.2
cartopy: 0.19.0.post1
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 56.0.0
pip: 21.3.1
conda: None
pytest: 7.1.2
IPython: 8.4.0
sphinx: 4.5.0

@lukasbindreiter lukasbindreiter added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 4, 2022
@dcherian
Copy link
Contributor

dcherian commented Jul 5, 2022

Thanks for trying out our pre-release @lukasbindreiter !

This is an intentional change.

Can you tell us more about why this breaks your code?

@dcherian dcherian closed this as completed Jul 5, 2022
@dcherian dcherian reopened this Jul 5, 2022
@dcherian dcherian removed bug needs triage Issue that has not been reviewed by xarray team member labels Jul 5, 2022
@benbovy
Copy link
Member

benbovy commented Aug 31, 2022

The change is because starting from version 2022.6.0, multi-index level coordinates are no longer "virtual" but now correspond to real coordinates. The .indexes and .xindexes properties are mappings relating coordinates to their index.

There has been some discussions prior to the explicit indexes refactor about whether those properties should return a mapping of a unique vs. non-unique index objects. We choose the latter as it simplifies a lot of things internally (and perhaps externally too).

@lukasbindreiter although it is unlikely that we'll change this in the future, it would be interesting to get your feedback! How does this choice impact your workflow?

Note that both .indexes and .xindexes return an Indexes object, which has a convenient .get_unique() method that returns a list of unique index objects. It also has other convenient methods, although those are not well documented yet.

@lukasbindreiter
Copy link
Contributor Author

We used the .indexes property to implement a workaround for serializing Datasets containing multiindices to netCDF. For this the implementation basically looked like this:
(Inspired by and related to this issue: #1077)

Saving dataset as NetCDF:

  1. Loop over dataset.indexes
  2. Check if index is a multi index
  3. If so, encode it somehow and save it as attribute in the dataset
  4. Reset (remove) the index
  5. Now save this "patched" dataset as NetCDF

And then loading it again:

  1. Load the dataset
  2. Check if this special attribute exists
  3. If so decode the multiindex and set it as index in the dataset

When testing the pre-release version I noticed some of our tests failing, which is why I raised this issue in the first place - in case those changes were unwanted. I was not aware that you were actively working on multi index changes and therefore expecting API changes here. With that in mind I'll probably be able to adapt our code to this new API of indexes and xindexes.

@lukasbindreiter
Copy link
Contributor Author

@benbovy I also just tested the get_unique() method that you mentioned and maybe noticed a related issue here, which I'm not sure is wanted / expected.

Taking the above dataset ds, accessing this function results in an error:

> ds.indexes.get_unique()

TypeError: unhashable type: 'MultiIndex'

However, for xindexes it works:

> ds.xindexes.get_unique()

[<xarray.core.indexes.PandasMultiIndex at 0x7f105bf1df20>]

@benbovy
Copy link
Member

benbovy commented Sep 5, 2022

Thanks for the issue report @lukasbindreiter, I opened #6987. As a workaround, you could use Indexes.group_by_index(), which shouldn't have any hash issue and which might be better fitted for your use case.

Regarding (de)serialization (from)to netCDF or other formats, I wonder if building multi-indexes or other custom indexes when opening the dataset couldn't be done via some custom Xarray IO backend (https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html). I'm not sure how easy / hard it is to implement a custom backend on top of an existing one, though. For the serialization, Xarray doesn't support custom writable backends (yet), but since multi-index levels are now real coordinates maybe a custom backend is not really needed. Right now Xarray raises NotImplementedError when trying to save a variable wrapping a multi-index, but probably we could just get rid of the multi-index "dimension" coordinate (tuple elements) and save level coordinates like any other variable.

@lukasbindreiter
Copy link
Contributor Author

Thanks for the suggestions, I'll look into this Indexes.group_by_index and see if that is able to resolve our issue.

And with regards to the (de)serialization: I haven't investigated yet how the index changes in 2022.6 affect our initial usecase, maybe an approach such as the suggested with a custom backend may be even a better solution for us then.
Though probably we would ideally need a way to override NetCDF4BackendEntrypoint with another Entrypoint as the default for .nc files

As for the original issue discussed here: That can probably be closed then, since it was an intentional change.
But finding information about those changes right now was not so easy, is there some resource available where I can read up about the changes to indexes and functions related to them. (e.g. I was unaware of the existence of xindexes or Indexes.group_by_index)?

@benbovy
Copy link
Member

benbovy commented Sep 5, 2022

But finding information about those changes right now was not so easy, is there some resource available where I can read up about the changes to indexes and functions related to them.

Not yet, this still has to be detailed in the documentation (tracked in #6293 along with other todo items related to indexes). The Indexes API already has some basic docstrings, though: https://github.com/pydata/xarray/blob/main/xarray/core/indexes.py#L1008-L1225

@benbovy
Copy link
Member

benbovy commented Sep 5, 2022

That can probably be closed then, since it was an intentional change.

Yes I think we can close it. Thanks for your feedback and for the issue report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants