Scalar slice of MultiIndex is turned to tuples #3432

Hoeze · 2019-10-22T18:55:52Z

Today I updated to v0.14 of xarray and it broke some of my code.

I tried to select one observation of the following dataset:

<xarray.Dataset>
Dimensions:       (genes: 31523, observations: 236)
Coordinates:
  * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000232254'
  * observations  (observations) MultiIndex
  - individual    (observations) object 'GTEX-111YS' ... 'GTEX-ZXG5'
  - subtissue     (observations) object 'Whole_Blood' ... 'Whole_Blood'
Data variables:
    [...]

ds.isel(observations=1):

<xarray.Dataset>
Dimensions:       (genes: 31523)
Coordinates:
  * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000232254'
    observations  object ('GTEX-1122O', 'Whole_Blood')
Data variables:
    [...]

As you can see, observations is now a tuple of ('GTEX-1122O', 'Whole_Blood').
However, the individual and the subtissue should be kept as coordinates.

Output of `xr.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.2
distributed: 2.5.2
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.8.0
sphinx: None

max-sixty · 2019-10-22T19:36:51Z

Do you have a reproducible example, as per the issue instructions?

Hoeze · 2019-10-22T20:14:38Z

@max-sixty here you go:

import xarray as xr

print(xr.__version__)

ds = xr.Dataset({
    "test": xr.DataArray(
        [[[1,2],[3,4]], [[1,2],[3,4]]], 
        dims=("genes", "individuals", "subtissues"), 
        coords={
            "genes": ["a", "b"], 
            "individuals": ["c", "d"], 
            "subtissues": ["e", "f"],
        }
    )
})
print(ds)

stacked = ds.stack(observations=["individuals", "subtissues"]) 
print(stacked)

print(stacked.isel(observations=1))

result:

<xarray.Dataset>
Dimensions:       (genes: 2)
Coordinates:
  * genes         (genes) <U1 'a' 'b'
    observations  object ('c', 'f')
Data variables:
    test          (genes) int64 2 2

crusaderky · 2019-10-22T21:07:19Z

Not a regression. I've gone back as far as xarray 0.12 and pandas 0.19 and it's always been like this.
I agree it's bad and needs to be fixed though.

The issue is inherited straight from pandas:

>>> df = stacked.test.to_pandas()
>>> df

individuals  c     d   
subtissues   e  f  e  f
genes                  
a            1  2  3  4
b            1  2  3  4
>>> df.iloc[:, 1]

genes
a    2
b    2
Name: (c, f), dtype: int64
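
For readers who want to check the pandas behaviour without going through xarray, here is a minimal sketch that rebuilds the same frame by hand. The construction via `MultiIndex.from_product` is an assumption made for illustration; it is not how `to_pandas()` builds the frame internally.

```python
import pandas as pd

# Rebuild the frame shown above: genes as rows and a two-level
# (individuals, subtissues) MultiIndex as columns.
columns = pd.MultiIndex.from_product(
    [["c", "d"], ["e", "f"]], names=["individuals", "subtissues"]
)
df = pd.DataFrame(
    [[1, 2, 3, 4], [1, 2, 3, 4]],
    index=pd.Index(["a", "b"], name="genes"),
    columns=columns,
)

# Selecting a single column by position returns a Series whose name is a
# plain tuple; the level names are not attached to it.
column = df.iloc[:, 1]
print(type(column.name), column.name)

# The level names survive only on the original MultiIndex, so pairing them
# with the tuple has to be done by hand.
level_values = dict(zip(df.columns.names, column.name))
print(level_values)
```

The tuple carries the level values but not the level names, which is the information that goes missing in the xarray output.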

I'm not sure if we should write an ad-hoc object in xarray for scalar multiindices.

The alternative is to think of a more systematic solution in pandas, which likely implies creating an ad-hoc subclass of tuple which is basically a pickle-able namedtuple.
It must be a subclass of tuple otherwise it will break things for a lot of people around the world (the userbase of pandas is MUCH larger than xarray's). And it must be serializable for obvious reasons.

In both cases, the size of this change is very large.

The third and significantly easier option is that, on sel/isel, xarray should automatically unstack any scalar slices of a multiindex. Meaning that the 'observations' coord would simply disappear, leaving only 'individuals' and 'subtissues'.
However, this would carry the problem that, if someone takes a scalar slice and a vector slice from the same dimension, they won't be able to concatenate them back together.
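
As a rough user-side approximation of what this third option would give, the level values can be read from the underlying pandas MultiIndex and attached as scalar coordinates by hand. This is only a sketch of a workaround, not the change proposed for xarray itself; the names `position` and `level_values` are made up for illustration, and the behaviour of `assign_coords` here is assumed to be the same across xarray versions.

```python
import xarray as xr

# Same toy dataset as in the reproducible example above.
ds = xr.Dataset({
    "test": xr.DataArray(
        [[[1, 2], [3, 4]], [[1, 2], [3, 4]]],
        dims=("genes", "individuals", "subtissues"),
        coords={
            "genes": ["a", "b"],
            "individuals": ["c", "d"],
            "subtissues": ["e", "f"],
        },
    )
})
stacked = ds.stack(observations=["individuals", "subtissues"])

position = 1  # hypothetical name: the scalar position being selected

# Look up the level names and values on the pandas MultiIndex *before*
# the scalar selection collapses them into a tuple.
index = stacked.indexes["observations"]
level_values = dict(zip(index.names, index[position]))

# Attach each level as its own scalar coordinate on the selected slice.
selected = stacked.isel(observations=position).assign_coords(**level_values)
print(selected)
```

The `observations` tuple coordinate is left in place here; whether to drop it is a separate choice and is tied to the concatenation concern raised above.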

@shoyer what's your opinion?

shoyer · 2019-10-22T23:24:45Z

I think the right long-term solution for xarray is to always store separate Variable objects for MultiIndex levels, and only use the MultiIndex for proper indexing. When you index out a single value, the MultiIndex will naturally disappear and you'll be left with a bunch of scalar coordinates, without any special case logic to handle the MultiIndex.

This looks like @crusaderky's third option.

We'll need to finish up the big "explicit indexes" refactor first to make this viable.

benbovy · 2021-09-15T12:38:29Z

@Hoeze this is now implemented in #5692 (stack is not yet refactored so I reproduced your example in a slightly different way):

>>> stacked.isel(observations=1)
<xarray.Dataset>
Dimensions:       (genes: 2)
Coordinates:
  * genes         (genes) <U1 'a' 'b'
    observations  object ('c', 'f')
    individuals   <U1 'c'
    subtissues    <U1 'f'
Data variables:
    test          (genes) int64 2 2

crusaderky changed the title ~~Regression in v0.14? Dimensions are being dropped!~~ Regression in v0.14? MultiIndex is turned to tuples Oct 22, 2019

crusaderky changed the title ~~Regression in v0.14? MultiIndex is turned to tuples~~ Scalar slice of MultiIndex is turned to tuples Oct 22, 2019

dcherian added the topic-indexing label Sep 1, 2020

benbovy mentioned this issue Sep 15, 2021

Explicit indexes #5692

shoyer closed this as completed in #5692 Mar 17, 2022
