Scalar slice of MultiIndex is turned to tuples #3432

Closed
Hoeze opened this issue Oct 22, 2019 · 5 comments · Fixed by #5692


Hoeze commented Oct 22, 2019

Today I updated to v0.14 of xarray and it broke some of my code.

I tried to select one observation of the following dataset:

<xarray.Dataset>
Dimensions:       (genes: 31523, observations: 236)
Coordinates:
  * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000232254'
  * observations  (observations) MultiIndex
  - individual    (observations) object 'GTEX-111YS' ... 'GTEX-ZXG5'
  - subtissue     (observations) object 'Whole_Blood' ... 'Whole_Blood'
Data variables:
    [...]

ds.isel(observations=1):

<xarray.Dataset>
Dimensions:       (genes: 31523)
Coordinates:
  * genes         (genes) object 'ENSG00000227232' ... 'ENSG00000232254'
    observations  object ('GTEX-1122O', 'Whole_Blood')
Data variables:
    [...]

As you can see, observations is now a tuple of ('GTEX-1122O', 'Whole_Blood').
However, the individual and the subtissue should be kept as coordinates.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-514.16.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.14.0
pandas: 0.25.1
numpy: 1.17.2
scipy: 1.3.1
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: 2.3.2
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.5.2
distributed: 2.5.2
matplotlib: 3.1.1
cartopy: None
seaborn: 0.9.0
numbagg: None
setuptools: 41.4.0
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.8.0
sphinx: None

@max-sixty
Collaborator

Do you have a reproducible example, as per the issue instructions?

@Hoeze
Author

Hoeze commented Oct 22, 2019

@max-sixty here you go:

import xarray as xr

print(xr.__version__)

ds = xr.Dataset({
    "test": xr.DataArray(
        [[[1,2],[3,4]], [[1,2],[3,4]]], 
        dims=("genes", "individuals", "subtissues"), 
        coords={
            "genes": ["a", "b"], 
            "individuals": ["c", "d"], 
            "subtissues": ["e", "f"],
        }
    )
})
print(ds)

stacked = ds.stack(observations=["individuals", "subtissues"]) 
print(stacked)

print(stacked.isel(observations=1))

result:

<xarray.Dataset>
Dimensions:       (genes: 2)
Coordinates:
  * genes         (genes) <U1 'a' 'b'
    observations  object ('c', 'f')
Data variables:
    test          (genes) int64 2 2

crusaderky changed the title from "Regression in v0.14? Dimensions are being dropped!" to "Regression in v0.14? MultiIndex is turned to tuples" on Oct 22, 2019
@crusaderky
Contributor

Not a regression. I've gone back as far as xarray 0.12 and pandas 0.19 and it's always been like this.
I agree it's bad and needs to be fixed though.

The issue is inherited straight from pandas:

>>> df = stacked.test.to_pandas()
>>> df

individuals  c     d   
subtissues   e  f  e  f
genes                  
a            1  2  3  4
b            1  2  3  4
>>> df.iloc[:, 1]

genes
a    2
b    2
Name: (c, f), dtype: int64
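
The same behaviour can be seen on a bare pandas MultiIndex, without going through xarray. The following is a minimal sketch; the variable names (`mi`, `scalar`, `sliced`) are chosen here for illustration and do not come from the issue. It contrasts a scalar selection, which yields a plain tuple with no level names, with a length-1 slice, which stays a MultiIndex.

```python
import pandas as pd

# Build the same 2 x 2 product index as in the reproducer above.
mi = pd.MultiIndex.from_product(
    [["c", "d"], ["e", "f"]],
    names=["individuals", "subtissues"],
)

# Scalar selection: pandas hands back a plain Python tuple.
scalar = mi[1]
print(scalar)                 # ('c', 'f')
print(type(scalar).__name__)  # tuple

# Slice selection: the result is still a MultiIndex and keeps its level names.
sliced = mi[1:2]
print(list(sliced.names))     # ['individuals', 'subtissues']
```

The tuple carries the labels but not the level names, which is why the `observations` coordinate loses its `individual`/`subtissue` structure after a scalar `isel`.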

I'm not sure if we should write an ad-hoc object in xarray for scalar multiindices.

The alternative is to think of a more systematic solution in pandas, which likely implies creating an ad-hoc subclass of tuple which is basically a pickle-able namedtuple.
It must be a subclass of tuple otherwise it will break things for a lot of people around the world (the userbase of pandas is MUCH larger than xarray's). And it must be serializable for obvious reasons.
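
To make the second option more concrete, here is a rough sketch of what such a tuple subclass might look like. `LabeledTuple` and its `names` attribute are hypothetical names invented for this illustration; nothing like it exists in pandas or xarray as far as this thread shows. It stays a real `tuple` (so existing code that compares against tuples keeps working) and defines `__reduce__` so that the level names survive serialization.

```python
class LabeledTuple(tuple):
    """Hypothetical tuple subclass that remembers MultiIndex level names."""

    def __new__(cls, values, names):
        self = super().__new__(cls, values)
        names = tuple(names)
        if len(names) != len(self):
            raise ValueError("need exactly one name per value")
        self.names = names
        return self

    def __reduce__(self):
        # Tell pickle/copy how to rebuild the object, including the names.
        return (type(self), (tuple(self), self.names))

    def as_dict(self):
        # Convenience view: level name -> scalar label.
        return dict(zip(self.names, self))

    def __repr__(self):
        return f"LabeledTuple({self.as_dict()!r})"


label = LabeledTuple(("c", "f"), ("individuals", "subtissues"))
print(label == ("c", "f"))  # True
print(label.as_dict())      # {'individuals': 'c', 'subtissues': 'f'}
```

If the class is defined at module level, `pickle.loads(pickle.dumps(label))` should round-trip both the values and the names, since pickle reconstructs the object through `__reduce__`.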

In both cases, the size of this change is very large.

The third and significantly easier option is that, on sel/isel, xarray should automatically unstack any scalar slices of a multiindex. Meaning that the 'observations' coord would simply disappear, leaving only 'individuals' and 'subtissues'.
However, it would carry the problem that, if one cuts a scalar slice and a vector slice from the same dimension, they can no longer be concatenated back together.
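
A minimal sketch of the third option, written against pandas only: `scalar_levels` is a hypothetical helper named here for illustration, not an xarray or pandas function. It splits one position of a MultiIndex into one scalar per level, which is roughly what the scalar `individuals` and `subtissues` coordinates would hold after automatic unstacking.

```python
import pandas as pd


def scalar_levels(index: pd.MultiIndex, position: int) -> dict:
    """Return {level name: scalar label} for one position of a MultiIndex."""
    return dict(zip(index.names, index[position]))


mi = pd.MultiIndex.from_product(
    [["c", "d"], ["e", "f"]],
    names=["individuals", "subtissues"],
)

coords = scalar_levels(mi, 1)
print(coords)  # {'individuals': 'c', 'subtissues': 'f'}
```

The result no longer contains an `observations` entry at all, which illustrates the concatenation concern: a vector slice would still be indexed by `observations`, while the scalar slice would not.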

@shoyer what's your opinion?

crusaderky changed the title from "Regression in v0.14? MultiIndex is turned to tuples" to "Scalar slice of MultiIndex is turned to tuples" on Oct 22, 2019
@shoyer
Member

shoyer commented Oct 22, 2019

I think the right long-term solution for xarray is to always store separate Variable objects for MultiIndex levels, and only use the MultiIndex for proper indexing. When you index out a single value, the MultiIndex will naturally disappear and you'll be left with a bunch of scalar coordinates, without any special case logic to handle the MultiIndex.

This looks like @crusaderky's third option.

We'll need to finish up the big "explicit indexes" refactor first to make this viable.
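
The idea can be illustrated with a toy model in plain Python. This is only a sketch of the design, not how xarray implements it: `StackedDim`, `isel`, `sel`, and `_lookup` are hypothetical names made up for this example. Each level is stored as its own sequence, and the combined lookup table is used only to translate labels into positions.

```python
class StackedDim:
    """Toy model: one list per level, plus a lookup used only for indexing."""

    def __init__(self, levels):
        # levels: dict mapping level name -> list of labels (all the same length)
        self.levels = levels
        rows = zip(*levels.values())
        # Auxiliary "index": tuple of labels -> integer position.
        self._lookup = {row: pos for pos, row in enumerate(rows)}

    def isel(self, position):
        # Positional selection reads each level separately, so the result is
        # simply one scalar per level; no tuple ever appears.
        return {name: values[position] for name, values in self.levels.items()}

    def sel(self, **labels):
        # Label-based selection goes through the lookup, then reuses isel.
        key = tuple(labels[name] for name in self.levels)
        return self.isel(self._lookup[key])


dim = StackedDim({
    "individuals": ["c", "c", "d", "d"],
    "subtissues": ["e", "f", "e", "f"],
})
print(dim.isel(1))                               # {'individuals': 'c', 'subtissues': 'f'}
print(dim.sel(individuals="d", subtissues="e"))  # {'individuals': 'd', 'subtissues': 'e'}
```

Because the levels live in separate containers, selecting a single element needs no special handling: the combined index simply is not part of the result.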

@benbovy
Member

benbovy commented Sep 15, 2021

@Hoeze this is now implemented in #5692 (stack is not yet refactored so I reproduced your example in a slightly different way):

>>> stacked.isel(observations=1)
<xarray.Dataset>
Dimensions:       (genes: 2)
Coordinates:
  * genes         (genes) <U1 'a' 'b'
    observations  object ('c', 'f')
    individuals   <U1 'c'
    subtissues    <U1 'f'
Data variables:
    test          (genes) int64 2 2

@benbovy benbovy mentioned this issue Sep 15, 2021