Indexing a MultiIndex with a (Multi)Index #15472

toobaz · 2017-02-22T01:04:08Z

Code Sample, a copy-pastable example if possible

In [2]: s = pd.Series(range(8), index=pd.MultiIndex.from_product([[1,2], [3,4], [3,4]],
                                                                 names=['a', 'b', 'c']))

In [3]: s.loc[s.index] # Works as expected
Out[3]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [4]: s.loc[s.iloc[2:-1].index] # Works as expected
Out[4]: 
a  b  c
1  4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
dtype: int64

In [5]: s.loc[s.index.droplevel('c')] # Just reindexes... weird
Out[5]: 
1  3   NaN
   3   NaN
   4   NaN
   4   NaN
2  3   NaN
   3   NaN
   4   NaN
   4   NaN
dtype: float64

In [6]: s.loc[s.index.droplevel(['b', 'c']), :] # Works (flat index)
Out[6]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [7]: s.loc[s.index.droplevel(['b', 'c'])] #... but fails if I use the shortened notation!
[...]
TypeError: unhashable type: 'Int64Index'

In [8]: s.loc[s.swaplevel('b', 'c')] # Works
Out[8]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [9]: s.loc[s.index.swaplevel('b', 'c')]  # Different result! (reindexes)
Out[9]: 
a  c  b
1  3  3    0
   4  3    2
   3  4    1
   4  4    3
2  3  3    4
   4  3    6
   3  4    5
   4  4    7
dtype: int64

In [10]: s.loc[pd.MultiIndex.from_product([[1,2], [3], [4]],
                                          names=['a', 'c', 'b'])] # Does not respect level names!
Out[10]: 
a  c  b
1  3  4    1
2  3  4    5
dtype: int64

Problem description

This clearly needs a unified approach (and I can try).

Expected Output

Most expected outputs above are obvious, except for In [10] (and maybe In [5], which is already discussed elsewhere). That is, it is not obvious whether level names in the indexer should be matched to level names in the indexed object when both are set (see this comment). It would probably be more pandas-ish if they were.

In other words, while there is no doubt that

Out[10]: 
a  c  b
1  3  4    1
2  3  4    5
dtype: int64

is wrong, we must decide whether we want

Out[10]: 
a  b  c
1  3  4    1
2  3  4    5
dtype: int64

or

Out[10]: 
a  b  c
1  4  3    2
2  4  3    6
dtype: int64
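
For reference, the name-matching alternative (the last output above) can be approximated today by reordering the indexer's levels before calling `.loc`. This is only a sketch of what "matching level names" could mean, not what pandas does internally; the variable names `indexer`, `aligned` and `by_name` are made up for illustration.

```python
import pandas as pd

s = pd.Series(
    range(8),
    index=pd.MultiIndex.from_product([[1, 2], [3, 4], [3, 4]], names=["a", "b", "c"]),
)

# Indexer whose levels are named in a different order than in `s`.
indexer = pd.MultiIndex.from_product([[1, 2], [3], [4]], names=["a", "c", "b"])

# Assumption: "matching by name" means reordering the indexer's levels so
# that they follow the order of the level names in the indexed object.
aligned = indexer.reorder_levels(list(s.index.names))

# `aligned` now holds the tuples (1, 4, 3) and (2, 4, 3) in (a, b, c) order.
by_name = s.loc[aligned]
print(by_name)
```

With the levels aligned this way, the lookup selects the rows where b == 4 and c == 3, rather than matching the indexer's values positionally.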

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.0-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.utf8
LOCALE: it_IT.UTF-8

pandas: 0.19.0+478.g12f2c6a
pytest: 3.0.6
pip: 8.1.2
setuptools: 28.0.0
Cython: 0.23.4
numpy: 1.12.0
scipy: 0.18.1
xarray: None
IPython: 5.1.0.dev
sphinx: 1.4.8
patsy: 0.3.0-dev
dateutil: 2.5.3
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.0
feather: None
matplotlib: 2.0.0rc2
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999
httplib2: 0.9.1
apiclient: 1.5.2
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_datareader: 0.2.1


jreback · 2017-02-23T17:19:05Z

I guess. This is going down a rabbit hole (but one that maybe needs some attention). I am not sure what .loc with an indexer that has fewer levels than the main index should actually do. Can you provide a rationale / use-case here? Why shouldn't we just disallow this completely?

toobaz · 2017-02-23T18:13:30Z

I guess an example would be like:

In [2]: population = pd.DataFrame([['Europe', 'Italy', 'Rome', 2870336],
   ...:                            ['Europe', 'Italy', 'Naples', 975260],
   ...:                            ['Europe', 'France', 'Paris', 2229621],
   ...:                            ['North America', 'USA', 'New York', 19795791]],
   ...:                           columns=['continent', 'country', 'city', 'pop']).set_index(['continent',
   ...:                                                                                       'country',
   ...:                                                                                       'city'])

In [3]: good_pizza = pd.DataFrame([['Europe', 'Italy', True],
   ...:                            ['Europe', 'France', False],
   ...:                            ['North America', 'USA', False]],
   ...:                           columns=['continent', 'country', 'actually']).set_index(['continent', 'country'])['actually']

In [4]: # Worldwide access to good pizza:
   ...: population.loc[good_pizza[good_pizza].index]#.sum()
Out[4]: 
                   pop
continent country     
Europe    Italy    NaN

In [5]: # ... which should return instead the equivalent of...
   ...: population.loc[population.index.droplevel('city').isin(good_pizza[good_pizza].index)]#.sum()
Out[5]: 
                              pop
continent country city           
Europe    Italy   Rome    2870336
                  Naples   975260

Admittedly, this is nothing you couldn't do with some .join() and .reset_index(). But:

1. this is simpler to call and easier to understand
2. it is weird to accept complete (in terms of number of levels) multiindexes and incomplete tuples (as we already do), but not incomplete multiindexes
3. I don't think it would be so complicated to implement. It would actually be very simple if we don't care about level names (just convert to list), and I think not much harder if we do (which would be great - I can try if we like the approach)
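
As a rough illustration of the name-aware variant, the selection could look like the sketch below. `select_by_partial_index` is a hypothetical helper, not a pandas API, and it assumes every level name in the partial index also exists in the indexed object.

```python
import pandas as pd


def select_by_partial_index(obj, partial):
    """Hypothetical helper: keep the rows of `obj` whose values on the
    levels named in `partial` appear in `partial`.

    Assumes all names in `partial` are level names of `obj.index`.
    """
    partial_names = list(partial.names)
    # Levels of the indexed object that the partial index does not mention.
    extra = [name for name in obj.index.names if name not in partial_names]
    reduced = obj.index.droplevel(extra) if extra else obj.index
    if isinstance(reduced, pd.MultiIndex):
        # Match by name: put the remaining levels in the indexer's order.
        reduced = reduced.reorder_levels(partial_names)
    return obj.loc[reduced.isin(partial)]


population = pd.DataFrame(
    [["Europe", "Italy", "Rome", 2870336],
     ["Europe", "Italy", "Naples", 975260],
     ["Europe", "France", "Paris", 2229621],
     ["North America", "USA", "New York", 19795791]],
    columns=["continent", "country", "city", "pop"],
).set_index(["continent", "country", "city"])

good_pizza = pd.DataFrame(
    [["Europe", "Italy", True],
     ["Europe", "France", False],
     ["North America", "USA", False]],
    columns=["continent", "country", "actually"],
).set_index(["continent", "country"])["actually"]

selected = select_by_partial_index(population, good_pizza[good_pizza].index)
print(selected)
```

The helper relies on a boolean mask rather than a label lookup, so it does not depend on how `.loc` treats an incomplete MultiIndex in any given pandas version.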

toobaz · 2017-02-23T18:16:09Z

(4. indexing with a flat index is broken too, and this is really unexpected, so something should be done anyway)

jreback added the API Design, Indexing and MultiIndex labels on Feb 23, 2017

toobaz mentioned this issue Dec 1, 2017

errors and inconsistent behaviour when using a DataFrame or a boolean Series as an index #18579

Closed

mroeschke added Bug and removed API Design labels May 8, 2021
