Indexing a MultiIndex with a (Multi)Index #15472

Open
toobaz opened this issue Feb 22, 2017 · 3 comments
Labels
Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex

Comments

toobaz (Member) commented Feb 22, 2017

Code Sample, a copy-pastable example if possible

In [2]: s = pd.Series(range(8), index=pd.MultiIndex.from_product([[1,2], [3,4], [3,4]],
                                                                 names=['a', 'b', 'c']))

In [3]: s.loc[s.index] # Works as expected
Out[3]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [4]: s.loc[s.iloc[2:-1].index] # Works as expected
Out[4]: 
a  b  c
1  4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
dtype: int64

In [5]: s.loc[s.index.droplevel('c')] # Just reindexes... weird
Out[5]: 
1  3   NaN
   3   NaN
   4   NaN
   4   NaN
2  3   NaN
   3   NaN
   4   NaN
   4   NaN
dtype: float64

In [6]: s.loc[s.index.droplevel(['b', 'c']), :] # Works (flat index)
Out[6]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [7]: s.loc[s.index.droplevel(['b', 'c'])] #... but fails if I use the shortened notation!
[...]
TypeError: unhashable type: 'Int64Index'

In [8]: s.loc[s.swaplevel('b', 'c')] # Works
Out[8]: 
a  b  c
1  3  3    0
      4    1
   4  3    2
      4    3
2  3  3    4
      4    5
   4  3    6
      4    7
dtype: int64

In [9]: s.loc[s.index.swaplevel('b', 'c')]  # Different result! (reindexes)
Out[9]: 
a  c  b
1  3  3    0
   4  3    2
   3  4    1
   4  4    3
2  3  3    4
   4  3    6
   3  4    5
   4  4    7
dtype: int64

In [10]: s.loc[pd.MultiIndex.from_product([[1,2], [3], [4]],
                                          names=['a', 'c', 'b'])] # Does not respect level names!
Out[10]: 
a  c  b
1  3  4    1
2  3  4    5
dtype: int64

Problem description

This clearly needs a unified approach (and I can try to implement one).

Expected Output

I guess most of the expected outputs above are obvious, except for In [10] (and maybe In [5], which is already discussed elsewhere). That is, it is not obvious whether level names in the indexer should be matched to level names in the indexed object when both are set (see this comment). It would probably be more pandas-ish if they were.

In other words, while there is no doubt that

Out[10]: 
a  c  b
1  3  4    1
2  3  4    5
dtype: int64

is wrong, we must decide whether we want

Out[10]: 
a  b  c
1  3  4    1
2  3  4    5
dtype: int64

or

Out[10]: 
a  b  c
1  4  3    2
2  4  3    6
dtype: int64
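
For reference, the second option (matching by level name) can be emulated by reordering the indexer's levels before calling `.loc`. This is only a minimal sketch of the idea, not a proposal for how pandas should implement it; the variable names `indexer` and `aligned` are made up for illustration, and it assumes the indexer has exactly the same level names as the indexed object.

```python
import pandas as pd

s = pd.Series(range(8),
              index=pd.MultiIndex.from_product([[1, 2], [3, 4], [3, 4]],
                                               names=['a', 'b', 'c']))

# Same indexer as In [10]: its levels are named a, c, b (note the order).
indexer = pd.MultiIndex.from_product([[1, 2], [3], [4]],
                                     names=['a', 'c', 'b'])

# Put the indexer's levels in the order used by s, so that labels are
# matched by level name rather than by position.
aligned = indexer.reorder_levels(s.index.names)

result = s.loc[aligned]
print(result)
```

Here `aligned` contains the keys `(1, 4, 3)` and `(2, 4, 3)`, so the selected values are those of the rows with `b == 4` and `c == 3`.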

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.0-1-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.utf8
LOCALE: it_IT.UTF-8

pandas: 0.19.0+478.g12f2c6a
pytest: 3.0.6
pip: 8.1.2
setuptools: 28.0.0
Cython: 0.23.4
numpy: 1.12.0
scipy: 0.18.1
xarray: None
IPython: 5.1.0.dev
sphinx: 1.4.8
patsy: 0.3.0-dev
dateutil: 2.5.3
pytz: 2015.7
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.0
feather: None
matplotlib: 2.0.0rc2
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999
httplib2: 0.9.1
apiclient: 1.5.2
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_datareader: 0.2.1

jreback (Contributor) commented Feb 23, 2017

I guess. This is going down a rabbit hole (but one that maybe needs some attention). I am not sure what .loc with an indexer that has fewer levels than the main index should actually do. Can you provide a rationale / use-case here? Why shouldn't we just disallow this completely?

jreback added the API Design, Indexing and MultiIndex labels on Feb 23, 2017
toobaz (Member, Author) commented Feb 23, 2017

I guess an example would be like:

In [2]: population = pd.DataFrame([['Europe', 'Italy', 'Rome', 2870336],
   ...:                            ['Europe', 'Italy', 'Naples', 975260],
   ...:                            ['Europe', 'France', 'Paris', 2229621],
   ...:                            ['North America', 'USA', 'New York', 19795791]],
   ...:                           columns=['continent', 'country', 'city', 'pop']).set_index(['continent',
   ...:                                                                                       'country',
   ...:                                                                                       'city'])

In [3]: good_pizza = pd.DataFrame([['Europe', 'Italy', True],
   ...:                            ['Europe', 'France', False],
   ...:                            ['North America', 'USA', False]],
   ...:                           columns=['continent', 'country', 'actually']).set_index(['continent', 'country'])['actually']

In [4]: # Worldwide access to good pizza:
   ...: population.loc[good_pizza[good_pizza].index]#.sum()
Out[4]: 
                   pop
continent country     
Europe    Italy    NaN

In [5]: # ... which should return instead the equivalent of...
   ...: population.loc[population.index.droplevel('city').isin(good_pizza[good_pizza].index)]#.sum()
Out[6]: 
                              pop
continent country city           
Europe    Italy   Rome    2870336
                  Naples   975260

Admittedly, nothing you couldn't do with some .join() and .reset_index(). But

  1. this is simpler to call and easier to understand
  2. it is weird to accept complete (in terms of number of levels) multiindexes, incomplete tuples (as we already do) but not incomplete multiindexes.
  3. I don't think it would be so complicated to implement. It would actually be very simple if we don't care about level names (just convert to list), and I think not much harder if we do (which would be great - I can try if we like the approach)
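
To make point 3 concrete, here is a rough sketch of what name-aware selection with an incomplete (Multi)Index could look like in user code. The helper `select_by_partial_index` is hypothetical (it is not a pandas function), and it assumes every level name of the indexer is set and is also present in the indexed object.

```python
import pandas as pd


def select_by_partial_index(obj, indexer):
    """Hypothetical helper: keep the rows of obj whose index entries,
    restricted to the levels named in indexer, appear in indexer."""
    # Levels of obj that the indexer does not mention.
    extra = [name for name in obj.index.names if name not in indexer.names]
    reduced = obj.index.droplevel(extra) if extra else obj.index
    if isinstance(reduced, pd.MultiIndex):
        # Match levels by name, not by position.
        reduced = reduced.reorder_levels(list(indexer.names))
    return obj.loc[reduced.isin(indexer)]


population = pd.DataFrame(
    [['Europe', 'Italy', 'Rome', 2870336],
     ['Europe', 'Italy', 'Naples', 975260],
     ['Europe', 'France', 'Paris', 2229621],
     ['North America', 'USA', 'New York', 19795791]],
    columns=['continent', 'country', 'city', 'pop'],
).set_index(['continent', 'country', 'city'])

good_pizza = pd.DataFrame(
    [['Europe', 'Italy', True],
     ['Europe', 'France', False],
     ['North America', 'USA', False]],
    columns=['continent', 'country', 'actually'],
).set_index(['continent', 'country'])['actually']

# Worldwide access to good pizza.
result = select_by_partial_index(population, good_pizza[good_pizza].index)
print(result)
```

This is the same boolean-mask approach as In [5] above, only wrapped in a function; it filters rows in their original order and does not reindex.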

toobaz (Member, Author) commented Feb 23, 2017

(4. indexing with a flat index is broken too, and this is really unexpected, so something should be done anyway)
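
Until the flat-index case is sorted out, an explicit boolean mask on the relevant level sidesteps `.loc` entirely. The following is just a sketch of that workaround; the indexer `flat` is an example chosen for illustration and is assumed to carry the name of the level it refers to.

```python
import pandas as pd

s = pd.Series(range(8),
              index=pd.MultiIndex.from_product([[1, 2], [3, 4], [3, 4]],
                                               names=['a', 'b', 'c']))

# A flat index whose name says which level of s it should be matched against.
flat = pd.Index([2], name='a')

# Compare the named level of s against the labels in the flat index.
mask = s.index.get_level_values(flat.name).isin(flat)
result = s[mask]
print(result)
```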
