Inconsistent output when using integer labels in multiindex on both column and index #14969

Open
relativistic opened this issue Dec 22, 2016 · 5 comments
Labels
Docs Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@relativistic

Description of problem

Forgive me if I'm missing a subtlety when using integers for multiindexing, but I seem to be getting inconsistent behavior. Using loc to index both column and index simultaneously doesn't always give the same result. This seems to depend on the data type of the innermost index level.

Example of the expected behavior

The following example works as I'd expect, giving me a dataframe representing the (0,0) label for the outermost index level:

>>> import numpy as np
>>> import pandas as pd
>>> ind = pd.MultiIndex.from_product([[0,1],['A','B','C','D','E']])
>>> df = pd.DataFrame(np.random.rand(10,10), index=ind, columns=ind)
>>> print(df.loc[0,0])

          A         B         C         D         E
A  0.392093  0.167340  0.292854  0.138955  0.575715
B  0.495728  0.062870  0.733270  0.889761  0.141171
C  0.973444  0.518498  0.648546  0.448096  0.383729
D  0.987809  0.697177  0.601228  0.094184  0.986927
E  0.950939  0.109866  0.151390  0.173802  0.855105

Example of the unexpected behavior

However, if I change the data type of the second index level to, for example, floats or ints, loc seems to use positional indexing rather than label-based indexing for the second label. Thus, the same syntax returns a Series holding a single column, rather than a DataFrame.

>>> ind = pd.MultiIndex.from_product([[0,1],np.linspace(0,1,5)])
>>> df = pd.DataFrame(np.random.rand(10,10), index=ind, columns=ind)
>>> print(df.loc[0,0])
0  0.00    0.666874
   0.25    0.023773
   0.50    0.799715
   0.75    0.752675
   1.00    0.935531
1  0.00    0.510080
   0.25    0.845125
   0.50    0.410635
   0.75    0.067144
   1.00    0.658522

Problem description

The problem is that the output is inconsistent. My code breaks depending upon the datatypes used for the indices in a non-obvious way. I would expect things to work as in my first example, with the str dtype used for the second index level. At a minimum, I'd prefer it if the behavior was consistent, regardless of the datatype of the second index level.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@TomAugspurger
Contributor

First, .loc is always using label-based indexing; it just looks like it's falling back to positional. I think this actually comes down to the ambiguity of .loc[0, 0], which expands to .loc[(0, 0)]. Does that last 0 mean the columns or the second level of the index? I'll let Jeff chime in on what's correct here, but I suspect that pandas is behaving as intended. I'll look through the docs...

For consistency, I'd recommend using an IndexSlice:

In [22]: df.loc[pd.IndexSlice[0, :], 0]
Out[22]:
            A         B         C         D         E
0 A  0.874435  0.673136  0.681053  0.352759  0.829466
  B  0.325829  0.646701  0.739708  0.914715  0.297058
  C  0.239715  0.955735  0.503433  0.270841  0.346910
  D  0.389404  0.322453  0.934790  0.889230  0.563052
  E  0.562889  0.764895  0.459072  0.351296  0.054497

In [23]: df2.loc[pd.IndexSlice[0, :], 0]
Out[23]:
            0.00      0.25      0.50      0.75      1.00
0 0.00  0.187160  0.124317  0.139404  0.014958  0.297874
  0.25  0.688006  0.396273  0.032172  0.285215  0.054483
  0.50  0.053673  0.777064  0.504307  0.698933  0.814912
  0.75  0.873036  0.694500  0.305774  0.550135  0.281881
  1.00  0.472582  0.803392  0.162467  0.299709  0.605152

Using that removes the ambiguity, since you specify all the levels of the index:

In [26]: pd.IndexSlice[0, :], 0
Out[26]: ((0, slice(None, None, None)), 0)
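As a minimal sketch of the suggestion above, the snippet below builds both frames from the original report and applies the same IndexSlice-based selection to each. The variable names (`df_str`, `df_flt`, `rows`) are made up for illustration and do not come from the thread; the expectation, based on the outputs shown above, is that both selections come back as DataFrames of the same shape regardless of the inner level's dtype.

```python
import numpy as np
import pandas as pd

# Same construction as in the report: string inner level vs. float inner level.
ind_str = pd.MultiIndex.from_product([[0, 1], ["A", "B", "C", "D", "E"]])
ind_flt = pd.MultiIndex.from_product([[0, 1], np.linspace(0, 1, 5)])

df_str = pd.DataFrame(np.random.rand(10, 10), index=ind_str, columns=ind_str)
df_flt = pd.DataFrame(np.random.rand(10, 10), index=ind_flt, columns=ind_flt)

# The row key names both index levels explicitly; the column key 0 then
# selects the outer column label 0.
rows = pd.IndexSlice[0, :]
block_str = df_str.loc[rows, 0]
block_flt = df_flt.loc[rows, 0]

print(type(block_str).__name__, block_str.shape)
print(type(block_flt).__name__, block_flt.shape)
```

Because the row key already spans every index level, pandas has no reason to read the trailing 0 as part of the row label.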

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Dec 22, 2016
@TomAugspurger
Contributor

We do have the warning here: http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers, which touches on it. The example you provided would make for a fantastic addition to the docs (again, assuming that pandas is doing the correct thing here).

@jreback
Contributor

jreback commented Dec 22, 2016

@TomAugspurger is right, this is correct. Passing df.loc[0,0] is not doing what you think it is, as it's ambiguous.

Here's a clearer way to index:

In [10]: df.loc[(0,0.5):(0,0.75),:]
Out[10]:
               0                                                 1
            0.00      0.25      0.50      0.75      1.00      0.00      0.25      0.50      0.75      1.00
0 0.50  0.560134  0.095912  0.224510  0.726047  0.810821  0.379455  0.596147  0.454783  0.904792  0.129607
  0.75  0.133021  0.106895  0.904825  0.901991  0.044659  0.715370  0.296965  0.097234  0.945662  0.610672

The point is you have to be very explicit and specify all dimensions. There is a very large warning on purpose.
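For anyone who wants to reproduce the explicit form, here is a small self-contained sketch, assuming the float-indexed frame from the original report. The name `subset` is only illustrative. Both axes get their own indexer: a slice between two full row labels, and `:` for the columns.

```python
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([[0, 1], np.linspace(0, 1, 5)])
df = pd.DataFrame(np.random.rand(10, 10), index=ind, columns=ind)

# Each endpoint is a complete row label (outer level, inner level), and the
# trailing ":" says explicitly that all columns are wanted.
subset = df.loc[(0, 0.5):(0, 0.75), :]

print(subset.shape)
print(list(subset.index))
```

Label-based slices in .loc include both endpoints, so the two rows labelled (0, 0.5) and (0, 0.75) are returned with all ten columns.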

So I will repurpose this as a doc issue if you would like to add something.

@jreback jreback added the Docs label Dec 22, 2016
@jreback jreback added this to the 0.20.0 milestone Dec 22, 2016
@relativistic
Author

Okay, thanks. I think I see what is happening then. It can't tell the difference between df.loc[(0,0)] and df.loc[(0),(0)], I'm guessing due to the syntax limitations of python itself. I assume I got my expected behavior when I used strings for the 2nd level because then pandas could tell which interpretation to use by context.
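The Python side of this can be checked without pandas. The sketch below uses a hypothetical `KeyRecorder` class (not part of pandas) whose `__getitem__` simply returns the key it receives, which shows what an indexer like .loc actually gets handed.

```python
class KeyRecorder:
    """Hypothetical helper: hands back whatever key Python passes to []."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# A bare comma inside [] builds a tuple, and (0) is just 0 in parentheses,
# so these three spellings all produce the same key.
print(rec[0, 0])
print(rec[(0, 0)])
print(rec[(0), (0)])

# A one-element tuple needs a trailing comma; this key is distinguishable.
print(rec[(0,), 0])
```

The first three calls all receive the tuple `(0, 0)`, so the receiving object cannot tell them apart, while `rec[(0,), 0]` receives `((0,), 0)`.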

Maybe more of a question for stackexchange, but while we're on topic, @TomAugspurger's suggested syntax removes the first level from the columns from the output, but not the first level of the index. I guess there is no way of doing this query while removing the first level from both?

@jorisvandenbossche
Member

jorisvandenbossche commented Dec 23, 2016

@relativistic You can also use df.loc[(0,),0] instead of df2.loc[pd.IndexSlice[0,:],0]; that does not preserve the first index level (but we should actually check the consistency for such cases).

Personally, I think that df.loc[0, 0] should always expand to df.loc[(0,), (0,)], and not to df.loc[(0,0),]. But I suppose it has long been that way.
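On the question of dropping the first level from both axes: one alternative that does not rely on how .loc interprets tuples is `DataFrame.xs`, applied once per axis. This is not something proposed in the thread, only a sketch of another route, and the name `first_block` is illustrative.

```python
import numpy as np
import pandas as pd

ind = pd.MultiIndex.from_product([[0, 1], np.linspace(0, 1, 5)])
df = pd.DataFrame(np.random.rand(10, 10), index=ind, columns=ind)

# Take the cross-section for outer label 0 on the rows, then on the columns.
# xs drops the selected level by default, so both axes end up single-level.
first_block = df.xs(0, level=0).xs(0, level=0, axis=1)

print(first_block.shape)
print(first_block.index.nlevels, first_block.columns.nlevels)
```

Since each call names the axis and level explicitly, the result should not depend on the dtype of the inner level.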

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022