Inconsistent output when using integer labels in multiindex on both column and index #14969

relativistic · 2016-12-22T22:06:59Z

Description of problem

Forgive me if I'm missing a subtlety when using integers for multiindexing, but I seem to be getting inconsistent behavior. Using loc to index both column and index simultaneously doesn't always give the same result. This seems to depend on the datatype of the innermost index.

Example of the expected behavior

The following example works as I'd expect, giving me a dataframe representing the (0,0) label for the outermost index level:

>>> import numpy as np
>>> import pandas as pd
>>> ind = pd.MultiIndex.from_product([[0,1],['A','B','C','D','E']])
>>> df = pd.DataFrame(np.random.rand(10,10), index=ind, columns=ind)
>>> print(df.loc[0,0])

          A         B         C         D         E
A  0.392093  0.167340  0.292854  0.138955  0.575715
B  0.495728  0.062870  0.733270  0.889761  0.141171
C  0.973444  0.518498  0.648546  0.448096  0.383729
D  0.987809  0.697177  0.601228  0.094184  0.986927
E  0.950939  0.109866  0.151390  0.173802  0.855105

Example of the unexpected behavior

However, if I change the second index level datatype to, for example, floats or ints, loc uses positional indexing rather than label-based indexing for the second label. Thus, the same syntax returns a series of a single column, rather than a dataframe.

>>> ind = pd.MultiIndex.from_product([[0,1],np.linspace(0,1,5)])
>>> df = pd.DataFrame(np.random.rand(10,10), index=ind, columns=ind)
>>> print(df.loc[0,0])
0  0.00    0.666874
   0.25    0.023773
   0.50    0.799715
   0.75    0.752675
   1.00    0.935531
1  0.00    0.510080
   0.25    0.845125
   0.50    0.410635
   0.75    0.067144
   1.00    0.658522

Problem description

The problem is that the output is inconsistent. My code breaks depending upon the datatypes used for the indices in a non-obvious way. I would expect things to work as in my first example, with the str dtype used for the second index level. At a minimum, I'd prefer it if the behavior was consistent, regardless of the datatype of the second index level.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.1.2
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: None
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: None
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

TomAugspurger · 2016-12-22T22:51:42Z

First, .loc is always using label-based indexing; it just looks like it's falling back to positional. I think this actually comes down to the ambiguity of .loc[0, 0], which expands to .loc[(0, 0)]. Does that last 0 mean the columns or the second level of the index? I'll let Jeff chime in on what's correct here, but I suspect that pandas is behaving as intended. I'll look through the docs...

For consistency, I'd recommend using an IndexSlice:

In [22]: df.loc[pd.IndexSlice[0, :], 0]
Out[22]:
            A         B         C         D         E
0 A  0.874435  0.673136  0.681053  0.352759  0.829466
  B  0.325829  0.646701  0.739708  0.914715  0.297058
  C  0.239715  0.955735  0.503433  0.270841  0.346910
  D  0.389404  0.322453  0.934790  0.889230  0.563052
  E  0.562889  0.764895  0.459072  0.351296  0.054497

In [23]: df2.loc[pd.IndexSlice[0, :], 0]
Out[23]:
            0.00      0.25      0.50      0.75      1.00
0 0.00  0.187160  0.124317  0.139404  0.014958  0.297874
  0.25  0.688006  0.396273  0.032172  0.285215  0.054483
  0.50  0.053673  0.777064  0.504307  0.698933  0.814912
  0.75  0.873036  0.694500  0.305774  0.550135  0.281881
  1.00  0.472582  0.803392  0.162467  0.299709  0.605152

Using that removes the ambiguity, since you specify all the levels of the index:

In [26]: pd.IndexSlice[0, :], 0
Out[26]: ((0, slice(None, None, None)), 0)
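
To make the comparison above easy to rerun, here is a minimal sketch that rebuilds both frames and checks the explicit row key against each. The seeded generator and the names `ind_str`, `ind_num`, `rows`, `block` and `block2` are assumptions for illustration; only `df`, `df2` and the index layouts come from the thread.

```python
import numpy as np
import pandas as pd

# Same index layouts as in the report; the seeded generator is only an
# assumption to keep the sketch reproducible.
rng = np.random.default_rng(0)
ind_str = pd.MultiIndex.from_product([[0, 1], ['A', 'B', 'C', 'D', 'E']])
ind_num = pd.MultiIndex.from_product([[0, 1], np.linspace(0, 1, 5)])
df = pd.DataFrame(rng.random((10, 10)), index=ind_str, columns=ind_str)
df2 = pd.DataFrame(rng.random((10, 10)), index=ind_num, columns=ind_num)

# IndexSlice only builds a plain tuple; this row key names both index levels.
rows = pd.IndexSlice[0, :]
assert rows == (0, slice(None, None, None))

# Rows: outer label 0 with every inner label. Columns: outer label 0.
block = df.loc[rows, 0]
block2 = df2.loc[rows, 0]
print(block.shape, block2.shape)
```

Because the row key already names both index levels, the trailing 0 can only refer to the columns, whatever the dtype of the inner level.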

TomAugspurger · 2016-12-22T22:55:14Z

We do have the warning here: http://pandas-docs.github.io/pandas-docs-travis/advanced.html#using-slicers, which touches on it. The example you provided would make for a fantastic addition to the docs (again, assuming that pandas is doing the correct thing here).

jreback · 2016-12-22T23:06:52Z

@TomAugspurger is right, this is correct: passing df.loc[0,0] is not doing what you think it is, as it's ambiguous.

Here's clearer indexing.

In [10]: df.loc[(0,0.5):(0,0.75),:]
Out[10]:
               0                                                 1
            0.00      0.25      0.50      0.75      1.00      0.00      0.25      0.50      0.75      1.00
0 0.50  0.560134  0.095912  0.224510  0.726047  0.810821  0.379455  0.596147  0.454783  0.904792  0.129607
  0.75  0.133021  0.106895  0.904825  0.901991  0.044659  0.715370  0.296965  0.097234  0.945662  0.610672

The point is you have to be very explicit and specify all dimensions. There is a very large warning on purpose.

So I will repurpose this as a doc issue, if you would like to add something.

relativistic · 2016-12-22T23:48:39Z

Okay, thanks. I think I see what is happening then. It can't tell the difference between df.loc[(0,0)] and df.loc[(0),(0)], I'm guessing due to the syntax limitations of python itself. I assume I got my expected behavior when I used strings for the 2nd level because then pandas could tell which interpretation to use by context.
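
The Python side of this can be checked without pandas. The sketch below uses a hypothetical helper class, `KeyEcho` (not part of pandas), whose `__getitem__` returns the key it receives, so it shows exactly what an indexer such as `.loc` is handed before it does any interpretation of its own.

```python
class KeyEcho:
    """Hypothetical helper: returns whatever key Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


echo = KeyEcho()

# obj[0, 0] and obj[(0, 0)] arrive as the same tuple.
assert echo[0, 0] == echo[(0, 0)] == (0, 0)

# (0) is just the integer 0, so this is the same key again.
assert echo[(0), (0)] == (0, 0)

# One-element tuples need a trailing comma; these keys are distinguishable.
assert echo[(0,), (0,)] == ((0,), (0,))
assert echo[(0, 0),] == ((0, 0),)
```

So the first three spellings are indistinguishable by the time the indexer sees them, and only the nested-tuple forms carry enough structure to say which axis each label belongs to.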

Maybe more of a question for stackexchange, but while we're on topic, @TomAugspurger's suggested syntax removes the first level from the columns from the output, but not the first level of the index. I guess there is no way of doing this query while removing the first level from both?

jorisvandenbossche · 2016-12-23T00:00:11Z

@relativistic You can also use df.loc[(0,),0] instead of df2.loc[pd.IndexSlice[0,:],0]; that does not preserve the first index level (but we should actually check the consistency for such cases).

Personally, I think that df.loc[0, 0] should always expand to df.loc[(0,), (0,)], and not to df.loc[(0,0),]. But I suppose this has long been that way ..
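
For the question of dropping the outer level on both axes, one option that sidesteps the tuple-expansion question is to chain `DataFrame.xs` with an explicit `level` and `axis`. This is only a sketch and was not proposed in the thread; the `np.arange` data and the name `inner` are assumptions chosen so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Float inner level, as in the second example of the report.
ind = pd.MultiIndex.from_product([[0, 1], np.linspace(0, 1, 5)])
# Deterministic values (an assumption) so the selected block is recognisable.
df2 = pd.DataFrame(np.arange(100.0).reshape(10, 10), index=ind, columns=ind)

# Select outer label 0 on the rows, then on the columns. xs drops the
# selected level by default (drop_level=True).
inner = df2.xs(0, level=0, axis=0).xs(0, level=0, axis=1)
print(inner.shape)
print(inner.index.nlevels, inner.columns.nlevels)
```

Each call names its axis and level, so the dtype of the inner labels plays no part in how the key is read.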

TomAugspurger added the Indexing and MultiIndex labels Dec 22, 2016

jreback added the Docs label Dec 22, 2016

jreback added this to the 0.20.0 milestone Dec 22, 2016

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

This was referenced Jul 30, 2017

BUG: Allow Series with same name with crosstab (#13279) #16028

Merged

Unstacking a MultiIndex with integer names is ambiguous #17123

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
