groupby fails when MultiIndex contains Int64Index in an empty DataFrame in 1.0.0 #31670

jdfinsf · 2020-02-04T22:53:05Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    [[123, "a", 1.0], [123, "b", 2.0]],
    columns=["id", "category", "value"]
)
df = df.set_index(["id", "category"])
# The filter matches no rows, so groupby runs on an empty DataFrame.
df[df.value < 0].groupby("id").sum()

Problem description

When groupby is applied to an Int64Index level of a MultiIndex on an empty DataFrame, it fails with the error: ValueError: Unable to fill values because Int64Index cannot contain NA

Expected Output

The groupby should not raise an error. Instead, the code above should return an empty DataFrame, as df[df.value < 0].groupby("category").sum() does.
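
To make the contrast between the two levels easy to check on a given install, here is a minimal sketch that runs both groupbys on the empty frame and records whether each one raises. The helper name `try_groupby` is hypothetical and only used for illustration. On 1.0.0 the "id" case is expected to report the ValueError described above, while on a version that contains the fix both cases should return an empty frame.

```python
import pandas as pd


def try_groupby(frame, level_name):
    """Return ("ok", result) or ("error", message) for a grouped sum."""
    try:
        return "ok", frame.groupby(level_name).sum()
    except ValueError as exc:
        return "error", str(exc)


df = pd.DataFrame(
    [[123, "a", 1.0], [123, "b", 2.0]],
    columns=["id", "category", "value"],
).set_index(["id", "category"])

# No row has a negative value, so this frame is empty.
empty = df[df.value < 0]

# Group by the integer level and by the string level in turn.
outcomes = {name: try_groupby(empty, name) for name in ["id", "category"]}
for name, (status, _detail) in outcomes.items():
    print(pd.__version__, name, status)
```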

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.0
numpy            : 1.18.1
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 18.1
setuptools       : 41.6.0
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.3 (dt dec pq3 ext lo64)
jinja2           : 2.10.1
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.13.0
pytables         : None
pytest           : 5.1.2
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.6
tables           : None
tabulate         : 0.8.6
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None
numba            : None


TomAugspurger · 2020-02-04T23:00:54Z

Can you post a traceback? I don't see an exception on master or 1.0.0.

jdfinsf · 2020-02-04T23:04:15Z

<ipython-input-59-8a14c8745bf5> in <module>
      1 df = pd.DataFrame([[123, 'a', 1.0], [123, 'b', 2.0]], columns=["id", "category", "value"]).set_index(["id", "category"])
----> 2 df[df.value < 0].groupby('id').sum()

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5805             group_keys=group_keys,
   5806             squeeze=squeeze,
-> 5807             observed=observed,
   5808         )
   5809 

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    623                 in_axis=in_axis,
    624             )
--> 625             if not isinstance(gpr, Grouping)
    626             else gpr
    627         )

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
    285                 self._codes,
    286                 self._group_index,
--> 287             ) = index._get_grouper_for_level(self.grouper, level)
    288 
    289         # a passed Grouper like, directly get the grouper in the same way

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/indexes/multi.py in _get_grouper_for_level(self, mapper, level)
   1265             grouper = level_index.take(codes)
   1266         else:
-> 1267             grouper = level_index.take(codes, fill_value=True)
   1268 
   1269         return grouper, codes, level_index

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/indexes/base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
    759                 cls_name = type(self).__name__
    760                 raise ValueError(
--> 761                     f"Unable to fill values because {cls_name} cannot contain NA"
    762                 )
    763             taken = self.values.take(indices)

ValueError: Unable to fill values because Int64Index cannot contain NA
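
The last two frames of the traceback suggest the error comes from calling take with a fill_value on an integer index. The sketch below tries to isolate that call outside of groupby. It rests on assumptions: a one-element integer index stands in for the "id" level, and an empty codes array stands in for the empty frame. The variable names are borrowed from the traceback for readability, not taken from a verified reading of the pandas source.

```python
import numpy as np
import pandas as pd

# Stand-in for the "id" level (assumption: only the integer dtype matters).
level_index = pd.Index([123], dtype="int64")
# An empty frame has no rows, so there are no codes to look up.
codes = np.array([], dtype="intp")

# A plain positional take works, even with no codes.
plain = level_index.take(codes)

# Asking for NA filling is rejected, because an integer index cannot hold NA.
try:
    level_index.take(codes, fill_value=True)
    message = None
except ValueError as exc:
    message = str(exc)

print(len(plain), message)
```

The class name in the message can differ between pandas versions (Int64Index in 1.0.0), so only the "cannot contain NA" part should be relied on.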

jorisvandenbossche · 2020-02-05T08:04:29Z

I also couldn't reproduce this in my default environment, and the main difference seemed to be the Python version. And indeed, after creating an env with python=3.6 and pandas=1.0, I get this as well:

In [1]: df = pd.DataFrame( 
   ...:     [[123, "a", 1.0], [123, "b", 2.0]], 
   ...:     columns=["id", "category", "value"] 
   ...: ) 
   ...: df = df.set_index(["id", "category"]) 
   ...: df[df.value < 0].groupby("id").sum()   
...
~/miniconda3/envs/pandas10py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
    759                 cls_name = type(self).__name__
    760                 raise ValueError(
--> 761                     f"Unable to fill values because {cls_name} cannot contain NA"
    762                 )
    763             taken = self.values.take(indices)

ValueError: Unable to fill values because Int64Index cannot contain NA

jorisvandenbossche · 2020-02-05T08:23:39Z

Sorry, I didn't test it correctly in my 1.0.0 env. I get the error there as well, regardless of the Python version.
But it's already fixed on master, and it seems #29243 is responsible for the fix.
I will backport that PR, and then we can add an additional test.

jorisvandenbossche · 2020-02-05T09:07:36Z

Backporting the fix in #31689 and adding a test + whatsnew in #31690
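
For readers who want to guard against this regression in their own code, a check along the following lines might do. It is only a sketch built from the example in the report, not the test that was added in #31690, and the function name is hypothetical. It is expected to fail on 1.0.0 and pass on versions that include the fix.

```python
import pandas as pd


def test_groupby_empty_frame_with_int_multiindex_level():
    # Hypothetical test name; the data mirrors the original report.
    df = pd.DataFrame(
        [[123, "a", 1.0], [123, "b", 2.0]],
        columns=["id", "category", "value"],
    ).set_index(["id", "category"])
    empty = df[df.value < 0]

    # Must not raise, and should give back an empty frame keyed by "id".
    result = empty.groupby("id").sum()

    assert result.empty
    assert list(result.columns) == ["value"]
    assert result.index.name == "id"


test_groupby_empty_frame_with_int_multiindex_level()
```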

jorisvandenbossche added the Groupby and Regression (functionality that used to work in a prior pandas version) labels Feb 5, 2020

jorisvandenbossche added this to the 1.0.1 milestone Feb 5, 2020

jorisvandenbossche mentioned this issue Feb 5, 2020

TST: add test for regression in groupby with empty MultiIndex level #31690

Merged

jorisvandenbossche closed this as completed in #31690 Feb 5, 2020
