groupby fails when MultiIndex contains Int64Index in an empty DataFrame in 1.0.0 #31670

Closed
jdfinsf opened this issue Feb 4, 2020 · 5 comments · Fixed by #31690
Labels
Groupby Regression Functionality that used to work in a prior pandas version
@jdfinsf

jdfinsf commented Feb 4, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(
    [[123, "a", 1.0], [123, "b", 2.0]],
    columns=["id", "category", "value"]
)
df = df.set_index(["id", "category"])
# The filter matches no rows, so the groupby runs on an empty DataFrame.
df[df.value < 0].groupby("id").sum()

Problem description

When an empty DataFrame is grouped by a MultiIndex level that is an Int64Index, the groupby fails with the error: ValueError: Unable to fill values because Int64Index cannot contain NA

Expected Output

The groupby should not raise an error. Instead, the code above should output an empty DataFrame, as already happens for df[df.value < 0].groupby("category").sum()

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.6.9.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.0.0
numpy            : 1.18.1
pytz             : 2019.2
dateutil         : 2.8.0
pip              : 18.1
setuptools       : 41.6.0
Cython           : None
pytest           : 5.1.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.3 (dt dec pq3 ext lo64)
jinja2           : 2.10.1
IPython          : 7.8.0
pandas_datareader: None
bs4              : 4.8.0
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.13.0
pytables         : None
pytest           : 5.1.2
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.6
tables           : None
tabulate         : 0.8.6
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None
numba            : None
@TomAugspurger
Contributor

Can you post a traceback? I don't see an exception on master or 1.0.0.

@jdfinsf
Author

jdfinsf commented Feb 4, 2020

<ipython-input-59-8a14c8745bf5> in <module>
      1 df = pd.DataFrame([[123, 'a', 1.0], [123, 'b', 2.0]], columns=["id", "category", "value"]).set_index(["id", "category"])
----> 2 df[df.value < 0].groupby('id').sum()

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5805             group_keys=group_keys,
   5806             squeeze=squeeze,
-> 5807             observed=observed,
   5808         )
   5809 

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    623                 in_axis=in_axis,
    624             )
--> 625             if not isinstance(gpr, Grouping)
    626             else gpr
    627         )

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis)
    285                 self._codes,
    286                 self._group_index,
--> 287             ) = index._get_grouper_for_level(self.grouper, level)
    288 
    289         # a passed Grouper like, directly get the grouper in the same way

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/indexes/multi.py in _get_grouper_for_level(self, mapper, level)
   1265             grouper = level_index.take(codes)
   1266         else:
-> 1267             grouper = level_index.take(codes, fill_value=True)
   1268 
   1269         return grouper, codes, level_index

~/.pyenv/versions/3.6.9/envs/sf36/lib/python3.6/site-packages/pandas/core/indexes/base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
    759                 cls_name = type(self).__name__
    760                 raise ValueError(
--> 761                     f"Unable to fill values because {cls_name} cannot contain NA"
    762                 )
    763             taken = self.values.take(indices)

ValueError: Unable to fill values because Int64Index cannot contain NA
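The last two frames of the traceback can be reproduced in isolation. The following is a minimal sketch, assuming that `Index.take` behaves the same way when called directly as it does when reached through `_get_grouper_for_level`; the names `int_level`, `obj_level` and `codes` are illustrative stand-ins for the emptied MultiIndex levels and are not pandas internals. It suggests why grouping by "category" works while grouping by "id" fails: an object-dtype index can hold NA, an integer index cannot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two MultiIndex levels once every row
# has been filtered out (both levels end up empty).
int_level = pd.Index([], dtype="int64")   # like the "id" level
obj_level = pd.Index([], dtype="object")  # like the "category" level
codes = np.array([], dtype=np.intp)

# An object-dtype index can hold NA, so a take with a fill value succeeds.
taken = obj_level.take(codes, fill_value=True)

# An integer index cannot hold NA, so the same call raises ValueError,
# even though there is nothing to fill.
try:
    int_level.take(codes, fill_value=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(len(taken), error_message)
```

The class name in the message depends on the pandas version (Int64Index in 1.0.0), so the sketch only relies on the "cannot contain NA" part.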

@jorisvandenbossche
Member

I also couldn't reproduce in my default environment, and the main difference seemed to be the python version. And indeed, creating an env with python=3.6 and pandas=1.0, I get this as well:

In [1]: df = pd.DataFrame( 
   ...:     [[123, "a", 1.0], [123, "b", 2.0]], 
   ...:     columns=["id", "category", "value"] 
   ...: ) 
   ...: df = df.set_index(["id", "category"]) 
   ...: df[df.value < 0].groupby("id").sum()   
...
~/miniconda3/envs/pandas10py36/lib/python3.6/site-packages/pandas/core/indexes/base.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
    759                 cls_name = type(self).__name__
    760                 raise ValueError(
--> 761                     f"Unable to fill values because {cls_name} cannot contain NA"
    762                 )
    763             taken = self.values.take(indices)

ValueError: Unable to fill values because Int64Index cannot contain NA

@jorisvandenbossche jorisvandenbossche added Groupby Regression Functionality that used to work in a prior pandas version labels Feb 5, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.0.1 milestone Feb 5, 2020
@jorisvandenbossche
Member

Sorry, I didn't test it correctly in my 1.0.0 env. I get the error there as well, regardless of the Python version.
But it's already fixed on master, and it seems #29243 is responsible for the fix.
I will backport that PR, and then we can add an additional test.

@jorisvandenbossche
Member

Backporting the fix in #31689 and adding a test + whatsnew in #31690
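For readers who want to check their own installation, a regression check along these lines could be used. This is only a sketch built from the code sample in this issue, not necessarily the test added in #31690; the function name `check_groupby_empty_multiindex_int_level` is hypothetical, and the expected index dtype (int64) is an assumption about the fixed behaviour.

```python
import numpy as np
import pandas as pd
import pandas.testing as tm


def check_groupby_empty_multiindex_int_level():
    """Group an empty frame by an integer MultiIndex level (GH 31670)."""
    df = pd.DataFrame(
        [[123, "a", 1.0], [123, "b", 2.0]],
        columns=["id", "category", "value"],
    ).set_index(["id", "category"])

    # No row has a negative value, so this frame is empty.
    empty = df[df.value < 0]
    result = empty.groupby("id").sum()

    # Assumed fixed behaviour: an empty float column indexed by an
    # empty integer index named "id".
    expected = pd.DataFrame(
        dtype="float64",
        columns=["value"],
        index=pd.Index([], dtype=np.int64, name="id"),
    )
    tm.assert_frame_equal(result, expected)
    return result


result = check_groupby_empty_multiindex_int_level()
```

On pandas 1.0.0 the call is expected to raise the ValueError reported above; on a version that includes the fix it should return the empty frame.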
