[BUG] reset_index on multi-column groupby results in an error #1801

Closed
skmatti opened this issue May 20, 2019 · 5 comments
Labels
bug Something isn't working Python Affects Python cuDF API.

skmatti commented May 20, 2019

Describe the bug
Invoking reset_index after performing a multi-column groupby operation, as shown in the script below, results in an error with cudf 0.7 but works fine with cudf 0.6. Single-column groupby works as expected with both 0.6 and 0.7.

Steps/Code to reproduce bug

import cudf
df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
print(df.groupby(['x', 'y']).sum().reset_index())

Output:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-f0aefd2147dc> in <module>
      1 import cudf
      2 df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
----> 3 print(df.groupby(['x', 'y']).sum().reset_index())

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in reset_index(self, drop)
    777             name = self.index.name or 'index'
    778             out = DataFrame()
--> 779             out[name] = self.index
    780             for c in self.columns:
    781                 out[c] = self[c]

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in __setitem__(self, name, col)
    286             self._cols[name] = self._prepare_series_for_add(col)
    287         else:
--> 288             self.add_column(name, col)
    289 
    290     def __delitem__(self, name):

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in add_column(self, name, data, forceindex)
    913         if isinstance(data, GeneratorType):
    914             data = Series(data)
--> 915         series = self._prepare_series_for_add(data, forceindex=forceindex)
    916         series.name = name
    917         self._cols[name] = series

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in _prepare_series_for_add(self, col, forceindex)
    884         # This won't handle 0 dimensional arrays which should be okay
    885         SCALAR = np.isscalar(col)
--> 886         series = Series(col) if not SCALAR else col
    887         self._sanitize_columns(series)
    888         series = self._sanitize_values(series, SCALAR)

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/series.py in __init__(self, data, index, name, nan_as_null, dtype)
     85         if index is not None and not isinstance(index, Index):
     86             index = as_index(index)
---> 87         assert isinstance(data, columnops.TypedColumnBase)
     88         self._column = data
     89         self._index = RangeIndex(len(data)) if index is None else index

AssertionError:

Environment details:
cudf 0.7

skmatti added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels May 20, 2019
Member

ayushdg commented May 20, 2019

Hey @skmatti, this is probably due to multi-index support in cudf. Hopefully this will be resolved after #1740.

As a workaround, adding as_index=False, as in print(df.groupby(['x', 'y'], as_index=False).sum().reset_index()), should serve your purpose by giving the multi-column output you want.
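A minimal sketch of what as_index=False changes, using pandas as a CPU stand-in because the thread treats pandas as the reference behavior. That cudf returns the same layout is an assumption here, not something this snippet verifies.

```python
import pandas as pd

# Same data as the reproducer, built with pandas instead of cudf (assumption:
# cudf mirrors pandas for this call).
df = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [1, 1, 0, 0], 'z': [1, 2, 4, 3]})

# With as_index=False the grouping keys stay ordinary columns, so the result
# has a plain default index rather than a MultiIndex.
out = df.groupby(['x', 'y'], as_index=False).sum()

print(list(out.columns))
print(type(out.index).__name__)
```

In pandas, at least, the grouping keys already come back as regular columns here, so no further reset_index call is needed to recover them.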

Member

beckernick commented May 22, 2019

This is not resolved by #1740. The current issue is that the MultiIndex does not have a name attribute, which causes reset_index to fail.

import cudf
df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
print(df.groupby(['x', 'y']).sum().reset_index())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-f0aefd2147dc> in <module>
      1 import cudf
      2 df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
----> 3 print(df.groupby(['x', 'y']).sum().reset_index())

/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.8.0a1+120.gff317270.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/dataframe.py in reset_index(self, drop)
    777     def reset_index(self, drop=False):
    778         if not drop:
--> 779             name = self.index.name or 'index'
    780             out = DataFrame()
    781             out[name] = self.index

AttributeError: 'MultiIndex' object has no attribute 'name'

Some potential solutions could include tweaking the logic of reset_index to check whether self.index has a name attribute before relying on self.index.name being None or set, creating branching logic for MultiIndexes, or allowing a name attribute in the MultiIndex.
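The first option, checking for the attribute before using it, could look roughly like the sketch below. The two index classes and the helper index_column_name are hypothetical stand-ins written for illustration; they are not cudf classes, and this is not the fix that was eventually merged.

```python
class NamelessIndex:
    """Hypothetical stand-in for an index type with no `name` attribute."""


class NamedIndex:
    """Hypothetical stand-in for an index type that carries a `name`."""

    def __init__(self, name=None):
        self.name = name


def index_column_name(index, default='index'):
    # getattr with a fallback avoids the AttributeError seen in the traceback;
    # `or default` then covers both a missing attribute and a name of None.
    return getattr(index, 'name', None) or default


print(index_column_name(NamelessIndex()))
print(index_column_name(NamedIndex('key')))
```

This only removes the AttributeError; it does not decide how a multi-level index should be turned into columns.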

pandas allows MultiIndexes to have names, so that's probably the best solution.

import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
x = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
x.name = '3'
print(x.name)
3

beckernick added the Python (Affects Python cuDF API.) label and removed the Needs Triage (Need team to review and classify) label May 22, 2019
beckernick changed the title from "[BUG]reset_index on multi-column groupby results in an error" to "[BUG] reset_index on multi-column groupby results in an error" May 22, 2019
Member

ayushdg commented May 22, 2019

Hey @beckernick, makes sense. After solving the name issue, both this issue and #1740 will eventually run into the same problem here:
781 out[name] = self.index

where we cannot assign a MultiIndex as a single column.

When reset_index is called on a DataFrame with a MultiIndex, pandas converts the index into multiple columns in the output.
One solution could be to special-case a MultiIndex and then iterate through all the columns in its codes, assigning a new column to out for each column in codes.
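The special-case idea can be sketched with pandas objects, which stand in for cudf here. The helper reset_multiindex is hypothetical and is not the implementation merged in #1542. It uses get_level_values, which expands each level's codes into the actual key values, and the level_N fallback name for unnamed levels is an assumption made for this sketch.

```python
import pandas as pd


def reset_multiindex(df):
    """Hypothetical helper: emit one output column per MultiIndex level."""
    out = pd.DataFrame(index=pd.RangeIndex(len(df)))
    for i, level_name in enumerate(df.index.names):
        # Unnamed levels get a positional fallback name (assumption).
        name = level_name if level_name is not None else f'level_{i}'
        # get_level_values maps this level's codes back to its key values.
        out[name] = df.index.get_level_values(i).to_numpy()
    for c in df.columns:
        # Copy the data columns across, dropping the old index alignment.
        out[c] = df[c].to_numpy()
    return out


df = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [1, 1, 0, 0], 'z': [1, 2, 4, 3]})
grouped = df.groupby(['x', 'y']).sum()  # indexed by a two-level MultiIndex
flat = reset_multiindex(grouped)
print(flat)
```

On this input the result has the same columns and values as pandas' own grouped.reset_index(), which is the behavior the thread is aiming for.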

Contributor

thomcom commented Jun 14, 2019

This is fixed by #1542

@kkraus14
Collaborator

Fixed by #1542
