[BUG] reset_index on multi-column groupby results in an error #1801

skmatti · 2019-05-20T21:02:21Z

Describe the bug
Calling reset_index after a multi-column groupby, as shown in the script below, raises an error with cudf 0.7 but works fine with cudf 0.6. A single-column groupby works as expected with both 0.6 and 0.7.

Steps/Code to reproduce bug

import cudf
df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
print(df.groupby(['x', 'y']).sum().reset_index())

Output:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-f0aefd2147dc> in <module>
      1 import cudf
      2 df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
----> 3 print(df.groupby(['x', 'y']).sum().reset_index())

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in reset_index(self, drop)
    777             name = self.index.name or 'index'
    778             out = DataFrame()
--> 779             out[name] = self.index
    780             for c in self.columns:
    781                 out[c] = self[c]

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in __setitem__(self, name, col)
    286             self._cols[name] = self._prepare_series_for_add(col)
    287         else:
--> 288             self.add_column(name, col)
    289 
    290     def __delitem__(self, name):

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in add_column(self, name, data, forceindex)
    913         if isinstance(data, GeneratorType):
    914             data = Series(data)
--> 915         series = self._prepare_series_for_add(data, forceindex=forceindex)
    916         series.name = name
    917         self._cols[name] = series

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/dataframe.py in _prepare_series_for_add(self, col, forceindex)
    884         # This won't handle 0 dimensional arrays which should be okay
    885         SCALAR = np.isscalar(col)
--> 886         series = Series(col) if not SCALAR else col
    887         self._sanitize_columns(series)
    888         series = self._sanitize_values(series, SCALAR)

~/anaconda3/envs/cudf7/lib/python3.6/site-packages/cudf/dataframe/series.py in __init__(self, data, index, name, nan_as_null, dtype)
     85         if index is not None and not isinstance(index, Index):
     86             index = as_index(index)
---> 87         assert isinstance(data, columnops.TypedColumnBase)
     88         self._column = data
     89         self._index = RangeIndex(len(data)) if index is None else index

AssertionError:

Environment details:
cudf 0.7


ayushdg · 2019-05-20T21:45:16Z

Hey @skmatti, this is probably due to multi-index support in cudf. Hopefully this will be resolved after #1740.

As a workaround, adding as_index=False should serve your purpose and give the multi-column output you want:

print(df.groupby(['x', 'y'], as_index=False).sum().reset_index())

beckernick · 2019-05-22T01:53:29Z

This is not resolved by #1740. The current issue is that the MultiIndex does not have a name attribute, which causes reset_index to fail.

import cudf
df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
print(df.groupby(['x', 'y']).sum().reset_index())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-f0aefd2147dc> in <module>
      1 import cudf
      2 df = cudf.DataFrame({'x':[0, 0, 1, 1], 'y':[1, 1, 0, 0], 'z': [1, 2, 4, 3]})
----> 3 print(df.groupby(['x', 'y']).sum().reset_index())

/conda/envs/rapids/lib/python3.7/site-packages/cudf-0.8.0a1+120.gff317270.dirty-py3.7-linux-x86_64.egg/cudf/dataframe/dataframe.py in reset_index(self, drop)
    777     def reset_index(self, drop=False):
    778         if not drop:
--> 779             name = self.index.name or 'index'
    780             out = DataFrame()
    781             out[name] = self.index

AttributeError: 'MultiIndex' object has no attribute 'name'

Some potential solutions: tweak the logic of reset_index so that it checks whether self.index.name exists as an attribute before relying on it being None or set, add branching logic for MultiIndexes, or allow a name attribute on the MultiIndex.

pandas allows MultiIndexes to have names, so that's probably the best solution.

import pandas as pd
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
x = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
x.name = '3'
print(x.name)
3
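For illustration, the first option (not relying on the index having a name attribute) might look roughly like the sketch below. FlatIndex, NamelessMultiIndex and index_column_name are hypothetical stand-ins written in plain Python so the snippet runs without a GPU; they are not cudf classes or cudf's actual code. It only covers the name lookup, not how the index values end up in the output.

```python
class FlatIndex:
    """Hypothetical stand-in for a single-level index that carries a name."""

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name


class NamelessMultiIndex:
    """Hypothetical stand-in for an index type with no `name` attribute."""

    def __init__(self, values):
        self.values = list(values)


def index_column_name(index, default="index"):
    # getattr with a default avoids the AttributeError when `name` is missing;
    # `or default` also covers the case where `name` exists but is None.
    return getattr(index, "name", None) or default


print(index_column_name(FlatIndex([0, 1], name="x")))            # x
print(index_column_name(FlatIndex([0, 1])))                      # index
print(index_column_name(NamelessMultiIndex([(0, 1), (1, 0)])))   # index
```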

ayushdg · 2019-05-22T02:54:14Z

Hey @beckernick, makes sense. After solving the name issue, both this issue and #1740 will eventually run into the same problem here:
781 out[name] = self.index

where we cannot assign a multi-index as a column.

When reset_index is called on a DataFrame with a multi-index, pandas converts the index into multiple columns.
One solution could be a special case that checks for a multi-index, then iterates through the columns in the multi-index's codes and assigns a new column to out for each one?
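A rough sketch of that idea, assuming a pandas-style layout where each level stores its unique values (levels) plus integer positions into them (codes). The function multiindex_to_columns and the plain lists and dicts are hypothetical stand-ins, not cudf internals. The data mirrors the grouped result from the reproducer, where the groups are (0, 1) and (1, 0).

```python
def multiindex_to_columns(names, levels, codes):
    """Expand a levels/codes representation into one column per level."""
    columns = {}
    for name, level, level_codes in zip(names, levels, codes):
        # Each code is a position into that level's unique values.
        columns[name] = [level[c] for c in level_codes]
    return columns


# Assumed representation of the index of df.groupby(['x', 'y']).sum()
names = ["x", "y"]
levels = [[0, 1], [0, 1]]   # unique values per level
codes = [[0, 1], [1, 0]]    # row-wise positions into each level

# Index columns first, then the aggregated data column z (1+2 and 4+3).
out = {**multiindex_to_columns(names, levels, codes), "z": [3, 7]}
print(out)  # {'x': [0, 1], 'y': [1, 0], 'z': [3, 7]}
```

In this small example the codes happen to equal the values, but in general each code would need to be mapped through its level before being assigned as a column.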

thomcom · 2019-06-14T13:18:23Z

This is fixed by #1542

kkraus14 · 2019-08-14T15:11:32Z

Fixed by #1542

skmatti added the Needs Triage and bug labels May 20, 2019

beckernick added the Python label and removed the Needs Triage label May 22, 2019

beckernick changed the title ~~[BUG]reset_index on multi-column groupby results in an error~~ [BUG] reset_index on multi-column groupby results in an error May 22, 2019

thomcom self-assigned this May 24, 2019

thomcom mentioned this issue Jun 12, 2019 in [REVIEW] Python method and bindings for to_csv #1542 (merged)

kkraus14 closed this as completed Aug 14, 2019
