[BUG] reset_index on the MultiIndex resulting from calling stack loses index data #3583

beckernick · 2019-12-11T16:03:35Z

Calling reset_index on the multi-index produced by a stack call doesn't return the data from all levels of the multi-index. This makes it hard to associate data in the stacked dataframe with its original grouping (e.g., which row it came from).

import cudf
import nvstrings

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

df = cudf.DataFrame({'a':['a','b','b','c','a','d','d'], 'b':[0,0,0,0,1,1,1]})
pdf = df.to_pandas()

Note that calling reset_index after the groupby aggregation gives back the values from both levels of the multi-index.

print(df.groupby(['a', 'b']).size().reset_index())
   a  b  0
0  a  0  1
1  a  1  1
2  b  0  2
3  c  0  1
4  d  1  2

Calling stack also creates a multi-index:

print(s.str.split(' ').stack())
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object

But reset_index only gives back the innermost level; the outer level (the original row) is lost.

print(s.str.split(' ').stack().reset_index())
         0  1
0     this  0
1       is  1
2       an  2
3  example  3
4     this  0
5      one  1
6       is  2
7      too  3

We should get the following, as pandas does (ignore the expand argument):

print(ps.str.split(' ', expand=True).stack().reset_index())
   level_0  level_1        0
0        0        0     this
1        0        1       is
2        0        2       an
3        0        3  example
4        1        0     this
5        1        1      one
6        1        2       is
7        1        3      too
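
To illustrate why the outer level matters, here is a minimal pandas-only sketch (no GPU needed) that uses the level_0 column from the expected output to map each token back to the row it came from. It relies only on the default column names pandas produces above; the variable names are illustrative.

```python
import pandas as pd

ps = pd.Series(['this is an example', 'this one is too'])

# Split into one column per token, then stack into a two-level index:
# level 0 is the original row, level 1 is the token position.
stacked = ps.str.split(' ', expand=True).stack()

# reset_index materializes both levels as columns (level_0, level_1)
# next to the value column, which pandas names 0 for an unnamed Series.
flat = stacked.reset_index()

# With level_0 available, tokens can be regrouped by their source row.
rebuilt = flat.groupby('level_0')[0].agg(' '.join)
print(rebuilt.tolist())
```

Without the level_0 column, as in the cuDF output above, this regrouping is not possible.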


beckernick · 2020-05-26T19:42:09Z

As of commit 65ba848 (2020-05-25), we can no longer call stack on a string column, because the updated interleave_columns requires fixed-width types.

import cudf
s = cudf.Series(['this is an example', 'this one is too'])
print(s.str.split(' ').stack())
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-11cfab2f71af> in <module>
      1 import cudf
      2 s = cudf.Series(['this is an example', 'this one is too'])
----> 3 print(s.str.split(' ').stack())

/raid/nicholasb/miniconda3/envs/rapids--20200526-cuda102-1003/lib/python3.7/site-packages/cudf/core/dataframe.py in stack(self, level, dropna)
   4723         )
   4724 
-> 4725         data_col = libcudf.reshape.interleave_columns(homogenized)
   4726 
   4727         result = Series(data=data_col, index=new_index)

cudf/_lib/reshape.pyx in cudf._lib.reshape.interleave_columns()

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1590463697876/work/cpp/src/reshape/interleave_columns.cu:31: Only fixed-width types are supported in interleave_columns.
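
For context on why stack depends on interleave_columns: the traceback shows stack passing the homogenized columns to interleave_columns and pairing the result with a new index. The following is a pure-Python model of that layout, assuming the usual row-major interleave (a1, b1, a2, b2, ...). It is a sketch for illustration only, not libcudf's implementation, and stack_model is a hypothetical helper name.

```python
from itertools import chain


def interleave_columns(columns):
    """Interleave equal-length columns row by row: [a1, b1, a2, b2, ...]."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return list(chain.from_iterable(zip(*columns)))


def stack_model(columns):
    """Hypothetical helper: return (index tuples, values) for a stacked frame."""
    n_rows = len(columns[0]) if columns else 0
    values = interleave_columns(columns)
    # Outer level is the row number, inner level is the column number.
    index = [(row, col) for row in range(n_rows) for col in range(len(columns))]
    return index, values


# One list per column of the split result (token position 0, 1, 2, 3).
split_columns = [
    ['this', 'this'],
    ['is', 'one'],
    ['an', 'is'],
    ['example', 'too'],
]
index, values = stack_model(split_columns)
for key, value in zip(index, values):
    print(key, value)
```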

beckernick · 2021-02-10T15:23:42Z

As of 2021-02-08, we get the following behavior with stack, multiindex, and reset_index:

import pandas as pd
import cudf

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

print(ps.str.split(' ', expand=True).stack())
print(s.str.split(' ', expand=True).stack())
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object
0  1
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object

The MultiIndex in cudf is named, while in pandas it is not.

print(ps.str.split(' ', expand=True).stack().index)
print(s.str.split(' ', expand=True).stack().index)
MultiIndex([(0, 0),
            (0, 1),
            (0, 2),
            (0, 3),
            (1, 0),
            (1, 1),
            (1, 2),
            (1, 3)],
           )
MultiIndex([(0, 0),
            (0, 1),
            (0, 2),
            (0, 3),
            (1, 0),
            (1, 1),
            (1, 2),
            (1, 3)],
           names=[0, 1])
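
The pandas side of this comparison can be checked directly. The sketch below is pandas-only and uses illustrative names (row, position, token) that neither library assigns by itself. It shows that the stacked index is unnamed in pandas, and that explicit level and value names carry through reset_index as distinct columns.

```python
import pandas as pd

ps = pd.Series(['this is an example', 'this one is too'])
stacked = ps.str.split(' ', expand=True).stack()

# pandas leaves both levels of the stacked index unnamed.
pandas_names = list(stacked.index.names)
print(pandas_names)

# Illustrative names: naming the levels and the value column explicitly
# keeps every column produced by reset_index distinct.
named = stacked.rename_axis(['row', 'position'])
flat = named.reset_index(name='token')
print(flat.columns.tolist())
```

Whether the same calls avoid the problem in the cuDF build discussed here is not verified in this thread.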

This may be related to the following error with stack + reset_index:

import pandas as pd
import cudf

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

print(ps.str.split(' ', expand=True).stack().reset_index())
print(s.str.split(' ', expand=True).stack().reset_index())
   level_0  level_1        0
0        0        0     this
1        0        1       is
2        0        2       an
3        0        3  example
4        1        0     this
5        1        1      one
6        1        2       is
7        1        3      too
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-c353c63c6102> in <module>
      6 
      7 print(ps.str.split(' ', expand=True).stack().reset_index())
----> 8 print(s.str.split(' ', expand=True).stack().reset_index())

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/series.py in reset_index(self, drop, inplace)
    658                     "to create a DataFrame"
    659                 )
--> 660             return self.to_frame().reset_index(drop=drop)
    661         else:
    662             if inplace is True:

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   2970                 reversed(names), reversed(index_columns)
   2971             ):
-> 2972                 result.insert(0, name, index_column)
   2973         result.index = RangeIndex(len(self))
   2974         if inplace:

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in insert(self, loc, name, value)
   3058         num_cols = len(self._data)
   3059         if name in self._data:
-> 3060             raise NameError(f"duplicated column name {name}")
   3061 
   3062         if loc < 0:

NameError: duplicated column name 0

conda list | grep "rapids\|blazing\|dask\|distr\|pandas\|numpy\|arrow"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210208:
arrow-cpp                 1.0.1           py37h2318771_14_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
blazingsql                0.18.0a0                 pypi_0    pypi
cudf                      0.18.0a210208   cuda_10.2_py37_gda0e794749_249    rapidsai-nightly
cuml                      0.18.0a210208   cuda10.2_py37_gc1a744776_121    rapidsai-nightly
dask                      2021.2.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.2.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a210208           py37_58    rapidsai-nightly
dask-cudf                 0.18.0a210208   py37_gda0e794749_249    rapidsai-nightly
distributed               2021.2.0         py37h89c1867_0    conda-forge
libcudf                   0.18.0a210208   cuda10.2_gda0e794749_249    rapidsai-nightly
libcuml                   0.18.0a210208   cuda10.2_gc1a744776_121    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    rapidsai-nightly
librmm                    0.18.0a210208   cuda10.2_g729918c_33    rapidsai-nightly
numpy                     1.19.5           py37haa41c4c_1    conda-forge
pandas                    1.1.5            py37hdc94413_0    conda-forge
pyarrow                   1.0.1           py37hbeecfa9_14_cuda    conda-forge
rmm                       0.18.0a210208   cuda_10.2_py37_g729918c_33    rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.18.0a210208   py37_gcd9efd3_15    rapidsai-nightly

kkraus14 · 2021-04-22T00:25:04Z

@beckernick any idea if this is still an issue?

shwina · 2021-06-02T21:05:36Z

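This is still an issue. It happens because s.reset_index() in cuDF is implemented as s.to_frame().reset_index().

s.to_frame() for an unnamed Series unfortunately uses 0 as the column name (both in pandas and cuDF). That column name then collides with the index level named 0 when the levels are inserted as columns, which raises the NameError shown above: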

In [24]: s = cudf.Series([1, 2, 3])

In [25]: s
Out[25]:
0    1
1    2
2    3
dtype: int64

In [26]: s.to_frame()
Out[26]:
   0
0  1
1  2
2  3
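
The collision can be modeled without a GPU. Below is a toy pure-Python sketch; FrameModel and reset_index_model are hypothetical stand-ins written for this illustration, not cuDF code. They mirror the steps visible in the traceback: the unnamed Series becomes a frame with column 0, then the index levels are inserted in reverse order at position 0.

```python
class FrameModel:
    """Toy stand-in for a DataFrame: ordered mapping of column name -> values."""

    def __init__(self, columns=None):
        self.columns = dict(columns or {})

    def insert(self, loc, name, values):
        # Same guard as in the traceback: duplicate names are rejected.
        if name in self.columns:
            raise NameError(f"duplicated column name {name}")
        items = list(self.columns.items())
        items.insert(loc, (name, values))
        self.columns = dict(items)


def reset_index_model(values, index_levels, index_names, series_name=None):
    # An unnamed Series turns into a one-column frame whose column is named 0.
    frame = FrameModel({0 if series_name is None else series_name: values})
    # Unnamed levels fall back to level_<i>, as in the pandas output.
    names = [f"level_{i}" if n is None else n for i, n in enumerate(index_names)]
    for name, level in zip(reversed(names), reversed(index_levels)):
        frame.insert(0, name, level)
    return frame


values = ['this', 'is', 'this', 'one']
levels = [[0, 0, 1, 1], [0, 1, 0, 1]]

# Unnamed levels (the pandas case): no collision.
ok = reset_index_model(values, levels, [None, None])

# Levels named 0 and 1 (the cuDF case): level name 0 hits the value column 0.
try:
    reset_index_model(values, levels, [0, 1])
    collided, message = False, ""
except NameError as exc:
    collided, message = True, str(exc)

# Giving the Series a name first sidesteps the clash in this model.
named = reset_index_model(values, levels, [0, 1], series_name='token')
print(list(ok.columns), collided, message, list(named.columns))
```

In this model, either leaving the levels unnamed or naming the Series removes the duplicate.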

galipremsagar · 2021-07-16T03:54:29Z

Fixes to the 3 APIs causing issues with MultiIndex are ready: #8753

Fixes: #3583

This PR contains fixes for:
- [x] `stack`: Where the MultiIndex names are not being assigned correctly in `from_table` call.
- [x] `dropna`: Where the MultiIndex names are not being preserved after a `libcudf` API call.
- [x] `reset_index`: Where the MultiIndex level names are not being materialized correctly when the index is reset.

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Michael Wang (https://github.com/isVoid)
- Ashwin Srinath (https://github.com/shwina)

URL: #8753

beckernick added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Dec 11, 2019

kkraus14 added the Python (Affects Python cuDF API.) label and removed the Needs Triage (Need team to review and classify) label Jan 21, 2020

galipremsagar self-assigned this Jul 16, 2021

galipremsagar mentioned this issue Jul 16, 2021 in the merged pull request "[REVIEW] Fix issues with MultiIndex in dropna, stack & reset_index" #8753

rapids-bot bot closed this as completed in #8753 Jul 16, 2021
