[BUG] reset_index on the MultiIndex resulting from calling stack loses index data #3583

Closed
beckernick opened this issue Dec 11, 2019 · 5 comments · Fixed by #8753
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@beckernick
Member

beckernick commented Dec 11, 2019

Calling reset_index on the multi-index produced by a stack call doesn't return the data from all levels of the multi-index. This makes it hard to associate data in the stacked dataframe with its original grouping (e.g., which row it came from).

import cudf
import nvstrings

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

df = cudf.DataFrame({'a':['a','b','b','c','a','d','d'], 'b':[0,0,0,0,1,1,1]})
pdf = df.to_pandas()

Note that calling reset_index after the groupby aggregation gives back the values from both levels of the multi-index.

print(df.groupby(['a', 'b']).size().reset_index())
   a  b  0
0  a  0  1
1  a  1  1
2  b  0  2
3  c  0  1
4  d  1  2

Calling stack also creates a multi-index:

print(s.str.split(' ').stack())
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object

But calling reset_index returns only the innermost level's data; the outer level (the original row) is lost.

print(s.str.split(' ').stack().reset_index())
         0  1
0     this  0
1       is  1
2       an  2
3  example  3
4     this  0
5      one  1
6       is  2
7      too  3

We should get this (ignore the expand argument):

print(ps.str.split(' ', expand=True).stack().reset_index())
   level_0  level_1        0
0        0        0     this
1        0        1       is
2        0        2       an
3        0        3  example
4        1        0     this
5        1        1      one
6        1        2       is
7        1        3      too
@beckernick beckernick added bug Something isn't working Needs Triage Need team to review and classify labels Dec 11, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 21, 2020
@beckernick
Member Author

As of commit 65ba848 (2020-05-25), we can no longer call stack on a string column because the updated interleave_columns requires fixed-width types.

import cudf
s = cudf.Series(['this is an example', 'this one is too'])
print(s.str.split(' ').stack())
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-11cfab2f71af> in <module>
      1 import cudf
      2 s = cudf.Series(['this is an example', 'this one is too'])
----> 3 print(s.str.split(' ').stack())

/raid/nicholasb/miniconda3/envs/rapids--20200526-cuda102-1003/lib/python3.7/site-packages/cudf/core/dataframe.py in stack(self, level, dropna)
   4723         )
   4724 
-> 4725         data_col = libcudf.reshape.interleave_columns(homogenized)
   4726 
   4727         result = Series(data=data_col, index=new_index)

cudf/_lib/reshape.pyx in cudf._lib.reshape.interleave_columns()

RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1590463697876/work/cpp/src/reshape/interleave_columns.cu:31: Only fixed-width types are supported in interleave_columns.

@beckernick
Member Author

As of 2021-02-08, we get the following behavior with stack, multiindex, and reset_index:

import pandas as pd
import cudf

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

print(ps.str.split(' ', expand=True).stack())
print(s.str.split(' ', expand=True).stack())
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object
0  1
0  0       this
   1         is
   2         an
   3    example
1  0       this
   1        one
   2         is
   3        too
dtype: object

The MultiIndex levels in cudf are named (0 and 1), while in pandas they are unnamed.

print(ps.str.split(' ', expand=True).stack().index)
print(s.str.split(' ', expand=True).stack().index)
MultiIndex([(0, 0),
            (0, 1),
            (0, 2),
            (0, 3),
            (1, 0),
            (1, 1),
            (1, 2),
            (1, 3)],
           )
MultiIndex([(0, 0),
            (0, 1),
            (0, 2),
            (0, 3),
            (1, 0),
            (1, 1),
            (1, 2),
            (1, 3)],
           names=[0, 1])
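
The pandas side of this comparison can be checked without a GPU. The following is a minimal sketch using pandas only (no cuDF); it shows that the stacked index has unnamed levels and that reset_index then falls back to level_0 / level_1 column names. The variable names are illustrative.

```python
import pandas as pd

ps = pd.Series(["this is an example", "this one is too"])
stacked = ps.str.split(" ", expand=True).stack()

# pandas leaves both MultiIndex levels unnamed after stack().
names = list(stacked.index.names)

# With unnamed levels, reset_index generates level_N column names,
# and the unnamed Series values land in a column called 0.
columns = list(stacked.reset_index().columns)

print(names)
print(columns)
```

Because the pandas levels carry no names, nothing can clash with the value column named 0.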

This may be related to the following error with stack + reset_index:

import pandas as pd
import cudf

s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()

print(ps.str.split(' ', expand=True).stack().reset_index())
print(s.str.split(' ', expand=True).stack().reset_index())
   level_0  level_1        0
0        0        0     this
1        0        1       is
2        0        2       an
3        0        3  example
4        1        0     this
5        1        1      one
6        1        2       is
7        1        3      too
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-34-c353c63c6102> in <module>
      6 
      7 print(ps.str.split(' ', expand=True).stack().reset_index())
----> 8 print(s.str.split(' ', expand=True).stack().reset_index())

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/series.py in reset_index(self, drop, inplace)
    658                     "to create a DataFrame"
    659                 )
--> 660             return self.to_frame().reset_index(drop=drop)
    661         else:
    662             if inplace is True:

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   2970                 reversed(names), reversed(index_columns)
   2971             ):
-> 2972                 result.insert(0, name, index_column)
   2973         result.index = RangeIndex(len(self))
   2974         if inplace:

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/contextlib.py in inner(*args, **kwds)
     72         def inner(*args, **kwds):
     73             with self._recreate_cm():
---> 74                 return func(*args, **kwds)
     75         return inner
     76 

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in insert(self, loc, name, value)
   3058         num_cols = len(self._data)
   3059         if name in self._data:
-> 3060             raise NameError(f"duplicated column name {name}")
   3061 
   3062         if loc < 0:

NameError: duplicated column name 0
conda list | grep "rapids\|blazing\|dask\|distr\|pandas\|numpy\|arrow"
# packages in environment at /raid/nicholasb/miniconda3/envs/rapids-gpubdb-20210208:
arrow-cpp                 1.0.1           py37h2318771_14_cuda    conda-forge
arrow-cpp-proc            3.0.0                      cuda    conda-forge
blazingsql                0.18.0a0                 pypi_0    pypi
cudf                      0.18.0a210208   cuda_10.2_py37_gda0e794749_249    rapidsai-nightly
cuml                      0.18.0a210208   cuda10.2_py37_gc1a744776_121    rapidsai-nightly
dask                      2021.2.0           pyhd8ed1ab_0    conda-forge
dask-core                 2021.2.0           pyhd8ed1ab_0    conda-forge
dask-cuda                 0.18.0a210208           py37_58    rapidsai-nightly
dask-cudf                 0.18.0a210208   py37_gda0e794749_249    rapidsai-nightly
distributed               2021.2.0         py37h89c1867_0    conda-forge
libcudf                   0.18.0a210208   cuda10.2_gda0e794749_249    rapidsai-nightly
libcuml                   0.18.0a210208   cuda10.2_gc1a744776_121    rapidsai-nightly
libcumlprims              0.18.0a201203   cuda10.2_gff080f3_0    rapidsai-nightly
librmm                    0.18.0a210208   cuda10.2_g729918c_33    rapidsai-nightly
numpy                     1.19.5           py37haa41c4c_1    conda-forge
pandas                    1.1.5            py37hdc94413_0    conda-forge
pyarrow                   1.0.1           py37hbeecfa9_14_cuda    conda-forge
rmm                       0.18.0a210208   cuda_10.2_py37_g729918c_33    rapidsai-nightly
ucx                       1.9.0+gcd9efd3       cuda10.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.18.0a210208   py37_gcd9efd3_15    rapidsai-nightly

@kkraus14
Collaborator

@beckernick any idea if this is still an issue?

@shwina
Contributor

shwina commented Jun 2, 2021

This is still an issue. It happens because s.reset_index() in cuDF does s.to_frame().reset_index().

s.to_frame() for an unnamed Series unfortunately uses 0 as the column name (in both pandas and cuDF), which collides with the index level that stack named 0:

In [24]: s = cudf.Series([1, 2, 3])

In [25]: s
Out[25]:
0    1
1    2
2    3
dtype: int64

In [26]: s.to_frame()
Out[26]:
   0
0  1
1  2
2  3
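
The collision described above can be reproduced in plain pandas by naming the index levels 0 and 1 by hand. This is only a sketch: the hand-built Series is a hypothetical stand-in for the stacked cuDF result, pandas raises ValueError where cuDF raised NameError, and the rename_axis step is one possible user-side workaround, not the fix that was merged.

```python
import pandas as pd

# Hypothetical stand-in for the stacked cuDF result: a Series whose
# MultiIndex levels are named 0 and 1.
index = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=[0, 1])
stacked = pd.Series(["this", "is", "this", "one"], index=index)

frame = stacked.to_frame()  # unnamed Series -> single column named 0

try:
    frame.reset_index()     # index level 0 clashes with column 0
    collided = False
except ValueError:
    collided = True

# Possible workaround: give the levels non-conflicting names first.
fixed = stacked.rename_axis(["level_0", "level_1"]).reset_index()

print(collided)
print(list(fixed.columns))
```

Renaming the levels before resetting keeps both index levels as columns, which is the shape the original report asked for.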

@galipremsagar
Contributor

Fixes for the three APIs causing issues with MultiIndex are ready: #8753

rapids-bot bot pushed a commit that referenced this issue Jul 16, 2021

Fixes: #3583 

This PR contains fixes for :

- [x] `stack`: Where the MultiIndex names are not being assigned correctly in `from_table` call.
- [x] `dropna`: Where the MultiIndex names are not being preserved after a `libcudf` API call.
- [x] `reset_index`: Where the MultiIndex level names are not being materialized correctly when the index is reset.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8753