[BUG] reset_index on the MultiIndex resulting from calling stack loses index data #3583
In commit 65ba848 (2020-05-25), the following call no longer works:
import cudf
s = cudf.Series(['this is an example', 'this one is too'])
print(s.str.split(' ').stack())
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-12-11cfab2f71af> in <module>
1 import cudf
2 s = cudf.Series(['this is an example', 'this one is too'])
----> 3 print(s.str.split(' ').stack())
/raid/nicholasb/miniconda3/envs/rapids--20200526-cuda102-1003/lib/python3.7/site-packages/cudf/core/dataframe.py in stack(self, level, dropna)
4723 )
4724
-> 4725 data_col = libcudf.reshape.interleave_columns(homogenized)
4726
4727 result = Series(data=data_col, index=new_index)
cudf/_lib/reshape.pyx in cudf._lib.reshape.interleave_columns()
RuntimeError: cuDF failure at: /conda/conda-bld/libcudf_1590463697876/work/cpp/src/reshape/interleave_columns.cu:31: Only fixed-width types are supported in interleave_columns.
As of 2021-02-08, we get the following behavior with stack, MultiIndex, and reset_index:
import pandas as pd
import cudf
s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()
print(ps.str.split(' ', expand=True).stack())
print(s.str.split(' ', expand=True).stack())
0 0 this
1 is
2 an
3 example
1 0 this
1 one
2 is
3 too
dtype: object
0 1
0 0 this
1 is
2 an
3 example
1 0 this
1 one
2 is
3 too
dtype: object
The MultiIndex in cudf is named, while in pandas it is not.
print(ps.str.split(' ', expand=True).stack().index)
print(s.str.split(' ', expand=True).stack().index)
MultiIndex([(0, 0),
(0, 1),
(0, 2),
(0, 3),
(1, 0),
(1, 1),
(1, 2),
(1, 3)],
)
MultiIndex([(0, 0),
(0, 1),
(0, 2),
(0, 3),
(1, 0),
(1, 1),
(1, 2),
(1, 3)],
names=[0, 1])
Perhaps this is related to the following error with stack + reset_index:
import pandas as pd
import cudf
s = cudf.Series(['this is an example', 'this one is too'])
ps = s.to_pandas()
print(ps.str.split(' ', expand=True).stack().reset_index())
print(s.str.split(' ', expand=True).stack().reset_index())
level_0 level_1 0
0 0 0 this
1 0 1 is
2 0 2 an
3 0 3 example
4 1 0 this
5 1 1 one
6 1 2 is
7 1 3 too
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-34-c353c63c6102> in <module>
6
7 print(ps.str.split(' ', expand=True).stack().reset_index())
----> 8 print(s.str.split(' ', expand=True).stack().reset_index())
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/series.py in reset_index(self, drop, inplace)
658 "to create a DataFrame"
659 )
--> 660 return self.to_frame().reset_index(drop=drop)
661 else:
662 if inplace is True:
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in reset_index(self, level, drop, inplace, col_level, col_fill)
2970 reversed(names), reversed(index_columns)
2971 ):
-> 2972 result.insert(0, name, index_column)
2973 result.index = RangeIndex(len(self))
2974 if inplace:
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/contextlib.py in inner(*args, **kwds)
72 def inner(*args, **kwds):
73 with self._recreate_cm():
---> 74 return func(*args, **kwds)
75 return inner
76
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20210125/lib/python3.7/site-packages/cudf/core/dataframe.py in insert(self, loc, name, value)
3058 num_cols = len(self._data)
3059 if name in self._data:
-> 3060 raise NameError(f"duplicated column name {name}")
3061
3062 if loc < 0:
NameError: duplicated column name 0
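The name collision behind this error can be illustrated on the CPU with pandas alone. The sketch below is an assumption-laden reproduction, not cudf code: it manually assigns the level names [0, 1] that cudf produced in the output above, then checks which labels would clash once the unnamed Series is turned into a frame.

```python
import pandas as pd

ps = pd.Series(['this is an example', 'this one is too'])
stacked = ps.str.split(' ', expand=True).stack()

# pandas leaves the levels of the stacked MultiIndex unnamed
pandas_names = list(stacked.index.names)

# Mimic the level names cudf assigned in the output above: [0, 1]
named = stacked.copy()
named.index = named.index.set_names([0, 1])

# An unnamed Series becomes a frame whose single column is labelled 0
frame = named.to_frame()

# Labels used both as a column and as an index level name
collision = set(frame.columns) & set(frame.index.names)
print(collision)
```

Any label in `collision` is one that `reset_index` would have to insert as a column although a column with that label already exists.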
@beckernick any idea if this is still an issue?
This is still an issue. It happens because to_frame on an unnamed Series produces a column named 0:
In [24]: s = cudf.Series([1, 2, 3])
In [25]: s
Out[25]:
0 1
1 2
2 3
dtype: int64
In [26]: s.to_frame()
Out[26]:
0
0 1
1 2
2 3
Fixes for the 3 APIs causing issues with MultiIndex are ready: #8753
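pandas shows the same default label for an unnamed Series, so the behavior can be checked without a GPU. The snippet below is a pandas-only sketch; the name "value" is a hypothetical label chosen for illustration, and whether naming the Series avoids the cudf error in the affected versions is not verified here.

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # no name set
default_columns = list(s.to_frame().columns)

# Giving the Series a name changes the resulting column label
named = s.rename("value")  # "value" is a hypothetical name
named_columns = list(named.to_frame().columns)

print(s.name, default_columns, named_columns)
```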
Fixes: #3583
This PR contains fixes for:
- [x] `stack`: Where the MultiIndex names are not being assigned correctly in `from_table` call.
- [x] `dropna`: Where the MultiIndex names are not being preserved after a `libcudf` API call.
- [x] `reset_index`: Where the MultiIndex level names are not being materialized correctly when the index is reset.
Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- Michael Wang (https://github.com/isVoid)
- Ashwin Srinath (https://github.com/shwina)
URL: #8753
Calling reset_index to get the values from the MultiIndex resulting from a stack call doesn't return the data from all levels of the MultiIndex. This can make it hard to associate data in the stacked dataframe with its original grouping (e.g., what row it was in).
Note that calling reset_index after the groupby aggregation gives back the values from both levels of the MultiIndex.
Calling stack also creates a MultiIndex:
But we only get the more granular level data back when calling reset_index.
We should get this (ignore the expand argument):
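As a CPU-side illustration of the expected result, pandas returns both index levels as columns after stack followed by reset_index. This is a sketch using the same example strings as the comments in this thread, not necessarily the snippet from the original report.

```python
import pandas as pd

ps = pd.Series(['this is an example', 'this one is too'])
out = ps.str.split(' ', expand=True).stack().reset_index()

# Both levels come back: the original row (level_0) and the token position (level_1)
columns = list(out.columns)
rows = out['level_0'].tolist()
positions = out['level_1'].tolist()
print(out)
```

Here `level_0` identifies the original row each token came from, which is the grouping information the report says is lost.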