Add unstack() support for non-multiindexed dataframes #7054

Merged
merged 6 commits into from
Jan 6, 2021
36 changes: 31 additions & 5 deletions python/cudf/cudf/core/reshape.py
@@ -902,6 +902,11 @@ def unstack(df, level, fill_value=None):
Pivots the specified levels of the index labels of df to the innermost
levels of the columns labels of the result.

* If the index of ``df`` has multiple levels, returns a ``Dataframe`` with
specified level of the index pivoted to the column levels.
* If the index of ``df`` has single level, returns a ``Series`` with all
column levels pivoted to the index levels.

Parameters
----------
df : DataFrame
@@ -913,7 +918,7 @@ def unstack(df, level, fill_value=None):

Returns
-------
-    DataFrame with specified index levels pivoted to column levels
+    Series or DataFrame

Examples
--------
@@ -964,6 +969,21 @@ def unstack(df, level, fill_value=None):
a
1 5 <NA> 6 <NA> 7
2 <NA> 8 <NA> 9 <NA>

Unstacking single level index dataframe:

>>> df.unstack(['b', 'd']).unstack()
Contributor

I think this example is a little opaque. It's sometimes difficult to visualize exactly what the result of unstack should be for even a single level, and here I find it a little hard to connect the dots through the chained operation. I'd recommend an example that starts with a dataframe with a single index and shows the result of unstacking that dataframe into a series instead.
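
A minimal sketch of the kind of example suggested here, written against pandas rather than cuDF so it runs without a GPU. The frame, its labels, and the index name `row_index` are made up for illustration and are not taken from the PR; cuDF is assumed to mirror the pandas result, which is what the PR's tests compare against.

```python
import pandas as pd

# Hypothetical data: a DataFrame with a single-level (non-Multi) index.
df = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [11.0, 12.0]},
    index=pd.Index([0, 1], name="row_index"),
)

# With a single-level index, unstack() returns a Series. Its MultiIndex has
# the column labels as the outer level and the row labels as the inner level.
result = df.unstack()
print(result)
```

Starting from one frame and one call keeps the mapping from each cell to its (column label, row label) key easy to follow.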

b d a
c 1 a 1 5
2 <NA>
d 1 <NA>
2 8
2 b 1 6
2 <NA>
e 1 <NA>
2 9
3 a 1 7
2 <NA>
"""
if fill_value is not None:
raise NotImplementedError("fill_value is not supported.")
@@ -972,10 +992,16 @@ def unstack(df, level, fill_value=None):
return df
df = df.copy(deep=False)
     if not isinstance(df.index, cudf.MultiIndex):
-        raise NotImplementedError(
-            "Calling unstack() on a DataFrame without a MultiIndex "
-            "is not supported"
-        )
+        if isinstance(df, cudf.DataFrame):
+            res = df.T.stack(dropna=False)
Contributor

Does this pass the typecasting behavior off to transpose? Should we check the dtypes and possibly error here?

Contributor Author
@isVoid Dec 30, 2020

It seems like both transpose.pyx and libcudf::transpose.cu check whether all columns have the same datatype. A clear exception gets raised if the columns are of different types. Should we check again here?

Contributor

I would support checking here - imagining what happens here from the user perspective, if I get an error trying to unstack a cuDF dataframe, I might wonder why the transpose code is unhappy.

In general, I think we try to avoid letting libcudf itself serve an error to the user and favor a more surface-level Python error. Usually, when I've managed to actually manifest a libcudf error from the Python API, it means something is very wrong.
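
A rough sketch of what such a surface-level check could look like. The helper name `check_homogeneous_dtypes`, the choice of `ValueError`, and the message text are all assumptions for illustration, not the check that was actually added to cuDF. pandas stands in for cuDF so the snippet runs without a GPU.

```python
import pandas as pd


def check_homogeneous_dtypes(df):
    """Hypothetical pre-check run before transposing inside unstack()."""
    dtypes = {str(dtype) for dtype in df.dtypes}
    if len(dtypes) > 1:
        # Raise at the Python level so the user sees an unstack-related
        # message instead of an error surfacing from the transpose code.
        raise ValueError(
            "unstack() on a DataFrame without a MultiIndex requires all "
            f"columns to share one dtype, got: {sorted(dtypes)}"
        )


# Same dtype in every column: passes silently.
check_homogeneous_dtypes(pd.DataFrame({"A": [1.0], "B": [2.0]}))

# Mixed dtypes: the error names unstack(), not transpose.
try:
    check_homogeneous_dtypes(pd.DataFrame({"A": [1.0], "B": ["x"]}))
except ValueError as err:
    message = str(err)
    print(message)
```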

+            # Result's index is a multiindex
+            res.index.names = tuple(df.columns.names) + df.index.names
+            return res
+        else:
+            raise NotImplementedError(
+                "Calling unstack() on a Series without a MultiIndex "
+                "is not supported"
+            )
     else:
         columns = df.index._poplevels(level)
         index = df.index
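
The transpose-then-stack approach in this hunk can be checked against pandas. The sketch below uses made-up data and assumes cuDF follows pandas semantics here. It leaves `dropna` at its default, unlike the PR's `stack(dropna=False)`, because the sample data has no missing values and the flag would not change the result.

```python
import pandas as pd

# Hypothetical single-level-index frame with one dtype across all columns.
df = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [11.0, 12.0, 13.0]},
    index=pd.Index([0, 1, 2], name="row_index"),
)

# Transposing moves the column labels into the index; stacking then folds
# the former row labels in as the innermost index level.
via_transpose = df.T.stack()
direct = df.unstack()

# Raises AssertionError if values, index labels, or index names differ.
pd.testing.assert_series_equal(via_transpose, direct)
```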
37 changes: 36 additions & 1 deletion python/cudf/cudf/tests/test_reshape.py
@@ -402,7 +402,7 @@ def test_pivot_multi_values():
),
],
)
-def test_unstack(level):
+def test_unstack_multiindex(level):
pdf = pd.DataFrame(
{
"foo": ["one", "one", "one", "two", "two", "two"],
@@ -417,6 +417,41 @@ def test_unstack(level):
)


+@pytest.mark.parametrize(
+    "data",
+    [{"A": [1.0, 2.0, 3.0, 4.0, 5.0], "B": [11.0, 12.0, 13.0, 14.0, 15.0]}],
+)
+@pytest.mark.parametrize(
+    "index",
+    [
+        pd.Index(range(0, 5), name=None),
+        pd.Index(range(0, 5), name="row_index"),
+    ],
+)
+@pytest.mark.parametrize(
+    "col_idx",
+    [
+        pd.Index(["a", "b"], name=None),
+        pd.Index(["a", "b"], name="col_index"),
+        pd.MultiIndex.from_tuples([("c", 1), ("c", 2)], names=[None, None]),
+        pd.MultiIndex.from_tuples(
+            [("c", 1), ("c", 2)], names=["col_index1", "col_index2"]
+        ),
+    ],
+)
+def test_unstack_index(data, index, col_idx):
+    pdf = pd.DataFrame(data)
+    gdf = cudf.from_pandas(pdf)
+
+    pdf.index = index
+    pdf.columns = col_idx
+
+    gdf.index = cudf.from_pandas(index)
+    gdf.columns = cudf.from_pandas(col_idx)
+
+    assert_eq(pdf.unstack(), gdf.unstack())
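
The named-MultiIndex case in this parametrization exercises the `res.index.names` assignment in reshape.py. As a pandas-only sketch of what that case expects (shortened, made-up data; cuDF is assumed to match, which is what `assert_eq` verifies), the result's index names are the column level names followed by the row index name:

```python
import pandas as pd

# Shortened version of the test data, using the named variants of the
# row index and the column MultiIndex.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [11.0, 12.0]})
df.index = pd.Index([0, 1], name="row_index")
df.columns = pd.MultiIndex.from_tuples(
    [("c", 1), ("c", 2)], names=["col_index1", "col_index2"]
)

result = df.unstack()

# Column level names come first, then the row index name.
expected_names = list(df.columns.names) + list(df.index.names)
print(list(result.index.names))
```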


def test_pivot_duplicate_error():
gdf = cudf.DataFrame(
{"a": [0, 1, 2, 2], "b": [1, 2, 3, 3], "d": [1, 2, 3, 4]}