Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unstack() support for non-multiindexed dataframes #7054

Merged
merged 6 commits into from
Jan 6, 2021

Conversation

isVoid
Copy link
Contributor

@isVoid isVoid commented Dec 29, 2020

Closes #6694

When unstack() receives a dataframe with "single" index, returns a series to match pandas behavior.

@isVoid isVoid requested a review from a team as a code owner December 29, 2020 23:04
@isVoid isVoid self-assigned this Dec 29, 2020
@isVoid isVoid added Python Affects Python cuDF API. bug Something isn't working non-breaking Non-breaking change labels Dec 29, 2020
@isVoid isVoid added the 3 - Ready for Review Ready for review by team label Dec 30, 2020
@codecov
Copy link

codecov bot commented Dec 30, 2020

Codecov Report

Merging #7054 (57fdd4a) into branch-0.18 (4a1e465) will increase coverage by 0.02%.
The diff coverage is 81.81%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.18    #7054      +/-   ##
===============================================
+ Coverage        82.09%   82.12%   +0.02%     
===============================================
  Files               97       97              
  Lines            16477    16487      +10     
===============================================
+ Hits             13527    13540      +13     
+ Misses            2950     2947       -3     
Impacted Files Coverage Δ
python/cudf/cudf/core/reshape.py 91.00% <81.81%> (-0.04%) ⬇️
python/cudf/cudf/io/csv.py 93.33% <0.00%> (-0.42%) ⬇️
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 90.76% <0.00%> (+0.05%) ⬆️
python/cudf/cudf/core/abc.py 91.48% <0.00%> (+4.25%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 58.53% <0.00%> (+4.87%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4a1e465...57fdd4a. Read the comment docs.


Unstacking single level index dataframe:

>>> df.unstack(['b', 'd']).unstack()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example is a little opaque - it's sometimes difficult to visualize exactly what the result of unstack should be for even a single level, and here I find it a little hard to connect to dots through the chained operation. I'd recommend an example that starts with a dataframe with a single index and shows the result of unstacking that dataframe into a series instead.

"is not supported"
)
if isinstance(df, cudf.DataFrame):
res = df.T.stack(dropna=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this pass the typecasting behavior off to transpose? Should we check the dtypes and possibly error here?

Copy link
Contributor Author

@isVoid isVoid Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like both transpose.pyx and libcudf::transpose.cu checks whether all columns have the same datatype. A clear exception gets raised if the columns are of different types. Should we check again here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would support checking here - imagining what happens here from the user perspective, if I get an error trying to unstack a cuDF dataframe, I might wonder why the transpose code is unhappy.

In general, I think we try and avoid letting libcudf itself serve an error to the user and favor a more surface level python error, usually when I've managed to actually manifest a libcudf error from the python API it means something is very wrong.

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of questions otherwise LGTM.

)
if isinstance(df, cudf.DataFrame):
dtype = df._columns[0].dtype
if any(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't matter much in most cases but I think it's slightly more efficient to use a single loop that raises if df._columns[i].dtype != dtype as opposed to creating a list with all the booleans and then doing an any reduction.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised an issue since this comes up in a bunch of places #7067

@harrism harrism changed the title Adds unstack() support for non-multiindexed dataframes Add unstack() support for non-multiindexed dataframes Jan 5, 2021
@isVoid
Copy link
Contributor Author

isVoid commented Jan 5, 2021

rerun tests

@isVoid isVoid added 6 - Okay to Auto-Merge and removed 3 - Ready for Review Ready for review by team labels Jan 6, 2021
@rapids-bot rapids-bot bot merged commit 1930432 into rapidsai:branch-0.18 Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA]use cuDF dataframe cannot use unstack() function
2 participants