Add dseries.struct.explode #9086

Merged
merged 4 commits into from
Sep 22, 2021

Contributor

@isVoid isVoid commented Aug 20, 2021

Closes #8660

Per the discussion in #8872, this PR adds a member function to the struct accessor that provides a lateral view of a struct-type series: each struct field becomes its own column.

Example:

>>> import cudf, dask_cudf as dgd
>>> ds = dgd.from_cudf(cudf.Series(
...     [{'a': 42, 'b': 'str1', 'c': [-1]},
...      {'a': 0,  'b': 'str2', 'c': [400, 500]},
...      {'a': 7,  'b': '',     'c': []}]), npartitions=2)
>>> ds.struct.explode().compute()
    a     b           c
0  42  str1        [-1]
1   0  str2  [400, 500]
2   7                []
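
For readers without a GPU, the lateral view shown above can be sketched in plain Python. This only illustrates the semantics and is not the dask_cudf implementation: the helper name `explode_structs` is hypothetical, and struct rows are modeled as ordinary dicts.

```python
def explode_structs(rows):
    """Turn a list of dict "structs" into a dict of columns, one per field.

    Hypothetical helper for illustration only; not part of cudf or dask_cudf.
    """
    # Collect field names in first-seen order so column order is stable.
    names = []
    for row in rows:
        for name in row:
            if name not in names:
                names.append(name)
    # Build one column per field; a field missing from a row becomes None.
    return {name: [row.get(name) for row in rows] for name in names}


rows = [
    {"a": 42, "b": "str1", "c": [-1]},
    {"a": 0, "b": "str2", "c": [400, 500]},
    {"a": 7, "b": "", "c": []},
]
columns = explode_structs(rows)
for name, values in columns.items():
    print(name, values)
```

Each key of `columns` corresponds to one column of the exploded frame in the example above.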

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 20, 2021
@isVoid isVoid self-assigned this Aug 20, 2021
@isVoid isVoid added the 2 - In Progress Currently a work in progress label Aug 20, 2021
@isVoid isVoid marked this pull request as ready for review August 24, 2021 21:31
@isVoid isVoid requested a review from a team as a code owner August 24, 2021 21:31
@isVoid isVoid added 3 - Ready for Review Ready for review by team feature request New feature or request non-breaking Non-breaking change and removed 2 - In Progress Currently a work in progress labels Aug 24, 2021
Member

@rjzamora rjzamora left a comment

Thanks for putting this together, @isVoid! Nice work: the implementation is clean. I just have a minor suggestion for the new test (and a docstring nit-pick).

python/dask_cudf/dask_cudf/accessors.py Outdated
python/dask_cudf/dask_cudf/tests/test_accessor.py Outdated
@isVoid
Contributor Author

isVoid commented Aug 25, 2021

I noticed one of the cases (for nested struct columns) failed because the field names were not properly reconstructed. Investigating.

@isVoid
Contributor Author

isVoid commented Aug 25, 2021

I think the nested field names are dropped when a dask_cudf object is constructed from a nested type, since the same behavior shows up when simply constructing a dask_cudf object:

>>> import cudf, dask_cudf
>>> ds = dask_cudf.from_cudf(cudf.Series([{'a': 123, 'b': {'c': 456}}]), 2)
>>> ds.compute()
0    {'a': 123, 'b': {'0': 456}}
dtype: struct
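
One way to make this kind of regression visible in a test is to compare the nested field names before and after the round trip. The sketch below is a plain-Python illustration, assuming struct values are available on the host as nested dicts; `field_paths` is a hypothetical helper, not a cudf or dask_cudf function.

```python
def field_paths(value, prefix=()):
    """Yield the path of every leaf field in a nested dict.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, dict):
        for name, child in value.items():
            yield from field_paths(child, prefix + (name,))
    else:
        yield prefix


expected = {"a": 123, "b": {"c": 456}}  # value passed to the constructor
observed = {"a": 123, "b": {"0": 456}}  # value reported after compute()

# Paths present in the input but absent from the result.
missing = sorted(set(field_paths(expected)) - set(field_paths(observed)))
print(missing)
```

Here `missing` holds the single path `("b", "c")`, the nested field whose name was replaced by a positional `'0'`.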

This reminds me of a similar issue we had in cudf:

rapids-bot bot pushed a commit that referenced this pull request Aug 27, 2021
Closes #9121 

Child column type metadata is applied after the column is sliced. This resolves the issue of missing field names for nested struct columns in `__getitem__()`.

In the process of working on this, I also ran into an issue with `StructColumn.to_arrow()`. This blocks proper testing of the behavior because `assert_eq` requires comparing the objects on the host.

Unblocks #9086

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #9131
@isVoid
Contributor Author

isVoid commented Aug 27, 2021

rerun tests

@codecov

codecov bot commented Aug 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@8075199).
The diff coverage is n/a.

❗ Current head 5c7f72e differs from pull request most recent head 61bc39d. Consider uploading reports for the commit 61bc39d to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #9086   +/-   ##
===============================================
  Coverage                ?   10.84%           
===============================================
  Files                   ?      116           
  Lines                   ?    18781           
  Branches                ?        0           
===============================================
  Hits                    ?     2037           
  Misses                  ?    16744           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@isVoid isVoid requested a review from rjzamora August 27, 2021 22:30
Member

@rjzamora rjzamora left a comment

LGTM - Thanks!

@isVoid isVoid added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 5, 2021
@galipremsagar
Contributor

rerun tests

@galipremsagar
Contributor

@gpucibot merge

@galipremsagar
Contributor

rerun tests

@galipremsagar
Contributor

Not re-triggering the CI so that @rapidsai/ops can look into the logs here.

@galipremsagar
Contributor

rerun tests

@galipremsagar
Contributor

rerun tests

@galipremsagar
Contributor

@gpucibot merge

@galipremsagar
Contributor

rerun tests

@galipremsagar
Contributor

rerun tests

@rapids-bot rapids-bot bot merged commit 10fd071 into rapidsai:branch-21.10 Sep 22, 2021
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
feature request: New feature or request
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] Explode struct column into multiple columns with Dask-cuDF
3 participants