Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support "unflatten" of columns flattened via flatten_nested_columns(): #8956

Merged
merged 16 commits into from
Aug 20, 2021

Conversation

mythrocks
Copy link
Contributor

cudf::flatten_nested_columns() flattens out STRUCT columns into their
constituent member columns, and includes the STRUCT's validity information
as a BOOL8 column.
E.g. STRUCT_1< STRUCT_2< A, B >, C > is flattened to:
1. Null Vector for STRUCT_1
2. Null Vector for STRUCT_2
3. Member STRUCT_2::A
4. Member STRUCT_2::B
5. Member STRUCT_1::C

This commit adds an unflatten_nested_columns() method to convert back
from a flattened representation to the nested columns.

`cudf::flatten_nested_columns()` flattens out STRUCT columns into their
constituent member columns, and includes the STRUCT's validity information
as a BOOL8 column.
E.g. STRUCT_1< STRUCT_2< A, B >, C > is flattened to:
     1. Null Vector for STRUCT_1
     2. Null Vector for STRUCT_2
     3. Member STRUCT_2::A
     4. Member STRUCT_2::B
     5. Member STRUCT_1::C

This commit adds an `unflatten_nested_columns()` method to convert back
from a flattened representation to the nested columns.
@mythrocks mythrocks added feature request New feature or request non-breaking Non-breaking change labels Aug 4, 2021
@mythrocks mythrocks self-assigned this Aug 4, 2021
@mythrocks mythrocks requested a review from a team as a code owner August 4, 2021 18:25
@mythrocks mythrocks requested review from ttnghia and jrhemstad August 4, 2021 18:25
@github-actions github-actions bot added CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels Aug 4, 2021
@codecov
Copy link

codecov bot commented Aug 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@5869264). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head bdbf762 differs from pull request most recent head 94bb184. Consider uploading reports for the commit 94bb184 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #8956   +/-   ##
===============================================
  Coverage                ?   10.73%           
===============================================
  Files                   ?      114           
  Lines                   ?    18666           
  Branches                ?        0           
===============================================
  Hits                    ?     2003           
  Misses                  ?    16663           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5869264...94bb184. Read the comment docs.

@mythrocks mythrocks added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Aug 5, 2021
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor comments. Overall, I would like to recommend more polishing in style/formatting. Will get back again later to see if there's anything more to suggest :)

cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
@mythrocks mythrocks requested a review from devavret August 10, 2021 05:12
@mythrocks
Copy link
Contributor Author

mythrocks commented Aug 10, 2021

Took @jrhemstad off the hook. Added @devavret, on whom I'll also be relying for the related structs groupby review.

@mythrocks mythrocks removed the request for review from jrhemstad August 10, 2021 21:38
@mythrocks
Copy link
Contributor Author

rerun tests

@mythrocks
Copy link
Contributor Author

The failure does not seem to be related to this change. From the failure logs:

18:13:51 gtests/ARROW_IO_SOURCE_TEST: error while loading shared libraries: libarrow.so.400: cannot open shared object file: No such file or directory

I'm not sure why this is happening.

@ttnghia
Copy link
Contributor

ttnghia commented Aug 13, 2021

The CI error seems to be fixed.
Rerun tests.

@mythrocks
Copy link
Contributor Author

rerun tests

Copy link
Contributor

@devavret devavret left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
1. Minor formatting.
2. Reuse `is_struct()` in a couple of places.
3. Modified test numerical values.
@mythrocks mythrocks requested a review from ttnghia August 19, 2021 22:27
@mythrocks
Copy link
Contributor Author

Thanks for the reviews, chaps. I'll check this in after CI tests pass.

@mythrocks
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8c92812 into rapidsai:branch-21.10 Aug 20, 2021
mythrocks added a commit to mythrocks/cudf that referenced this pull request Aug 20, 2021
This commit adds support for `STRUCT` columns in `groupby`. This should now allow for `groupby` aggregations to work when any of the grouping columns are `STRUCT`, including nested `STRUCTS`.

Note: `List` columns are still not supported on `groupby`, even as members of `STRUCT` columns, at any level of nesting. Only `STRUCT`, `STRUCT<STRUCT>`, etc. are currently supported.

Depends on rapidsai#8956 (i.e. `unflatten_nested_columns()`).
rapids-bot bot pushed a commit that referenced this pull request Aug 26, 2021
This commit adds support for `STRUCT` columns in `groupby`. This should now allow for groupby aggregations to work when any of the grouping columns are `STRUCT`, including nested `STRUCTS`.

Note: List columns are still not supported on `groupby`, even as members of `STRUCT` columns, at any level of nesting. Only `STRUCT`, `STRUCT<STRUCT>`, etc. are currently supported.

Depends on #8956 (i.e. `unflatten_nested_columns()`).

Authors:
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #9024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants