Update `groupby::hash` to use new row operators for keys #10770

PointKernel · 2022-05-02T22:24:38Z

Contributes to #10186

This PR updates `groupby::hash` to use the new row operators. It removes the existing "flattened nested column" logic and allows `groupby::hash` to handle LIST and STRUCT keys. The work also includes small cleanups, such as removing unnecessary template parameters and unused arguments.

This is a breaking PR: the updated `groupby::hash` treats inner nulls as equal when top-level nulls are excluded, whereas the current behavior treats inner nulls as unequal.
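The difference in null handling can be illustrated with a small toy model. This is a plain-Python sketch, not libcudf code: the function `group_struct_keys`, its `inner_nulls_equal` flag, and the sample data are all hypothetical, struct keys are modeled as tuples, and nulls as `None`. It assumes that "unequal" means a row with an inner null never matches any other row.

```python
def group_struct_keys(keys, values, inner_nulls_equal=True):
    """Toy model: sum `values` per struct key, excluding top-level nulls.

    `keys` is a list of tuples (structs) or None (a top-level null struct).
    A None inside a tuple models an inner null field.
    """
    sums = {}
    for row, (key, value) in enumerate(zip(keys, values)):
        if key is None:
            # Top-level null key: the row is dropped when nulls are excluded.
            continue
        has_inner_null = any(field is None for field in key)
        # Assumption: when inner nulls compare unequal, a row containing one
        # never matches another row, so it is tagged with its own row index.
        tag = row if (has_inner_null and not inner_nulls_equal) else None
        sums[(key, tag)] = sums.get((key, tag), 0) + value
    return [(key, total) for (key, _), total in sums.items()]


keys = [(1, None), None, (1, None), (2, "a"), (2, "a")]
values = [10, 20, 30, 40, 50]

# Behavior described for this PR: inner nulls compare equal.
new_result = group_struct_keys(keys, values, inner_nulls_equal=True)
# Behavior described as current: inner nulls compare unequal.
old_result = group_struct_keys(keys, values, inner_nulls_equal=False)
```

In this model, the two `(1, None)` rows fall into one group when inner nulls are equal and into two separate groups otherwise. The row whose key is a top-level null is dropped in both cases.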

codecov · 2022-05-09T19:34:08Z

Codecov Report

Merging #10770 (efd497e) into branch-22.06 (e0d94f3) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10770      +/-   ##
================================================
+ Coverage         86.28%   86.32%   +0.03%     
================================================
  Files               144      144              
  Lines             22654    22668      +14     
================================================
+ Hits              19548    19569      +21     
+ Misses             3106     3099       -7

Impacted Files	Coverage Δ
python/cudf/cudf/utils/ioutils.py	`79.47% <0.00%> (-0.13%)`	⬇️
python/cudf/cudf/io/json.py	`97.56% <0.00%> (ø)`
python/dask_cudf/dask_cudf/tests/test_groupby.py	`100.00% <0.00%> (ø)`
python/dask_cudf/dask_cudf/groupby.py	`97.42% <0.00%> (+0.02%)`	⬆️
python/cudf/cudf/core/dataframe.py	`93.78% <0.00%> (+0.04%)`	⬆️
python/cudf/cudf/core/column/string.py	`88.78% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.79% <0.00%> (+0.22%)`	⬆️
python/dask_cudf/dask_cudf/core.py	`73.62% <0.00%> (+0.26%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`96.17% <0.00%> (+0.29%)`	⬆️
python/cudf/cudf/core/tools/datetimes.py	`84.49% <0.00%> (+0.30%)`	⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe9aaeb...efd497e. Read the comment docs.

PointKernel · 2022-05-11T01:10:47Z

@mythrocks requesting your review as well since you authored the flattened nested column work.

PointKernel · 2022-05-24T19:44:49Z

Setting the current work as breaking since the behavior is changed when nulls are excluded. See #10770 (comment)

PointKernel · 2022-05-24T21:13:50Z

rerun tests

jrhemstad · 2022-05-24T22:16:01Z

> It becomes a breaking PR due to two reasons:
>
> - The updated groupby::hash matches pandas' behavior, i.e. when nulls are excluded, the output will drop top-level nulls for struct/list but include struct/list containing null elements.
> - Nulls are always treated as equal.

Wait, what? Why do we need to make these changes? I don't see these reflected in the top-level groupby API either.

PointKernel · 2022-05-25T11:34:13Z

PR description updated to provide clearer breaking information.

PointKernel · 2022-05-25T11:35:01Z

@gpucibot merge

Closes #10952

After #10770 was merged there are no more uses of `unflatten_nested_columns`. This PR removes `unflatten_nested_columns` and adjusts the tests accordingly.

Authors:
- Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11421

PointKernel added 2 commits May 2, 2022 13:14

Use new row hasher and comparator

d335153

Get rid of Map template

b227ec4

PointKernel added the `feature request`, `2 - In Progress`, `libcudf`, and `non-breaking` labels May 2, 2022

PointKernel self-assigned this May 2, 2022

PointKernel added 3 commits May 5, 2022 11:24

Fix a bug: update the lifecycle of preprocessed table

3deb64b

Merge remote-tracking branch 'upstream/branch-22.06' into groupby-new-row-operator

1c7d9f4

Get rid of flattened columns

bf94a8d

jrhemstad reviewed May 9, 2022

cpp/src/groupby/hash/groupby.cu (outdated)

jrhemstad reviewed May 9, 2022

cpp/src/groupby/hash/groupby.cu (outdated)

PointKernel added 2 commits May 9, 2022 12:59

Fix a bug: keys always have nulls

80d8f87

Pass shared_ptr of preprocessed table by value

5f704ec

Add structs argmax unit tests

6de7c0b

PointKernel mentioned this pull request May 10, 2022

[FEA] Story - Supporting row operators on nested types #10186

Closed

Add basic list tests

70d740f

github-actions bot added the `CMake` label May 10, 2022

PointKernel added 2 commits May 10, 2022 19:42

Add all null input tests

1a70016

Add lists with nulls tests

965eba4

PointKernel marked this pull request as ready for review May 11, 2022 01:08

PointKernel requested a review from a team as a code owner May 11, 2022 01:08

PointKernel requested review from karthikeyann, jrhemstad and mythrocks May 11, 2022 01:08

PointKernel added the `3 - Ready for Review` label and removed the `2 - In Progress` label May 11, 2022

Update unit tests to exercise null elements in list keys

c8b1aab

PointKernel requested a review from devavret May 24, 2022 19:21

PointKernel added the `breaking` label and removed the `non-breaking` label May 24, 2022

ttnghia reviewed May 24, 2022

cpp/src/groupby/hash/groupby.cu

ttnghia reviewed May 24, 2022

cpp/src/groupby/groupby.cu

ttnghia reviewed May 24, 2022

cpp/src/groupby/hash/groupby.cu

PointKernel mentioned this pull request May 24, 2022

[FEA] Deprecate unflatten_nested_columns #10952

Closed

Minor cleanup

002ad40

ttnghia reviewed May 24, 2022

cpp/src/groupby/hash/groupby.cu (outdated)

PointKernel changed the title ~~Update groupby::hash to use new row operators~~ Update groupby::hash to use new row operators for keys May 24, 2022

PointKernel added 2 commits May 24, 2022 16:49

Remove unused header

b76677c

Throw when null structs are excluded

efd497e

devavret approved these changes May 25, 2022

rapids-bot bot merged commit dbd2b08 into rapidsai:branch-22.06 May 25, 2022

ttnghia mentioned this pull request Jun 1, 2022

[FEA] Experiment try to unify implementation of non-nested types and nested types in cudf::contains #11016

Open

bdice mentioned this pull request Jun 6, 2022

[FEA] Support lists as groupby keys #8039

Closed

This was referenced Jun 15, 2022

Refactor semi_anti_join #11100

Merged

Support duplicate_keep_option in cudf::distinct #11052

Merged

SrikarVanavasam mentioned this pull request Aug 1, 2022

Deprecate unflatten_nested_columns #11421

Merged

GregoryKimball mentioned this pull request Oct 3, 2022

[FEA] Implement full support for nested types #11844

Closed

PointKernel deleted the groupby-new-row-operator branch November 16, 2022 20:32
