Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update groupby::hash to use new row operators for keys #10770

Merged
merged 33 commits into from
May 25, 2022

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented May 2, 2022

Related to #8039 and #10181

Contributes to #10186

This PR updates groupby::hash to use new row operators. It gets rid of the current "flattened nested column" logic and allows groupby::hash to handle LIST and STRUCT keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated groupby::hash will treat inner nulls as equal when top-level nulls are excluded
while the current behavior treats inner nulls as unequal.

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels May 2, 2022
@PointKernel PointKernel self-assigned this May 2, 2022
@codecov
Copy link

codecov bot commented May 9, 2022

Codecov Report

Merging #10770 (efd497e) into branch-22.06 (e0d94f3) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10770      +/-   ##
================================================
+ Coverage         86.28%   86.32%   +0.03%     
================================================
  Files               144      144              
  Lines             22654    22668      +14     
================================================
+ Hits              19548    19569      +21     
+ Misses             3106     3099       -7     
Impacted Files Coverage Δ
python/cudf/cudf/utils/ioutils.py 79.47% <0.00%> (-0.13%) ⬇️
python/cudf/cudf/io/json.py 97.56% <0.00%> (ø)
python/dask_cudf/dask_cudf/tests/test_groupby.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/groupby.py 97.42% <0.00%> (+0.02%) ⬆️
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/dask_cudf/dask_cudf/core.py 73.62% <0.00%> (+0.26%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe9aaeb...efd497e. Read the comment docs.

@github-actions github-actions bot added the CMake CMake build issue label May 10, 2022
@PointKernel PointKernel marked this pull request as ready for review May 11, 2022 01:08
@PointKernel PointKernel requested a review from a team as a code owner May 11, 2022 01:08
@PointKernel
Copy link
Member Author

@mythrocks requesting your review as well since you authored the flattened nested column work.

@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 11, 2022
@PointKernel PointKernel requested a review from devavret May 24, 2022 19:21
@PointKernel PointKernel added breaking Breaking change and removed non-breaking Non-breaking change labels May 24, 2022
@PointKernel
Copy link
Member Author

Setting the current work as breaking since the behavior is changed when nulls are excluded. See #10770 (comment)

@PointKernel PointKernel changed the title Update groupby::hash to use new row operators Update groupby::hash to use new row operators for keys May 24, 2022
@PointKernel
Copy link
Member Author

rerun tests

@jrhemstad
Copy link
Contributor

It becomes a breaking PR due to two reasons:

  • The updated groupby::hash matches pandas' behavior, i.e. when nulls are excluded, the output will drop top-level nulls for struct/list but include struct/list containing null elements.
  • Nulls are always treated as equal.

Wait, what? Why do we need to make these changes? I don't see these reflected in the top-level groupby API either.

@PointKernel
Copy link
Member Author

PR description updated to provide clearer breaking information.

@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dbd2b08 into rapidsai:branch-22.06 May 25, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 3, 2022
Closes #10952 

After #10770 was merged there are no more uses of `unflatten_nested_columns`. This pr removes `unflatten_nested_columns` and adjusts the tests accordingly.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11421
@PointKernel PointKernel deleted the groupby-new-row-operator branch November 16, 2022 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants