Add groupby::replace_nulls(replace_policy) api #7118

Merged
merged 39 commits into from
May 24, 2021

Conversation

isVoid
Contributor

@isVoid isVoid commented Jan 11, 2021

Part 1 of #4896, follow-up to #6907

This PR provides a groupby version of the replace_nulls(replace_policy) function. A regular replace_nulls(replace_policy) operation replaces each null with the first non-null value that precedes or follows it, depending on the policy. The groupby version is similar, except that the look-up for a non-null value is bounded by the group the null belongs to.

Here is an example to illustrate the API input/output behavior:

#Input:
keys = [2, 1, 2, 1]
values = [3, 4, NULL, NULL]

#Output, group order is not guaranteed:
sorted_keys = [1, 1, 2, 2]
result = [4, 4, 3, 3]
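
The example above can be reproduced with a minimal pure-Python sketch. This is not the libcudf implementation: the function name `groupby_replace_nulls`, the string policy names `"preceding"` and `"following"`, and the use of `None` for NULL are assumptions made for illustration. The sketch sorts rows by key, so its output order is deterministic, whereas the API described here does not guarantee group order.

```python
def groupby_replace_nulls(keys, values, policy="preceding"):
    """Illustrative sketch: fill None entries using values from the same group only.

    Returns (sorted_keys, filled_values), with rows ordered by key.
    """
    if policy not in ("preceding", "following"):
        raise ValueError(f"unknown policy: {policy!r}")

    # sorted() is stable, so rows keep their relative order within each group.
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    sorted_keys = [keys[i] for i in order]
    result = [values[i] for i in order]

    n = len(result)
    # Walk forward to pick up preceding values, backward to pick up following ones.
    indices = range(n) if policy == "preceding" else range(n - 1, -1, -1)

    current_key = object()  # sentinel that never equals a real key
    last_value = None
    for i in indices:
        if sorted_keys[i] != current_key:
            # Entering a new group: do not carry a value across the boundary.
            current_key = sorted_keys[i]
            last_value = None
        if result[i] is None:
            result[i] = last_value  # stays None if the group has no candidate yet
        else:
            last_value = result[i]
    return sorted_keys, result


sorted_keys, result = groupby_replace_nulls([2, 1, 2, 1], [3, 4, None, None])
print(sorted_keys, result)
```

In this sketch, a null with no non-null value before it (or after it, for `"following"`) inside its own group is left as null.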

@isVoid added the labels non-breaking, libcudf, 2 - In Progress, Cython, and feature request on Jan 11, 2021
@isVoid
Contributor Author

isVoid commented Jan 11, 2021

Requires NVIDIA/thrust#1374

Contributor

@jrhemstad jrhemstad left a comment

This approach seems sub-optimal to me as it requires deep-copying the group labels just to do a groupby fillna. If instead you could request the fillna directly from the groupby object you could avoid that deep copy.

@isVoid
Contributor Author

isVoid commented Jan 12, 2021

> This approach seems sub-optimal to me as it requires deep-copying the group labels just to do a groupby fillna. If instead you could request the fillna directly from the groupby object you could avoid that deep copy.

I thought about both. The current implementation seems less intrusive to the groupby API. I was looking at groupby::get_groups and assumed that retrieving groupby meta information is common practice. But it does raise performance concerns. Will address in the next commit.

@isVoid changed the title from "Add replace_nulls with key parameter, segmented null replacements" to "Add groupby::replace_nulls, segmented null replacements" on Jan 12, 2021
@harrism
Member

harrism commented Jan 13, 2021

@isVoid can you add more clarity to the PR description? In particular, I don't understand what "segmented null replacement" means. Are you replacing nulls in only some groups? If it is all groups, then I would think a regular "flat" null replacement would work.

@isVoid changed the title from "Add groupby::replace_nulls, segmented null replacements" to "Add groupby::replace_nulls, null value replacements within groups" on Jan 13, 2021
@isVoid
Contributor Author

isVoid commented Jan 13, 2021

> @isVoid can you add more clarity to the PR description? In particular, I don't understand what "segmented null replacement" means. Are you replacing nulls in only some groups? If it is all groups, then I would think a regular "flat" null replacement would work.

@harrism Yes, it includes all groups. Does this look good?
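
To make the distinction concrete, here is a small pure-Python sketch contrasting a "flat" preceding fill with a per-group preceding fill on the data from the PR description. The helper names `fill_preceding` and `fill_preceding_by_group` are hypothetical, `None` stands in for NULL, and the grouped version keeps the original row order only so that the two results can be compared side by side (the API in this PR returns rows in group order instead).

```python
def fill_preceding(values):
    """Flat fill: each None takes the nearest earlier non-None value."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def fill_preceding_by_group(keys, values):
    """Grouped fill: each None takes the nearest earlier non-None value with the same key."""
    last_seen = {}  # key -> most recent non-None value seen for that key
    out = []
    for k, v in zip(keys, values):
        if v is not None:
            last_seen[k] = v
        out.append(last_seen.get(k))  # None if the group has no earlier value
    return out


keys = [2, 1, 2, 1]
values = [3, 4, None, None]

flat = fill_preceding(values)
grouped = fill_preceding_by_group(keys, values)
print(flat)
print(grouped)
```

The flat fill lets the value 4 from key 1 leak into the row with key 2, while the grouped fill never crosses a group boundary, so the two results differ even though nulls in all groups are replaced.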

@harrism
Member

harrism commented Jan 13, 2021

Since the documentation is not written yet, I still don't understand, but I'll wait.

cpp/src/groupby/sort/group_replace_null.cu (outdated, resolved)
cpp/include/cudf/detail/replace/nulls.cuh (outdated, resolved)
@isVoid isVoid marked this pull request as ready for review January 20, 2021 21:04
@isVoid isVoid requested review from a team as code owners January 20, 2021 21:04
@codecov

codecov bot commented May 5, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@0ebf7e6).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.06    #7118   +/-   ##
===============================================
  Coverage                ?   82.89%           
===============================================
  Files                   ?      105           
  Lines                   ?    17875           
  Branches                ?        0           
===============================================
  Hits                    ?    14818           
  Misses                  ?     3057           
  Partials                ?        0           

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0ebf7e6...7a90f52.

Contributor

@nvdbaranec nvdbaranec left a comment

Couple of very tiny things.

cpp/include/cudf/groupby.hpp (outdated, resolved)
cpp/src/groupby/groupby.cu (outdated, resolved)
@isVoid isVoid requested a review from nvdbaranec May 11, 2021 17:05
@@ -202,6 +207,32 @@ cdef class GroupBy:

        return Table(data=result_data, index=grouped_keys)

    def replace_nulls(self, Table values, object method):
        cdef table_view val_view = values.view()
Contributor

Still scratching my head to make sure but I think this might unnecessarily materialize a RangeIndex.

Contributor Author

It might, but that depends on whether the caller passes a RangeIndex in values. Here, values holds the value columns of the groupby.replace_nulls operation, so a RangeIndex shouldn't exist. (The Python interface has not been added yet, so this is a bit hard to see.)

Contributor

@brandon-b-miller brandon-b-miller left a comment

One question; otherwise the Cython LGTM.

@isVoid isVoid self-assigned this May 24, 2021
@vuule added the 5 - Ready to Merge label and removed the 3 - Ready for Review label on May 24, 2021
@jrhemstad
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6dbf2d5 into rapidsai:branch-21.06 May 24, 2021
Labels
5 - Ready to Merge, CMake, feature request, libcudf, non-breaking, Python

8 participants