Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of values=None in pylibcudf GroupBy.get_groups #14998

Merged

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Feb 7, 2024

A small bug in our previous implementation leads to a segfault when calling .get_groups() with no values. Thankfully, the cuDF Python API always calls this function with a value, but it's possible pylibcudf consumers will not.

@shwina shwina added the bug Something isn't working label Feb 7, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 7, 2024
@shwina shwina added non-breaking Non-breaking change Python Affects Python cuDF API. and removed Python Affects Python cuDF API. labels Feb 7, 2024
@shwina shwina marked this pull request as ready for review February 7, 2024 20:38
@shwina shwina requested a review from a team as a code owner February 7, 2024 20:39
@shwina shwina requested review from wence- and charlesbluca February 7, 2024 20:39
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're at it, can we save a reference to the Table in the GroupBy object?

The columns to get group labels for. If not specified, the group
labels for the group keys are returned.
The columns to get group labels for. If not specified,
an empty table is returned for the group values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return an empty table or should we not return anything at all? On one hand I tend to prefer the return type not changing based on arguments, on the other I would prefer not creating an extra object for no reason and there is precedent for this kind of thing in numpy APIs (like np.unique).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No preference. I changed t oreturn None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also saved a reference to keys table.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought for your consideration, but I don't have a strong opinion so feel free to address however you want, just wanted to raise the question. Thanks for the fix!

Comment on lines 267 to 268
- A table of group values or None
- A list of integer offsets into the tables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thought: if we returned a pair instead of a triple when values is None, we would probably reorder these outputs to be keys, offsets, values so that the first two outputs would always be the same. Do we still want to do that even if we do return None, such that it's the last return value that gets discarded? keys, offset, _ = ... seems more natural than keys, _, offsets = ... if we do have to discard an object (or if you're indexing, using a contiguous index set res[0] and res[1]).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I've reordered to (offsets, keys, values) as it feels awkward to separate keys and values.

@shwina
Copy link
Contributor Author

shwina commented Feb 8, 2024

/merge

@rapids-bot rapids-bot bot merged commit d855d0e into rapidsai:branch-24.04 Feb 8, 2024
68 checks passed
@vyasr vyasr added the pylibcudf Issues specific to the pylibcudf package label May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

2 participants