Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of values=None in pylibcudf GroupBy.get_groups #14998

Merged
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions python/cudf/cudf/_lib/pylibcudf/groupby.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ cdef class GroupByRequest:

cdef class GroupBy:
cdef unique_ptr[groupby] c_obj
cdef Table _keys
cpdef tuple aggregate(self, list requests)
cpdef tuple scan(self, list requests)
cpdef tuple shift(self, Table values, list offset, list fill_values)
Expand Down
26 changes: 17 additions & 9 deletions python/cudf/cudf/_lib/pylibcudf/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,9 @@ cdef class GroupBy:
sorted keys_are_sorted=sorted.NO
):
self.c_obj.reset(new groupby(keys.view(), null_handling, keys_are_sorted))
# keep a reference to the keys table so it doesn't get
# deallocated from under us:
self._keys = keys

@staticmethod
cdef tuple _parse_outputs(
Expand Down Expand Up @@ -253,26 +256,31 @@ cdef class GroupBy:
Parameters
----------
values : Table, optional
The columns to get group labels for. If not specified, the group
labels for the group keys are returned.
The columns to get group labels for. If not specified,
`None` is returned for the group values.

Returns
-------
Tuple[Table, Table, List[int]]
A tuple of tables containing three items:
- A table of group keys
- A table of group values
- A table of group values or None
- A list of integer offsets into the tables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thought: if we returned a pair instead of a triple when values is None, we would probably reorder these outputs to be keys, offsets, values so that the first two outputs would always be the same. Do we still want to do that even if we do return None, such that it's the last return value that gets discarded? keys, offset, _ = ... seems more natural than keys, _, offsets = ... if we do have to discard an object (or if you're indexing, using a contiguous index set res[0] and res[1]).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. I've reordered to (offsets, keys, values) as it feels awkward to separate keys and values.

"""

cdef groups c_groups
if values:
c_groups = dereference(self.c_obj).get_groups(values.view())
return (
Table.from_libcudf(move(c_groups.keys)),
Table.from_libcudf(move(c_groups.values)),
c_groups.offsets,
)
else:
# c_groups.values is nullptr
c_groups = dereference(self.c_obj).get_groups()

return (
Table.from_libcudf(move(c_groups.keys)),
Table.from_libcudf(move(c_groups.values)),
c_groups.offsets,
)
return (
Table.from_libcudf(move(c_groups.keys)),
None,
c_groups.offsets,
)
8 changes: 8 additions & 0 deletions python/cudf/cudf/tests/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -3756,3 +3756,11 @@ def test_group_by_value_counts_with_count_column():
df = cudf.DataFrame({"a": [1, 5, 3], "count": [2, 5, 2]})
with pytest.raises(ValueError):
df.groupby("a", as_index=False).value_counts()


def test_groupby_internal_groups_empty(gdf):
# test that we don't segfault when calling the internal
# .groups() method with an empty list:
gb = gdf.groupby("y")._groupby
_, grouped_vals, _ = gb.groups([])
assert grouped_vals is None
Loading