Fix a size overflow bug in hash groupby (#16053)
This PR fixes a size overflow bug discovered by @matal-nvidia. It converts the groupby problem size to `int64_t` so that sizing the hash table at 50% occupancy no longer overflows when the number of keys exceeds `INT_MAX / 2`.
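
To illustrate the arithmetic behind the overflow, here is a minimal sketch, not the actual libcudf code. The helpers `table_capacity` and `fits_in_int32` and the ceiling-division formula are hypothetical and only assume that the table is sized as the key count divided by the target occupancy. At 50% occupancy the capacity is twice the key count, so it stops fitting in a 32-bit signed integer once the key count exceeds `INT_MAX / 2`, while the same computation in `int64_t` stays well in range.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not a libcudf API): number of hash table slots needed
// to hold `num_keys` entries at the given occupancy, rounded up.
// All arithmetic is done in 64 bits, so the intermediate product cannot
// overflow for any 32-bit row count.
constexpr std::int64_t table_capacity(std::int64_t num_keys, int occupancy_percent = 50)
{
  return (num_keys * 100 + occupancy_percent - 1) / occupancy_percent;
}

// Hypothetical helper: true if `value` is representable as a 32-bit signed integer.
constexpr bool fits_in_int32(std::int64_t value)
{
  return value <= std::numeric_limits<std::int32_t>::max();
}
```

With these helpers, `table_capacity(std::numeric_limits<std::int32_t>::max() / 2)` still fits in 32 bits, while one more key does not, which matches the threshold named above.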

A unit test for this scenario would saturate device memory and take a long time to run, so it is likely not worth adding.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Nghia Truong (https://github.com/ttnghia)

URL: #16053
PointKernel authored Jun 18, 2024
1 parent 102d30a commit 231cb71
Showing 3 changed files with 5 additions and 3 deletions.
3 changes: 2 additions & 1 deletion cpp/src/groupby/hash/groupby.cu
@@ -553,7 +553,8 @@ std::unique_ptr<table> groupby(table_view const& keys,
                                rmm::cuda_stream_view stream,
                                rmm::device_async_resource_ref mr)
 {
-  auto const num_keys            = keys.num_rows();
+  // convert to int64_t to avoid potential overflow with large `keys`
+  auto const num_keys            = static_cast<int64_t>(keys.num_rows());
   auto const null_keys_are_equal = null_equality::EQUAL;
   auto const has_null            = nullate::DYNAMIC{cudf::has_nested_nulls(keys)};

3 changes: 2 additions & 1 deletion java/src/test/java/ai/rapids/cudf/TableTest.java
@@ -7838,11 +7838,12 @@ void testSumWithStrings() {
         .build();
          Table result = t.groupBy(0).aggregate(
              GroupByAggregation.sum().onColumn(1));
+         Table sorted = result.orderBy(OrderByArg.asc(0));
          Table expected = new Table.TestBuilder()
              .column("1-URGENT", "3-MEDIUM")
              .column(5289L + 5303L, 5203L + 5206L)
              .build()) {
-      assertTablesAreEqual(expected, result);
+      assertTablesAreEqual(expected, sorted);
     }
   }

2 changes: 1 addition & 1 deletion python/cudf/cudf/core/groupby/groupby.py
@@ -1308,7 +1308,7 @@ def pipe(self, func, *args, **kwargs):
         To get the difference between each groups maximum and minimum value
         in one pass, you can do
-        >>> df.groupby('A').pipe(lambda x: x.max() - x.min())
+        >>> df.groupby('A', sort=True).pipe(lambda x: x.max() - x.min())
         B
         A
         a 2
