[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False
#16908
Labels
bug
Something isn't working
Describe the bug
When `mode.pandas_compatible` is `True`, we do various things to match pandas ordering guarantees in groupbys and joins. Specifically, when we request a groupby-aggregation with `sort=False`, such that the returned object has keys appearing in input-table order, cudf produces the incorrect result.
Steps/Code to reproduce bug
Expected behavior
Should match pandas
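To make the expected ordering concrete, here is a minimal pure-Python sketch of the guarantee described above: with `sort=False`, group keys appear in the order they are first seen in the input. The helper `groupby_sum` is hypothetical and only emulates the ordering semantics; it is not the pandas or cudf implementation.

```python
def groupby_sum(keys, values, sort):
    """Hypothetical helper: sum values per key, emulating groupby ordering."""
    totals = {}
    for key, value in zip(keys, values):
        # dicts preserve insertion order, so keys stay in first-appearance order
        totals[key] = totals.get(key, 0) + value
    items = list(totals.items())
    # sort=True orders groups by key; sort=False keeps input-table order
    return sorted(items) if sort else items


keys = ["b", "a", "b", "c", "a"]
values = [1, 2, 3, 4, 5]

unsorted_result = groupby_sum(keys, values, sort=False)
sorted_result = groupby_sum(keys, values, sort=True)
print(unsorted_result)
print(sorted_result)
```

The `sort=False` result lists `b` first because it is the first key in the input, which is the order cudf should reproduce in pandas-compatible mode.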
Additional context
The relevant code is cudf/python/cudf/cudf/core/groupby/groupby.py, lines 723 to 743 in 5780c4d.
The problematic lines are cudf/python/cudf/cudf/core/groupby/groupby.py, lines 738 to 743 in 5780c4d.
We compute a "wanted" order of the group keys and then perform a join with the obtained keys, and then gather the result through the resulting join gather map.
For this map to be the permutation between the wanted order and the obtained order, the left table's gather map must be the identity. However, this is not guaranteed by libcudf join routines.
To fix this, one can sort the `indices` column by the left gather map before calling `result.take`.
See also #16893
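The reasoning above can be illustrated with a small pure-Python sketch. It does not use cudf or libcudf: `inner_join_maps` is a hypothetical stand-in for a join routine that returns matching left and right gather maps in an arbitrary row order (here, deliberately reversed), and the key and value lists are made-up data.

```python
def inner_join_maps(left_keys, right_keys):
    """Toy inner join on unique keys (hypothetical stand-in for a join routine).

    Returns (left_map, right_map) such that
    left_keys[left_map[i]] == right_keys[right_map[i]] for every i.
    The row order of the output is intentionally not the left-table order.
    """
    position = {key: i for i, key in enumerate(right_keys)}
    pairs = [(i, position[key]) for i, key in enumerate(left_keys) if key in position]
    pairs.reverse()  # simulate a join that gives no ordering guarantee
    left_map = [left for left, _ in pairs]
    right_map = [right for _, right in pairs]
    return left_map, right_map


wanted = ["b", "a", "c"]    # group keys in input-table (first-appearance) order
obtained = ["a", "b", "c"]  # group keys in the order the aggregation returned them
values = [10, 20, 30]       # aggregated values, aligned with `obtained`

left_map, right_map = inner_join_maps(wanted, obtained)

# Gathering through right_map directly is only correct if left_map is the identity.
naive = [values[i] for i in right_map]

# Sorting the right gather map by the left gather map restores the wanted order.
fixed_map = [right for _, right in sorted(zip(left_map, right_map))]
fixed = [values[i] for i in fixed_map]

print(naive)
print(fixed)
```

In this sketch the naive gather yields the values for `c`, `a`, `b`, while the gather through the sorted map yields them in the wanted `b`, `a`, `c` order.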