[BUG] groupby aggregation does not match pandas in pandas-compat mode when `sort=False` #16908

wence- · 2024-09-25T11:39:12Z

Describe the bug

When mode.pandas_compatible is True, we do various things to match pandas ordering guarantees in groupbys and joins. Specifically when we request a groupby-aggregation with sort=False, such that the returned object has keys appearing in input-table order, cudf produces the incorrect result.

Steps/Code to reproduce bug

import cudf

import numpy as np

df = cudf.DataFrame({
        "random": np.random.rand(300),
        "groups": np.random.randint(100, size=300),
    }
)

cudf.set_option("mode.pandas_compatible", True)

print(df.groupby("groups", sort=False).agg({"random": "sum"}))
#           random
# groups          
# 20      1.245007
# 33      1.312368
# 75      0.408444
# 25      1.139875
# 1       0.735769
# ...          ...
# 17      1.007452
# 73      1.319591
# 36      1.001927
# 76      2.007808
# 91      1.806850

# [95 rows x 1 columns]

print(df.to_pandas().groupby("groups", sort=False).agg({"random": "sum"}))
#           random
# groups          
# 44      0.386077
# 12      2.480210
# 30      1.415172
# 48      1.443282
# 49      2.888763
# ...          ...
# 97      0.246967
# 69      0.713967
# 8       0.830442
# 80      0.948887
# 39      0.278244

# [95 rows x 1 columns]

Expected behavior

Should match pandas

Additional context

The relevant code is

cudf/python/cudf/cudf/core/groupby/groupby.py

Lines 723 to 743 in 5780c4d

    
           if cudf.get_option( 
        
               "mode.pandas_compatible" 
        
           ) and not libgroupby._is_all_scan_aggregate(normalized_aggs): 
        
               # Even with `sort=False`, pandas guarantees that 
        
               # groupby preserves the order of rows within each group. 
        
               left_cols = list( 
        
                   self.grouping.keys.drop_duplicates()._data.columns 
        
               ) 
        
               right_cols = list(result_index._data.columns) 
        
               join_keys = [ 
        
                   _match_join_keys(lcol, rcol, "left") 
        
                   for lcol, rcol in zip(left_cols, right_cols) 
        
               ] 
        
               # TODO: In future, see if we can centralize 
        
               # logic else where that has similar patterns. 
        
               join_keys = map(list, zip(*join_keys)) 
        
               _, indices = libcudf.join.join( 
        
                   *join_keys, 
        
                   how="left", 
        
               ) 
        
               result = result.take(indices)

The problematic lines are

cudf/python/cudf/cudf/core/groupby/groupby.py

Lines 738 to 743 in 5780c4d

    
           join_keys = map(list, zip(*join_keys)) 
        
           _, indices = libcudf.join.join( 
        
               *join_keys, 
        
               how="left", 
        
           ) 
        
           result = result.take(indices)

We compute a "wanted" order of the group keys and then perform a join with the obtained keys, and then gather the result through the resulting join gather map.

For this map to be the permutation between the wanted order and the obtained order, the left table's gather map must be the identity. However, this is not guaranteed by libcudf join routines.

To fix this, one can sort the indices column by the left gather map before calling result.take.

See also #16893

The text was updated successfully, but these errors were encountered:

When sort=False is requested in groupby aggregations and pandas compatibility mode is enabled, we are on the hook to produce the grouped aggregation result in an order which matches the input key order. We previously nearly did this, but the reordering relied on the (incorrect) assumption that when joining two tables with a left join, the resulting gather map for the left table is the identity. This is not the case. To fix this, we must permute the right (result) table gather map by the ordering that makes the left map the identity (AKA, sort by key with the left map as keys) before permuting the result. While here, replace the (bounds-checking) IndexedFrame.take call with usage of the internal (non-bounds-checking) _gather method. This avoids a redundant reduction over the indices, since by construction they are in bounds. - Closes rapidsai#16908

wence- added the bug Something isn't working label Sep 25, 2024

wence- mentioned this issue Sep 25, 2024

Fix order-preservation in cudf-polars groupby #16907

Merged

3 tasks

wence- self-assigned this Sep 26, 2024

wence- mentioned this issue Sep 27, 2024

Fix order-preservation in pandas-compat unsorted groupby #16942

Merged

3 tasks

rapids-bot bot closed this as completed in #16942 Sep 30, 2024

rapids-bot bot closed this as completed in 9b2f892 Sep 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] groupby aggregation does not match pandas in pandas-compat mode when `sort=False` #16908

[BUG] groupby aggregation does not match pandas in pandas-compat mode when `sort=False` #16908

wence- commented Sep 25, 2024

[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False #16908

[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False #16908

Comments

wence- commented Sep 25, 2024

[BUG] groupby aggregation does not match pandas in pandas-compat mode when `sort=False` #16908

[BUG] groupby aggregation does not match pandas in pandas-compat mode when `sort=False` #16908