Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False #16908

Closed
wence- opened this issue Sep 25, 2024 · 0 comments · Fixed by #16942
Closed

[BUG] groupby aggregation does not match pandas in pandas-compat mode when sort=False #16908

wence- opened this issue Sep 25, 2024 · 0 comments · Fixed by #16942
Assignees
Labels
bug Something isn't working

Comments

@wence-
Copy link
Contributor

wence- commented Sep 25, 2024

Describe the bug

When mode.pandas_compatible is True, we do various things to match pandas ordering guarantees in groupbys and joins. Specifically when we request a groupby-aggregation with sort=False, such that the returned object has keys appearing in input-table order, cudf produces the incorrect result.

Steps/Code to reproduce bug

import cudf

import numpy as np

df = cudf.DataFrame({
        "random": np.random.rand(300),
        "groups": np.random.randint(100, size=300),
    }
)

cudf.set_option("mode.pandas_compatible", True)

print(df.groupby("groups", sort=False).agg({"random": "sum"}))
#           random
# groups          
# 20      1.245007
# 33      1.312368
# 75      0.408444
# 25      1.139875
# 1       0.735769
# ...          ...
# 17      1.007452
# 73      1.319591
# 36      1.001927
# 76      2.007808
# 91      1.806850

# [95 rows x 1 columns]

print(df.to_pandas().groupby("groups", sort=False).agg({"random": "sum"}))
#           random
# groups          
# 44      0.386077
# 12      2.480210
# 30      1.415172
# 48      1.443282
# 49      2.888763
# ...          ...
# 97      0.246967
# 69      0.713967
# 8       0.830442
# 80      0.948887
# 39      0.278244

# [95 rows x 1 columns]

Expected behavior

Should match pandas

Additional context

The relevant code is

if cudf.get_option(
"mode.pandas_compatible"
) and not libgroupby._is_all_scan_aggregate(normalized_aggs):
# Even with `sort=False`, pandas guarantees that
# groupby preserves the order of rows within each group.
left_cols = list(
self.grouping.keys.drop_duplicates()._data.columns
)
right_cols = list(result_index._data.columns)
join_keys = [
_match_join_keys(lcol, rcol, "left")
for lcol, rcol in zip(left_cols, right_cols)
]
# TODO: In future, see if we can centralize
# logic else where that has similar patterns.
join_keys = map(list, zip(*join_keys))
_, indices = libcudf.join.join(
*join_keys,
how="left",
)
result = result.take(indices)

The problematic lines are

join_keys = map(list, zip(*join_keys))
_, indices = libcudf.join.join(
*join_keys,
how="left",
)
result = result.take(indices)

We compute a "wanted" order of the group keys and then perform a join with the obtained keys, and then gather the result through the resulting join gather map.

For this map to be the permutation between the wanted order and the obtained order, the left table's gather map must be the identity. However, this is not guaranteed by libcudf join routines.

To fix this, one can sort the indices column by the left gather map before calling result.take.

See also #16893

@wence- wence- added the bug Something isn't working label Sep 25, 2024
@wence- wence- self-assigned this Sep 26, 2024
wence- added a commit to wence-/cudf that referenced this issue Sep 27, 2024
When sort=False is requested in groupby aggregations and pandas
compatibility mode is enabled, we are on the hook to produce the
grouped aggregation result in an order which matches the input key
order. We previously nearly did this, but the reordering relied on
the (incorrect) assumption that when joining two tables with a left
join, the resulting gather map for the left table is the identity.
This is not the case.

To fix this, we must permute the right (result) table gather map by
the ordering that makes the left map the identity (AKA, sort by key
with the left map as keys) before permuting the result.

While here, replace the (bounds-checking) IndexedFrame.take call with
usage of the internal (non-bounds-checking) _gather method. This
avoids a redundant reduction over the indices, since by construction
they are in bounds.

- Closes rapidsai#16908
@rapids-bot rapids-bot bot closed this as completed in 9b2f892 Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant