Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix order-preservation in cudf-polars groupby #16907

Merged
merged 1 commit into from
Sep 30, 2024

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Sep 25, 2024

Description

When we are requested to maintain order in groupby aggregations we must post-process the result by computing a permutation between the wanted order (of the input keys) and the order returned by the groupby aggregation. To do this, we can perform a join between the two unique key tables. Previously, we assumed that the gather map returned in this join for the left (wanted order) table was the identity. However, this is not guaranteed, in addition to computing the match between the wanted key order and the key order we have, we must also apply the permutation between the left gather map order and the identity.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- added bug Something isn't working 5 - DO NOT MERGE Hold off on merging; see PR for details non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Sep 25, 2024
@wence- wence- requested a review from a team as a code owner September 25, 2024 11:28
@wence- wence- requested review from vyasr and isVoid September 25, 2024 11:28
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 25, 2024
@wence-
Copy link
Contributor Author

wence- commented Sep 25, 2024

I have pointed this at 24.10, but have marked as do not merge since I don't think this is a critical fix. If others agree, I can retarget to 24.12.

Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- You said a similar fix is needed in cudf.pandas? This adds a fair number of steps to the computation. Should we do this at the C++ level?

I am happy to merge this for 24.10 and optimize in 24.12, or just target 24.12.

@wence-
Copy link
Contributor Author

wence- commented Sep 25, 2024

@wence- You said a similar fix is needed in cudf.pandas?

Yes: #16908

This adds a fair number of steps to the computation. Should we do this at the C++ level?

It's one additional sort_by_key (sorting an integer column by an integer key column, which should be a radix sort). So I don't think it's that many more steps.

Note this is only done if the user requests maintain_order=True in the groupby (which is not the default for polars).

I am happy to merge this for 24.10 and optimize in 24.12, or just target 24.12.

The optimisations that is plausibly worthwhile is that we know that both tables have distinct rows and so we could use the new distinct_hash_join code in libcudf.

As you note, since this kind of thing is a somewhat common pattern, it might be worth implementing a libcudf primitive that, given two tables, returns the gather map that is the permutation from one table to the other.

I'd need to look if the join ordering code has the same invariants as the groupby ordering code.

@vyasr
Copy link
Contributor

vyasr commented Sep 25, 2024

After discussion, we're moving this to 24.12.

When we are requested to maintain order in groupby aggregations we
must post-process the result by computing a permutation between the
wanted order (of the input keys) and the order returned by the groupby
aggregation. To do this, we can perform a join between the two unique
key tables. Previously, we assumed that the gather map returned in
this join for the left (wanted order) table was the identity. However,
this is not guaranteed, in addition to computing the match between the
wanted key order and the key order we have, we must also apply the
permutation between the left gather map order and the identity.

- Closes rapidsai#16893
@wence- wence- changed the base branch from branch-24.10 to branch-24.12 September 27, 2024 08:50
@wence- wence- removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Sep 27, 2024
@wence-
Copy link
Contributor Author

wence- commented Sep 27, 2024

After discussion, we're moving this to 24.12.

Done.

@wence-
Copy link
Contributor Author

wence- commented Sep 30, 2024

/merge

@rapids-bot rapids-bot bot merged commit 2b6408b into rapidsai:branch-24.12 Sep 30, 2024
107 checks passed
@wence- wence- deleted the wence/fix/16893 branch September 30, 2024 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.polars Issues specific to cudf.polars non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[BUG] in polars-gpu, group_by(maintain_order=True) is not ordered
4 participants