Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible #14001

Closed
mroeschke opened this issue Aug 29, 2023 · 7 comments
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@mroeschke
Copy link
Contributor

Describe the bug
DataFrame.join(Series) does not maintain order of Series values

Steps/Code to reproduce bug

In [12]: import cudf

In [13]: df = cudf.DataFrame([1, 2])

In [16]: ser = cudf.Series(list("abcdef"), index=[0, 0, 0, 1, 1, 1], name="var1")

In [17]: df.join(ser)
Out[17]: 
   0 var1
0  1    a
0  1    c
0  1    b
1  2    d
1  2    e
1  2    f

In [18]: df.to_pandas().join(ser.to_pandas())
Out[18]: 
   0 var1
0  1    a
0  1    b
0  1    c
1  2    d
1  2    e
1  2    f

Expected behavior
I would expect the order of the joined values to be maintained.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

@mroeschke mroeschke added bug Something isn't working Needs Triage Need team to review and classify Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Aug 29, 2023
@bdice
Copy link
Contributor

bdice commented Aug 29, 2023

We should only expect this is a bug when mode.pandas_compatible is set to True, right?

Otherwise these docs about no result ordering guarantee applies, from what I can tell.

@mroeschke
Copy link
Contributor Author

Ah totally forgot about the ordering guarantee. Yes, this is probably a bug only when cudf.set_option("mode.pandas_compatible", True) (the OP still reproduces even when setting it)

@mroeschke mroeschke changed the title [BUG] DataFrame.join(Series) does not maintain order of Series values [BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible Aug 29, 2023
@bdice
Copy link
Contributor

bdice commented Aug 29, 2023

This will likely require some nontrivial changes, like adding a sequential index column before joining, then sorting after the join? Or perhaps some intermediate sorting of the join indices (can't recall the implementation details to know if this is viable at the moment). Either way, preserving order for a hash join is not ideal for performance, which is why this hasn't been done yet. I am not sure if there's any better way -- we did find some clever ways to implement drop_duplicates in an order-preserving way, and perhaps the same can be done here. I suspect libcudf changes will be needed to resolve this (at least, in an efficient way). If you'd like me to pursue this, please let me know. Some related issues/PRs:

@galipremsagar galipremsagar self-assigned this Sep 1, 2023
@shwina
Copy link
Contributor

shwina commented Oct 23, 2023

like adding a sequential index column before joining, then sorting after the join

I think we should go ahead and implement this solution for now (when Pandas compatibility is enabled).

@shwina
Copy link
Contributor

shwina commented Oct 24, 2023

@mroeschke -- by default, sort=False in the .join() method.

Is it an implementation detail of Pandas that when sort=False, ordering of keys is preserved? Or is it a guarantee? (I couldn't find where in the docs this is defined).

cc: @wence-

@mroeschke
Copy link
Contributor Author

join uses merge under the hood so the key ordering follow what's described in the how section here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

wence- added a commit to wence-/cudf that referenced this issue Nov 16, 2023
If we pass sort=True to merges we are on the hook to sort the result
in order with respect to the key columns. If those key columns have
repeated values there is still some space for ambiguity. Currently we
get a result back whose order (for the repeated key values) is
determined by the gather map that libcudf returns for the join. This
does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees
which we also do not match. To fix this, in pandas-compatible mode
only, reorder the gather maps according to the order of the input
keys. When sort=False this means that our result matches pandas
ordering. When sort=True, it ensures that (if we use a stable sort)
the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when
sorting results.

- Closes rapidsai#14001
- Closes rapidsai/xdf#385
wence- added a commit to wence-/cudf that referenced this issue Nov 16, 2023
If we pass sort=True to merges we are on the hook to sort the result
in order with respect to the key columns. If those key columns have
repeated values there is still some space for ambiguity. Currently we
get a result back whose order (for the repeated key values) is
determined by the gather map that libcudf returns for the join. This
does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees
which we also do not match. To fix this, in pandas-compatible mode
only, reorder the gather maps according to the order of the input
keys. When sort=False this means that our result matches pandas
ordering. When sort=True, it ensures that (if we use a stable sort)
the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when
sorting results.

- Closes rapidsai#14001
wence- added a commit to wence-/cudf that referenced this issue Nov 17, 2023
If we pass sort=True to merges we are on the hook to sort the result
in order with respect to the key columns. If those key columns have
repeated values there is still some space for ambiguity. Currently we
get a result back whose order (for the repeated key values) is
determined by the gather map that libcudf returns for the join. This
does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees
which we also do not match. To fix this, in pandas-compatible mode
only, reorder the gather maps according to the order of the input
keys. When sort=False this means that our result matches pandas
ordering. When sort=True, it ensures that (if we use a stable sort)
the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when
sorting results.

- Closes rapidsai#14001
rapids-bot bot pushed a commit that referenced this issue Nov 17, 2023
)

If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when sorting results.

- Closes #14001

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)

URL: #14428
@wence-
Copy link
Contributor

wence- commented Sep 30, 2024

Cleaning up assigned issues, this is now fixed in 24.12, sorry for the delay!

@wence- wence- closed this as completed Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

5 participants