[BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible #14001

mroeschke · 2023-08-29T22:33:30Z

Describe the bug
DataFrame.join(Series) does not maintain order of Series values

Steps/Code to reproduce bug

In [12]: import cudf

In [13]: df = cudf.DataFrame([1, 2])

In [16]: ser = cudf.Series(list("abcdef"), index=[0, 0, 0, 1, 1, 1], name="var1")

In [17]: df.join(ser)
Out[17]: 
   0 var1
0  1    a
0  1    c
0  1    b
1  2    d
1  2    e
1  2    f

In [18]: df.to_pandas().join(ser.to_pandas())
Out[18]: 
   0 var1
0  1    a
0  1    b
0  1    c
1  2    d
1  2    e
1  2    f

Expected behavior
I would expect the order of the joined values to be maintained.

Environment overview

Environment location: Bare-metal
Method of cuDF install: conda

bdice · 2023-08-29T22:42:01Z

We should only consider this a bug when mode.pandas_compatible is set to True, right?

Otherwise, these docs stating that there is no result ordering guarantee apply, from what I can tell.

mroeschke · 2023-08-29T22:45:15Z

Ah, totally forgot about the ordering guarantee. Yes, this is probably a bug only when cudf.set_option("mode.pandas_compatible", True) is set (the example in the OP still reproduces even with the option set).

bdice · 2023-08-29T23:16:36Z

This will likely require some nontrivial changes, like adding a sequential index column before joining, then sorting after the join? Or perhaps some intermediate sorting of the join indices (can't recall the implementation details to know if this is viable at the moment). Either way, preserving order for a hash join is not ideal for performance, which is why this hasn't been done yet. I am not sure if there's any better way -- we did find some clever ways to implement drop_duplicates in an order-preserving way, and perhaps the same can be done here. I suspect libcudf changes will be needed to resolve this (at least, in an efficient way). If you'd like me to pursue this, please let me know. Some related issues/PRs:

shwina · 2023-10-23T10:24:47Z

> like adding a sequential index column before joining, then sorting after the join

I think we should go ahead and implement this solution for now (when Pandas compatibility is enabled).
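
The "sequential index column, then sort" idea quoted above can be sketched as follows. This is only an illustration under stated assumptions: it uses pandas on the CPU rather than cuDF, the helper name `order_preserving_left_join` and the `_left_pos`/`_right_pos` column names are hypothetical, and the shuffle merely stands in for a join that gives no ordering guarantee.

```python
import pandas as pd


def order_preserving_left_join(left, right, on):
    """Left join that restores input order via helper position columns.

    Hypothetical helper for illustration; not cuDF's implementation.
    """
    # Tag every input row with its original position.
    left_tagged = left.assign(_left_pos=range(len(left)))
    right_tagged = right.assign(_right_pos=range(len(right)))

    joined = left_tagged.merge(right_tagged, on=on, how="left")

    # Stand-in for a join with no ordering guarantee: shuffle the rows.
    joined = joined.sample(frac=1, random_state=0)

    # Restore order: left row position first, then right row position.
    joined = joined.sort_values(["_left_pos", "_right_pos"])
    return joined.drop(columns=["_left_pos", "_right_pos"]).reset_index(drop=True)


# Data shaped like the reproducer in this issue, with the index as a key column.
left = pd.DataFrame({"key": [0, 1], "val": [1, 2]})
right = pd.DataFrame({"key": [0, 0, 0, 1, 1, 1], "var1": list("abcdef")})
result = order_preserving_left_join(left, right, on="key")
```

The extra sort is the performance cost mentioned earlier in the thread, which is why it would only make sense when pandas compatibility is enabled.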

shwina · 2023-10-24T14:40:34Z

@mroeschke -- by default, sort=False in the .join() method.

Is it an implementation detail of Pandas that when sort=False, ordering of keys is preserved? Or is it a guarantee? (I couldn't find where in the docs this is defined).

cc: @wence-

mroeschke · 2023-10-24T15:35:12Z

join uses merge under the hood, so the key ordering follows what's described in the how section here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
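
As a small illustration of the pandas behavior referred to here (pandas only, with made-up data): a left merge with sort=False keeps the key order of the left frame, while sort=True orders the result by the join keys.

```python
import pandas as pd

# Illustrative data; the left frame's keys are deliberately unsorted.
left = pd.DataFrame({"key": [2, 1, 2], "l": ["x", "y", "z"]})
right = pd.DataFrame({"key": [1, 2], "r": ["a", "b"]})

# how="left", sort=False: key order of the left frame is preserved.
left_order = left.merge(right, on="key", how="left", sort=False)

# sort=True: the result is ordered by the join key instead.
sorted_order = left.merge(right, on="key", how="left", sort=True)
```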

If we pass sort=True to merges, we are on the hook to sort the result with respect to the key columns. If those key columns have repeated values, there is still some room for ambiguity. Currently, we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees.

When sort=False, pandas has join-type dependent ordering guarantees, which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False, this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order.

While we're here, switch from argsort + gather to sort_by_key when sorting results.

- Closes rapidsai#14001
- Closes rapidsai/xdf#385

Authors:
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Bradley Dice (https://github.com/bdice)

URL: #14428
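
To make "reorder the gather maps according to the order of the input keys" concrete, here is a rough NumPy sketch. It is not the code from #14428: the function name `reorder_gather_maps` is hypothetical, it handles only the simple case where every row has a match (no sentinel values for unmatched rows), and it assumes the desired order is "left row order first, then right row order".

```python
import numpy as np


def reorder_gather_maps(left_map, right_map):
    """Reorder a pair of join gather maps into input-row order.

    Hypothetical helper: sorts by left row index, breaking ties
    by right row index.
    """
    left_map = np.asarray(left_map)
    right_map = np.asarray(right_map)
    # np.lexsort uses the LAST key as the primary sort key.
    order = np.lexsort((right_map, left_map))
    return left_map[order], right_map[order]


# Gather maps mimicking the reproducer: for left row 0 the join returned
# right rows in the order 0, 2, 1 (i.e. a, c, b).
left_map = np.array([0, 0, 0, 1, 1, 1])
right_map = np.array([0, 2, 1, 3, 4, 5])
new_left, new_right = reorder_gather_maps(left_map, right_map)
```

Gathering the right-hand rows with `new_right` would then yield a, b, c, d, e, f, matching the pandas output shown in the issue description.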

wence- · 2024-09-30T11:33:14Z

Cleaning up assigned issues, this is now fixed in 24.12, sorry for the delay!

mroeschke added the bug, Needs Triage, and Python labels and removed the Needs Triage label Aug 29, 2023

mroeschke changed the title ~~[BUG] DataFrame.join(Series) does not maintain order of Series values~~ [BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible Aug 29, 2023

galipremsagar self-assigned this Sep 1, 2023

galipremsagar assigned wence- and unassigned galipremsagar Oct 24, 2023

wence- mentioned this issue Nov 16, 2023

Match pandas join ordering obligations in pandas-compatible mode #14428

Merged

3 tasks

galipremsagar mentioned this issue Nov 28, 2023

[FEA] Test known differences between cudf and pandas #14519

Open

vyasr mentioned this issue May 10, 2024

[DISCUSSION] cuDF vs Pandas output order behavior #1781

Closed

wence- closed this as completed Sep 30, 2024
