[DISCUSSION] cuDF vs Pandas output order behavior #1781

beckernick · 2019-05-17T18:48:19Z

Migrating discussion from #1776.

Many users may expect merges/joins to maintain the row ordering of the original dataframe.

It is the nature of most parallel algorithms that the order of the outputs will not be deterministic unless you pay some cost to make them deterministic (e.g., sorting the outputs).

This has come up a few times, so I think it points to something we'll want to make sure we add to the documentation of cuDF as a whole that by default you shouldn't rely on any particular output order. If you need a particular order, then the onus is on the user to sort the data.

If the data is originally sorted by a column, we can of course just sort by that column (assuming we aren't concerned with maintaining the order of merged values for that key).

However, sorting after the fact may not be a panacea. Sorting afterward based on the original order is messy because we may not have the same number of records depending on the nature of the merge.

Additionally, if the merge is on "unsorted" data with ordering that we want to preserve, it's also less clear what to do. We don't have df.reindex functionality (#1779), but even if we did we'd end up passing too few or too many indices to reindex for the ordering if the merged dataframe has more or fewer rows.

Is the best solution simply to document for users that we can't preserve order and that sorting afterwards won't solve the problem unless the data was pre sorted by a specific column?

cc @randerzander @jrhemstad

The text was updated successfully, but these errors were encountered:

jrhemstad · 2019-05-17T19:00:04Z

This is an issue larger than just merge. It applies to many operations: groupby, drop duplicates, etc. (As such, I modified the title.)

We have to decide as a whole if cuDF is going to have the exact same ordering semantics as Pandas across all operations.

I'm uneasy about trying to guarantee that we'll always match the order one would expect from Pandas by default. Pandas is an inherently sequential library, and as such, the order will always be the same. cuDF is a GPU accelerated, parallel library that allows people to solve the same problems they do with Pandas.

It is one of the first lessons in parallel computing that when you start doing things in parallel, you go faster, but your output order is no longer deterministic (for most things). If you want a specific order, then you have to incur a cost to achieve that. Rarely does this limit the problems you can solve, you just change how you think a little bit. It doesn't seem unreasonable to ask our users to learn this lesson via quality documentation and training.

To be clear, we can surely add an option to all of these APIs to preserve the order Pandas would provide. However, I do not believe this should be the default as it will cause noticeable performance degradation.

It boils down to which is more important to the goal of cuDF: being as fast as possible? or being as close to Pandas as possible?

kkraus14 · 2019-05-17T19:02:52Z

I'm going to go on the record here and say that I am in strong opposition to forcing libcudf to return things in a specific order. That being said, we should strive to pass the index column(s) to operations so that even if the order returned is non-deterministic, the index column(s) can be used to sort the resulting dataframe to get a deterministic ordering if needed / desired.

beckernick · 2019-05-17T19:13:00Z

Using outer join as an example, @kkraus14 what is your opinion on the right behavior if the resultant dataframe has more rows than the original dataframe. What would you imagine the right expected behavior to be when sorting a dataframe containing 19 indices (for example) by the 15 original indices.

harrism · 2019-05-17T21:09:44Z

Anecdotally, when CUDA was new, we had to educate users about the non-associativity of floating point arithmetic and how since parallel operations (like reductions) can't guarantee the same order of operations as sequential operations, users couldn't expect bit-exact equivalence between their sequential and parallel results.

I believe the same type of education is necessary here. Note that the floating point difference applies to cuDF vs. Pandas as well...

see rapidsai#1781

trivialfis · 2021-08-16T11:40:56Z

Just ran into this issue. I work around it by caching the result as parquet files.

For details, I'm trying to profile XGBoost, but non-deterministic data leads to non-deterministic models so I can't compare 2 profiling results, and it took me some time to actually figure this out. The behavior is unexpected and I have spent most of the time verifying other places.

trivialfis · 2021-08-16T11:51:36Z

Thanks for looking into and discussing this though, I was able to find this issue quickly once I realized the possibility.

This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481. Partly addresses #5505, #1781. Resolves #4511. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) URL: #11853

vyasr · 2024-05-10T22:14:06Z

This issue contains a lot of great historical context, but I think it is no longer strictly relevant. Based on current trends, our plan is going to be the following:

libcudf will continue to not make any ordering guarantees
cuDF Python will expose options to turn on and off ordering guarantees, likely on a per-algorithm (or per-algorithm-group, TBD) basis.
When pandas compatibility mode (pandas_compatible) is enabled, cudf will automatically turn on all such guarantees, possibly at the expense of performance.
cudf.pandas will always be run with the pandas_compatible option turned on.

Support for stable ordering in specific algorithms will be tracked in algorithm-specific issues like #14001.

jrhemstad changed the title ~~[DISCUSSION] Preserving row order in merges (SQL style joins)~~ [DISCUSSION] cuDF vs Pandas output order behavior May 17, 2019

ayushdg mentioned this issue May 20, 2019

[BUG] Merging on index gives incorrect index in result #1806

Closed

madsbk added a commit to madsbk/cudf that referenced this issue Jun 3, 2019

testing: the row order after a merge can differ in cuDF and Pandas

6d5122e

see rapidsai#1781

kkraus14 mentioned this issue Jul 3, 2019

[BUG] Merges (SQL style joins) don't maintain original row ordering when key columns are different #1776

Closed

kkraus14 added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code labels Jul 3, 2019

kkraus14 mentioned this issue Jul 15, 2019

[BUG] left join style merge reorders resultant dataframe #2259

Closed

jrhemstad mentioned this issue Mar 18, 2020

[BUG] Non-determinism in cudf.MultiIndex.from_product #4578

Closed

jrhemstad mentioned this issue Apr 23, 2020

[FEA] preserve_order for apply_rows and merge #4997

Closed

philtrade mentioned this issue May 26, 2020

[BUG] cudf.DataFrame.drop_duplicates() alters original row order, pandas doesn't. #5286

Closed

jrhemstad mentioned this issue Jun 15, 2020

[BUG] Python Merge docs indicate key order is maintained #5476

Closed

jrhemstad mentioned this issue Jun 23, 2020

[FEA] unique() doesn't guarantee ordering as values are seen and doesn't include None #5554

Closed

vyasr mentioned this issue Oct 17, 2022

Initial draft of policies and guidelines for libcudf usage. #11853

Merged

3 tasks

vyasr mentioned this issue Nov 23, 2022

[FEA] Add a new cuDF option stable_sort that provides ordering guarantees for otherwise nondeterministic APIs #12236

Closed

vyasr mentioned this issue May 1, 2023

[BUG] cuDF inner join and iloc seemingly broken #13254

Closed

bdice mentioned this issue Aug 29, 2023

[BUG] DataFrame.join(Series) does not maintain order of Series values in mode.pandas_compatible #14001

Closed

vyasr closed this as completed May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] cuDF vs Pandas output order behavior #1781

[DISCUSSION] cuDF vs Pandas output order behavior #1781

beckernick commented May 17, 2019 •

edited

Loading

jrhemstad commented May 17, 2019 •

edited

Loading

kkraus14 commented May 17, 2019

beckernick commented May 17, 2019 •

edited

Loading

harrism commented May 17, 2019

trivialfis commented Aug 16, 2021 •

edited

Loading

trivialfis commented Aug 16, 2021 •

edited

Loading

vyasr commented May 10, 2024

[DISCUSSION] cuDF vs Pandas output order behavior #1781

[DISCUSSION] cuDF vs Pandas output order behavior #1781

Comments

beckernick commented May 17, 2019 • edited Loading

jrhemstad commented May 17, 2019 • edited Loading

kkraus14 commented May 17, 2019

beckernick commented May 17, 2019 • edited Loading

harrism commented May 17, 2019

trivialfis commented Aug 16, 2021 • edited Loading

trivialfis commented Aug 16, 2021 • edited Loading

vyasr commented May 10, 2024

beckernick commented May 17, 2019 •

edited

Loading

jrhemstad commented May 17, 2019 •

edited

Loading

beckernick commented May 17, 2019 •

edited

Loading

trivialfis commented Aug 16, 2021 •

edited

Loading

trivialfis commented Aug 16, 2021 •

edited

Loading