Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSSION] cuDF vs Pandas output order behavior #1781

Closed
beckernick opened this issue May 17, 2019 · 7 comments
Closed

[DISCUSSION] cuDF vs Pandas output order behavior #1781

beckernick opened this issue May 17, 2019 · 7 comments
Labels
libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented May 17, 2019

Migrating discussion from #1776.

Many users may expect merges/joins to maintain the row ordering of the original dataframe.

As @jrhemstad has noted:

It is the nature of most parallel algorithms that the order of the outputs will not be deterministic unless you pay some cost to make them deterministic (e.g., sorting the outputs).

This has come up a few times, so I think it points to something we'll want to make sure we add to the documentation of cuDF as a whole that by default you shouldn't rely on any particular output order. If you need a particular order, then the onus is on the user to sort the data.

If the data is originally sorted by a column, we can of course just sort by that column (assuming we aren't concerned with maintaining the order of merged values for that key).

However, sorting after the fact may not be a panacea. Sorting afterward based on the original order is messy because we may not have the same number of records depending on the nature of the merge.

Additionally, if the merge is on "unsorted" data with ordering that we want to preserve, it's also less clear what to do. We don't have df.reindex functionality (#1779), but even if we did we'd end up passing too few or too many indices to reindex for the ordering if the merged dataframe has more or fewer rows.

Is the best solution simply to document for users that we can't preserve order and that sorting afterwards won't solve the problem unless the data was pre sorted by a specific column?

cc @randerzander @jrhemstad

@jrhemstad
Copy link
Contributor

jrhemstad commented May 17, 2019

This is an issue larger than just merge. It applies to many operations: groupby, drop duplicates, etc. (As such, I modified the title.)

We have to decide as a whole if cuDF is going to have the exact same ordering semantics as Pandas across all operations.

I'm uneasy about trying to guarantee that we'll always match the order one would expect from Pandas by default. Pandas is an inherently sequential library, and as such, the order will always be the same. cuDF is a GPU accelerated, parallel library that allows people to solve the same problems they do with Pandas.

It is one of the first lessons in parallel computing that when you start doing things in parallel, you go faster, but your output order is no longer deterministic (for most things). If you want a specific order, then you have to incur a cost to achieve that. Rarely does this limit the problems you can solve, you just change how you think a little bit. It doesn't seem unreasonable to ask our users to learn this lesson via quality documentation and training.

To be clear, we can surely add an option to all of these APIs to preserve the order Pandas would provide. However, I do not believe this should be the default as it will cause noticeable performance degradation.

It boils down to which is more important to the goal of cuDF: being as fast as possible? or being as close to Pandas as possible?

@jrhemstad jrhemstad changed the title [DISCUSSION] Preserving row order in merges (SQL style joins) [DISCUSSION] cuDF vs Pandas output order behavior May 17, 2019
@kkraus14
Copy link
Collaborator

I'm going to go on the record here and say that I am in strong opposition to forcing libcudf to return things in a specific order. That being said, we should strive to pass the index column(s) to operations so that even if the order returned is non-deterministic, the index column(s) can be used to sort the resulting dataframe to get a deterministic ordering if needed / desired.

@beckernick
Copy link
Member Author

beckernick commented May 17, 2019

Using outer join as an example, @kkraus14 what is your opinion on the right behavior if the resultant dataframe has more rows than the original dataframe. What would you imagine the right expected behavior to be when sorting a dataframe containing 19 indices (for example) by the 15 original indices.

@harrism
Copy link
Member

harrism commented May 17, 2019

Anecdotally, when CUDA was new, we had to educate users about the non-associativity of floating point arithmetic and how since parallel operations (like reductions) can't guarantee the same order of operations as sequential operations, users couldn't expect bit-exact equivalence between their sequential and parallel results.

I believe the same type of education is necessary here. Note that the floating point difference applies to cuDF vs. Pandas as well...

@trivialfis
Copy link
Member

trivialfis commented Aug 16, 2021

Just ran into this issue. I work around it by caching the result as parquet files.

For details, I'm trying to profile XGBoost, but non-deterministic data leads to non-deterministic models so I can't compare 2 profiling results, and it took me some time to actually figure this out. The behavior is unexpected and I have spent most of the time verifying other places.

@trivialfis
Copy link
Member

trivialfis commented Aug 16, 2021

Thanks for looking into and discussing this though, I was able to find this issue quickly once I realized the possibility.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2022
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481.

Partly addresses #5505, #1781.

Resolves #4511.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #11853
@vyasr
Copy link
Contributor

vyasr commented May 10, 2024

This issue contains a lot of great historical context, but I think it is no longer strictly relevant. Based on current trends, our plan is going to be the following:

  • libcudf will continue to not make any ordering guarantees
  • cuDF Python will expose options to turn on and off ordering guarantees, likely on a per-algorithm (or per-algorithm-group, TBD) basis.
  • When pandas compatibility mode (pandas_compatible) is enabled, cudf will automatically turn on all such guarantees, possibly at the expense of performance.
  • cudf.pandas will always be run with the pandas_compatible option turned on.

Support for stable ordering in specific algorithms will be tracked in algorithm-specific issues like #14001.

@vyasr vyasr closed this as completed May 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code Python Affects Python cuDF API.
Projects
None yet
Development

No branches or pull requests

6 participants