-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSSION] cuDF vs Pandas output order behavior #1781
Comments
This is an issue larger than just We have to decide as a whole if cuDF is going to have the exact same ordering semantics as Pandas across all operations. I'm uneasy about trying to guarantee that we'll always match the order one would expect from Pandas by default. Pandas is an inherently sequential library, and as such, the order will always be the same. cuDF is a GPU accelerated, parallel library that allows people to solve the same problems they do with Pandas. It is one of the first lessons in parallel computing that when you start doing things in parallel, you go faster, but your output order is no longer deterministic (for most things). If you want a specific order, then you have to incur a cost to achieve that. Rarely does this limit the problems you can solve, you just change how you think a little bit. It doesn't seem unreasonable to ask our users to learn this lesson via quality documentation and training. To be clear, we can surely add an option to all of these APIs to preserve the order Pandas would provide. However, I do not believe this should be the default as it will cause noticeable performance degradation. It boils down to which is more important to the goal of cuDF: being as fast as possible? or being as close to Pandas as possible? |
I'm going to go on the record here and say that I am in strong opposition to forcing libcudf to return things in a specific order. That being said, we should strive to pass the index column(s) to operations so that even if the order returned is non-deterministic, the index column(s) can be used to sort the resulting dataframe to get a deterministic ordering if needed / desired. |
Using outer join as an example, @kkraus14 what is your opinion on the right behavior if the resultant dataframe has more rows than the original dataframe. What would you imagine the right expected behavior to be when sorting a dataframe containing 19 indices (for example) by the 15 original indices. |
Anecdotally, when CUDA was new, we had to educate users about the non-associativity of floating point arithmetic and how since parallel operations (like reductions) can't guarantee the same order of operations as sequential operations, users couldn't expect bit-exact equivalence between their sequential and parallel results. I believe the same type of education is necessary here. Note that the floating point difference applies to cuDF vs. Pandas as well... |
Just ran into this issue. I work around it by caching the result as parquet files. For details, I'm trying to profile XGBoost, but non-deterministic data leads to non-deterministic models so I can't compare 2 profiling results, and it took me some time to actually figure this out. The behavior is unexpected and I have spent most of the time verifying other places. |
Thanks for looking into and discussing this though, I was able to find this issue quickly once I realized the possibility. |
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481. Partly addresses #5505, #1781. Resolves #4511. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) URL: #11853
This issue contains a lot of great historical context, but I think it is no longer strictly relevant. Based on current trends, our plan is going to be the following:
Support for stable ordering in specific algorithms will be tracked in algorithm-specific issues like #14001. |
Migrating discussion from #1776.
Many users may expect merges/joins to maintain the row ordering of the original dataframe.
As @jrhemstad has noted:
If the data is originally sorted by a column, we can of course just sort by that column (assuming we aren't concerned with maintaining the order of merged values for that key).
However, sorting after the fact may not be a panacea. Sorting afterward based on the original order is messy because we may not have the same number of records depending on the nature of the merge.
Additionally, if the merge is on "unsorted" data with ordering that we want to preserve, it's also less clear what to do. We don't have
df.reindex
functionality (#1779), but even if we did we'd end up passing too few or too many indices to reindex for the ordering if the merged dataframe has more or fewer rows.Is the best solution simply to document for users that we can't preserve order and that sorting afterwards won't solve the problem unless the data was pre sorted by a specific column?
cc @randerzander @jrhemstad
The text was updated successfully, but these errors were encountered: