[BUG] Merging on index gives incorrect index in result #1806
Comments
Might be somewhat related to the discussion in #1781 about how cudf handles non-deterministic ordering and indices. |
It looks like it is just the ordering problem, coupled with a type promotion that cudf didn't do. My vote is that this is not a bug. I'm not sure we'll ever guarantee the same order as pandas. |
The ordering may not match pandas, but the indices that cudf assigns to its new ordering are incorrect. If the row order changes, the indices of the result should follow that order:
|
The indices are the implicit ordering of the data, we can't construct an index without an order. When sort based groupby is implemented we can likely match pandas exactly. Depends on #1342 |
Sorry for the confusion. This is w.r.t. the code that assigns the index to be a column as well, which gets passed to the cpp layer during the merge. When the ordering of the columns changes, it should also affect the ordering of this column (lhs[on]/rhs[on]), which was originally the index of my df. So if, after getting the output, we set the output index to match the lhs[on] column, the results should be consistent. I'm sorry for being relatively vague; I'll debug and play around with my example a bit more to check whether my theory is correct. |
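To make the idea above concrete, here is a minimal sketch in plain pandas (not cudf, and not the original reproducer). The frames, the index name `key`, and the shuffle that stands in for a backend returning rows in arbitrary order are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frames; the index is named "key" only for this example.
left = pd.DataFrame({"a": [1, 2, 3]}, index=pd.Index([10, 20, 30], name="key"))
right = pd.DataFrame({"b": [7, 8, 9]}, index=pd.Index([30, 10, 20], name="key"))

# Move the index into an ordinary column so it takes part in the merge.
lhs = left.reset_index()
rhs = right.reset_index()
merged = lhs.merge(rhs, on="key", how="inner")

# Stand-in for a join that returns rows in an arbitrary order.
shuffled = merged.sample(frac=1, random_state=0)

# Rebuild the index from the key column that travelled with each row.
result = shuffled.set_index("key")
```

Because the key column is reordered together with the data, every label in `result` still points at the values it had in `left` and `right`, whatever order the rows come back in.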
Hmm I was under the impression that merge, join, groupby, etc all use the same hash based join method, which discards ordering as a result. I think you're right, though, that this is a slightly different issue. I suspect that our "index" object is being completely dropped from the merge, ignored, and then added back to the result of the merge. If the index was treated as another column during the merge, the index data should persist as you suggested in |
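A toy illustration of the suspicion above, in pure Python. This is not cudf's implementation; `hash_join` and all of the data are hypothetical. It contrasts an index that is dropped and re-attached by position with one that is carried through the join as a column.

```python
def hash_join(left_keys, left_vals, right_keys, right_vals):
    """Inner hash join; output rows follow the probe (right) side's order."""
    # Build phase: map each left key to the positions where it occurs.
    table = {}
    for pos, key in enumerate(left_keys):
        table.setdefault(key, []).append(pos)
    # Probe phase: emit the key alongside the joined values.
    out_keys, out_rows = [], []
    for rpos, key in enumerate(right_keys):
        for lpos in table.get(key, []):
            out_keys.append(key)
            out_rows.append((left_vals[lpos], right_vals[rpos]))
    return out_keys, out_rows


left_index, left_vals = [10, 20, 30], ["a", "b", "c"]
right_index, right_vals = [30, 10, 20], ["x", "y", "z"]

carried, rows = hash_join(left_index, left_vals, right_index, right_vals)

# Index treated as a column: labels follow their rows.
by_carried = dict(zip(carried, rows))

# Index dropped, then re-attached positionally: labels no longer match rows.
by_reattached = dict(zip(left_index, rows))
```

Here `by_carried[10]` is `("a", "y")`, the row that really belongs to key 10, while `by_reattached[10]` is `("c", "x")`, which is the kind of mismatch described in this issue.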
This is something that can probably be fixed in cudf, in fact. |
Like this lol
Thanks for setting me straight :) |
Describe the bug
When merging two dataframes on their index, the index of the result is not consistent with pandas.
Steps/Code to reproduce bug
Output:
Expected behavior
Output:
Environment details
On 0.8 at commit a66db9e
Additional context
Add any other context about the problem here.