-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] loc-based lookup where index has repeated labels does not (always) match pandas order #13658
Closed
Tracked by
#12793
Labels
2 - In Progress
Currently a work in progress
bug
Something isn't working
Python
Affects Python cuDF API.
Milestone
Comments
wence-
added
bug
Something isn't working
Needs Triage
Need team to review and classify
labels
Jul 4, 2023
wence-
added
0 - Backlog
In queue waiting for assignment
Python
Affects Python cuDF API.
and removed
Needs Triage
Need team to review and classify
labels
Jul 4, 2023
wence-
changed the title
[BUG] loc-based lookup in Series where index has repeated labels does not (always) match pandas order
[BUG] loc-based lookup in where index has repeated labels does not (always) match pandas order
Jul 4, 2023
wence-
added a commit
to wence-/cudf
that referenced
this issue
Jul 4, 2023
If the index has duplicate labels then it is not sufficient to sort with the ordering induced by gathering the requested keys since the duplicate keys arrive in non-deterministic order from the join in the table that is being looked up from. To fix this, sort additionally by a secondary key that is the ordering induced by gathering from the donor table. - Closes rapidsai#13658
wence-
added a commit
to wence-/cudf
that referenced
this issue
Jul 4, 2023
If the index has duplicate labels then it is not sufficient to sort with the ordering induced by gathering the requested keys since the duplicate keys arrive in non-deterministic order from the join in the table that is being looked up from. To fix this, sort additionally by a secondary key that is the ordering induced by gathering from the donor table. - Closes rapidsai#13658
3 tasks
wence-
changed the title
[BUG] loc-based lookup in where index has repeated labels does not (always) match pandas order
[BUG] loc-based lookup where index has repeated labels does not (always) match pandas order
Jul 4, 2023
wence-
added
2 - In Progress
Currently a work in progress
and removed
0 - Backlog
In queue waiting for assignment
labels
Jul 5, 2023
wence-
added a commit
to wence-/cudf
that referenced
this issue
Jul 6, 2023
If the index has duplicate labels then it is not sufficient to sort with the ordering induced by gathering the requested keys since the duplicate keys arrive in non-deterministic order from the join in the table that is being looked up from. To fix this, sort additionally by a secondary key that is the ordering induced by gathering from the donor table. - Closes rapidsai#13658
rapids-bot bot
pushed a commit
that referenced
this issue
Jul 12, 2023
If the index has duplicate labels then it is not sufficient to sort with the ordering induced by gathering the requested keys since the duplicate keys arrive in non-deterministic order from the join in the table that is being looked up from. To fix this, sort additionally by a secondary key that is the ordering induced by gathering from the donor table. - Closes #13658 Authors: - Lawrence Mitchell (https://github.com/wence-) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13659
3 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2 - In Progress
Currently a work in progress
bug
Something isn't working
Python
Affects Python cuDF API.
Describe the bug
This occurs both for series and dataframes, though the exact mechanism by which it does is slightly different.
Context: an index is allowed to have repeated labels, and loc-based lookup should preserve the order of the rows it obtains when gathering using these repeated labels.
How does it work for Series?
The labels are first mapped to positional locations with a left join, and the resulting (right) gather map is sorted by a gathered
arange
column that is gathered using the left gather map. However, if the index contains repeated values, then the left gather map will have repeated entries, and so sorting will either non-deterministically flip the order (if the sort is not stable), or keep the existing order. So then we are relying on the join code correctly preserving the order (but it doesn't do that).How does it work for DataFrame?
The labels are looked up with an inner join, carrying along an extra column that is just an
arange
. The resulting dataframe is sorted by this extra column. So the sorting has the same potential for non-determinism, but since the inner join code path is different it doesn't exhibit under the same circumstances.[Aside, using an inner join means that if you want to raise
KeyError
for missing keys, you need a subsequent left_anti join to find the missing keys, but that is tracked by #13379].Steps/Code to reproduce bug
Series example
Worse (perhaps), in some circumstances, successive return values of the same call do not return the same order (a consequence of the lack of stable sorting I suppose).
Dataframe example
Expected behavior
If there's an ordering guarantee in place, it should work. This can be obtained by sorting not just on the ordering induced by the right gather map, but also the left one:
And, for dataframes:
The text was updated successfully, but these errors were encountered: