Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] loc-based lookup where index has repeated labels does not (always) match pandas order #13658

Closed
Tracked by #12793
wence- opened this issue Jul 4, 2023 · 0 comments · Fixed by #13659 · May be fixed by #13717
Closed
Tracked by #12793

[BUG] loc-based lookup where index has repeated labels does not (always) match pandas order #13658

wence- opened this issue Jul 4, 2023 · 0 comments · Fixed by #13659 · May be fixed by #13717
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Jul 4, 2023

Describe the bug

This occurs both for series and dataframes, though the exact mechanism by which it does is slightly different.

Context: an index is allowed to have repeated labels, and loc-based lookup should preserve the order of the rows it obtains when gathering using these repeated labels.

How does it work for Series?

The labels are first mapped to positional locations with a left join, and the resulting (right) gather map is sorted by a gathered arange column that is gathered using the left gather map. However, if the index contains repeated values, then the left gather map will have repeated entries, and so sorting will either non-deterministically flip the order (if the sort is not stable), or keep the existing order. So then we are relying on the join code correctly preserving the order (but it doesn't do that).

How does it work for DataFrame?

The labels are looked up with an inner join, carrying along an extra column that is just an arange. The resulting dataframe is sorted by this extra column. So the sorting has the same potential for non-determinism, but since the inner join code path is different it doesn't exhibit under the same circumstances.

[Aside, using an inner join means that if you want to raise KeyError for missing keys, you need a subsequent left_anti join to find the missing keys, but that is tracked by #13379].

Steps/Code to reproduce bug

Series example

import cudf
series = cudf.Series([0, 1, 2, 3, 4], index=[0, 1, 2, 1, 4])
subset = series.loc[[1]]
print(subset)
# 1    3
# 1    1
# dtype: int64
series.to_pandas().loc[[1]]
# 1    1
# 1    3
# dtype: int64

Worse (perhaps), in some circumstances, successive return values of the same call do not return the same order (a consequence of the lack of stable sorting I suppose).

In [56]: values = range(128)

In [57]: index = list(values)

In [58]: index[0] = 1

In [59]: series = cudf.Series(values, index=index)

In [60]: series.loc[[1]]
Out[60]: 
1    1
1    0
dtype: int64

In [61]: series.loc[[1]]
Out[61]: 
1    0
1    1
dtype: int64

Dataframe example

import cudf
values = range(2048)
index = list(values)
index[2041] = 1
df = cudf.DataFrame({"a": values}, index=index)
df.loc[[1]]
#       a
# 1  2041
# 1     1

df.to_pandas().loc[[1]]
#       a
# 1     1
# 1  2041

Expected behavior

If there's an ordering guarantee in place, it should work. This can be obtained by sorting not just on the ordering induced by the right gather map, but also the left one:

diff --git a/python/cudf/cudf/core/indexed_frame.py b/python/cudf/cudf/core/indexed_frame.py
index eba5fada7d..b9b6091528 100644
--- a/python/cudf/cudf/core/indexed_frame.py
+++ b/python/cudf/cudf/core/indexed_frame.py
@@ -190,7 +190,7 @@ def _indices_from_labels(obj, labels):
     rhs = cudf.DataFrame(
         {"_": cudf.core.column.arange(len(obj))}, index=obj.index
     )
-    return lhs.join(rhs).sort_values("__")["_"]
+    return lhs.join(rhs).sort_values(by=["__", "_"])["_"]
 
 
 def _get_label_range_or_mask(index, start, stop, step):

And, for dataframes:

diff --git a/python/cudf/cudf/core/dataframe.py b/python/cudf/cudf/core/dataframe.py
index d1c807279d..9ca24bdf8b 100644
--- a/python/cudf/cudf/core/dataframe.py
+++ b/python/cudf/cudf/core/dataframe.py
@@ -291,12 +291,14 @@ class _DataFrameLocIndexer(_DataFrameIndexer):
                         {tmp_col_name: column.arange(len(tmp_arg[0]))},
                         index=as_index(tmp_arg[0]),
                     )
+                    cantor_name = "_" + "_".join(columns_df.columns)
+                    columns_df[cantor_name] = column.arange(len(columns_df))
                     df = other_df.join(columns_df, how="inner")
                     # as join is not assigning any names to index,
                     # update it over here
                     df.index.name = columns_df.index.name
-                    df = df.sort_values(tmp_col_name)
-                    df.drop(columns=[tmp_col_name], inplace=True)
+                    df = df.sort_values(by=[tmp_col_name, cantor_name])
+                    df.drop(columns=[tmp_col_name, cantor_name], inplace=True)
                     # There were no indices found
                     if len(df) == 0:
                         raise KeyError(arg)
@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Jul 4, 2023
@wence- wence- self-assigned this Jul 4, 2023
@wence- wence- added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jul 4, 2023
@wence- wence- changed the title [BUG] loc-based lookup in Series where index has repeated labels does not (always) match pandas order [BUG] loc-based lookup in where index has repeated labels does not (always) match pandas order Jul 4, 2023
wence- added a commit to wence-/cudf that referenced this issue Jul 4, 2023
If the index has duplicate labels then it is not sufficient to sort
with the ordering induced by gathering the requested keys since the
duplicate keys arrive in non-deterministic order from the join in the
table that is being looked up from. To fix this, sort additionally by
a secondary key that is the ordering induced by gathering from the
donor table.

- Closes rapidsai#13658
wence- added a commit to wence-/cudf that referenced this issue Jul 4, 2023
If the index has duplicate labels then it is not sufficient to sort
with the ordering induced by gathering the requested keys since the
duplicate keys arrive in non-deterministic order from the join in the
table that is being looked up from. To fix this, sort additionally by
a secondary key that is the ordering induced by gathering from the
donor table.

- Closes rapidsai#13658
@wence- wence- changed the title [BUG] loc-based lookup in where index has repeated labels does not (always) match pandas order [BUG] loc-based lookup where index has repeated labels does not (always) match pandas order Jul 4, 2023
@wence- wence- added 2 - In Progress Currently a work in progress and removed 0 - Backlog In queue waiting for assignment labels Jul 5, 2023
wence- added a commit to wence-/cudf that referenced this issue Jul 6, 2023
If the index has duplicate labels then it is not sufficient to sort
with the ordering induced by gathering the requested keys since the
duplicate keys arrive in non-deterministic order from the join in the
table that is being looked up from. To fix this, sort additionally by
a secondary key that is the ordering induced by gathering from the
donor table.

- Closes rapidsai#13658
rapids-bot bot pushed a commit that referenced this issue Jul 12, 2023
If the index has duplicate labels then it is not sufficient to sort with the ordering induced by gathering the requested keys since the duplicate keys arrive in non-deterministic order from the join in the table that is being looked up from. To fix this, sort additionally by a secondary key that is the ordering induced by gathering from the donor table.

- Closes #13658

Authors:
  - Lawrence Mitchell (https://github.com/wence-)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #13659
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
1 participant