Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] loc-based indexing with slice ranges inconsistent with pandas #12833

Closed
Tracked by #12793
wence- opened this issue Feb 23, 2023 · 0 comments · Fixed by #13625
Closed
Tracked by #12793

[BUG] loc-based indexing with slice ranges inconsistent with pandas #12833

wence- opened this issue Feb 23, 2023 · 0 comments · Fixed by #13625
Assignees
Labels
bug Something isn't working improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Feb 23, 2023

Describe the bug

When indexing a frame with .loc one can provide a slice range where the endpoints of the slice are not contained in the index. If the index is not sorted then both pandas and cudf agree that you receive an error.

import cudf

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 0, 4])

df.loc[1:5] # => ValueError
df.to_pandas().loc[1:5] # => KeyError

Probably we should fix the distinction in the exception that is raised.

However, when the index is sorted, behaviour starts to differ:

When the index is monotone increasing (repeated values are allowed), and the slice has a positive step, then things work

import cudf
df = cudf.DataFrame({"a": [1, 2, 3]}, index=[0, 4, 7])

df.loc[1:5] == df.to_pandas().loc[1:5]

If the slice has negative stride, then cudf produces the wrong answer:

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[0, 4, 7])
df.loc[5:0:-1] 
#   a
# 7  3
df.to_pandas().loc[5:0:-1]
#    a
# 4  2
# 0  1

If the index is monotone decreasing, and the slice has positive step, then the behaviour matches if both slice endpoints are in the index, and the behaviour differs if at least one of the slice end points is not in the index, in these circumstances, pandas always returns an empty dataframe, and cudf always produces a ValueError.

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:4] == df.to_pandas().loc[0:4] # empty dataframe
df.loc[1:4] # => ValueError
df.to_pandas().loc[1:4] # empty data frame

If the slice has negative step, then the behaviour also differs, but differently. If both slice endpoints are in the index, then cudf returns an empty dataframe, whereas pandas returns the (reordered) dataframe sliced by the endpoints

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:4:-1] # empty dataframe
df.to_pandas().loc[0:4:-1]
#    a
# 0  3
# 4  2

If at least one of the slice endpoints is not in the index, then cudf produces a ValueError and pandas behaves as for the monotone increasing case.

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:5:-1] # => ValueError
df.to_pandas().loc[0:5:-1]
#    a
# 0  3
# 4  2

The above issues apply, mutatis mutandis to almost all index types in almost all cases (although I have not exhaustively checked). Amusingly the monotone increasing case with positive step slice with endpoint not in the index works for FloatXXXIndex.

Expected behavior

We should be consistent, both internally, and probably matching pandas where possible (raising errors otherwise).

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Feb 23, 2023
@wence- wence- self-assigned this Feb 23, 2023
@wence- wence- added improvement Improvement / enhancement to an existing function Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 2, 2023
wence- added a commit to wence-/cudf that referenced this issue Jun 27, 2023
When the indices are sorted, we need to a bit more work in
get_slice_bound to handle those cases. But having done so correctly we
can push the implementation of find_label_range to BaseIndex and just
specialise get_slice_bound in specific implementations.

- Closes rapidsai#12833
rapids-bot bot pushed a commit that referenced this issue Jun 30, 2023
…3625)

Much of the index-specific search code for label-based lookup only worked in the case where the index was sorted in ascending order and/or the requested slice had the same step sign as the index. To fix this, handle ascending and descending sorted indices specially, as well as refactoring to remove unused codepaths.

The core idea is to push `find_first` and `find_last` to being generic column methods (in doing so we remove the `closest` argument, but that was only used to produce pandas-incompatible index behaviour).

Lookup in indices then uses `get_slice_bound` (that can be specialised for index types) that uses first (if applicable) `searchsorted` and then `find_first/last`.

While we're here, since we now check the sortedness properties of Index objects, turn them into `@cached_property` (partially addressing the request of #13357).

- Closes #8693
- Closes #12833

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #13625
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant