[BUG] `loc`-based indexing with slice ranges inconsistent with pandas #12833

wence- · 2023-02-23T16:18:22Z

Describe the bug

When indexing a frame with .loc one can provide a slice range where the endpoints of the slice are not contained in the index. If the index is not sorted then both pandas and cudf agree that you receive an error.

import cudf

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 0, 4])

df.loc[1:5] # => ValueError
df.to_pandas().loc[1:5] # => KeyError

Probably we should fix the distinction in the exception that is raised.

However, when the index is sorted, behaviour starts to differ:

When the index is monotone increasing (repeated values are allowed), and the slice has a positive step, then things work

import cudf
df = cudf.DataFrame({"a": [1, 2, 3]}, index=[0, 4, 7])

df.loc[1:5] == df.to_pandas().loc[1:5]

If the slice has negative stride, then cudf produces the wrong answer:

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[0, 4, 7])
df.loc[5:0:-1] 
#   a
# 7  3
df.to_pandas().loc[5:0:-1]
#    a
# 4  2
# 0  1

If the index is monotone decreasing, and the slice has positive step, then the behaviour matches if both slice endpoints are in the index, and the behaviour differs if at least one of the slice end points is not in the index, in these circumstances, pandas always returns an empty dataframe, and cudf always produces a ValueError.

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:4] == df.to_pandas().loc[0:4] # empty dataframe
df.loc[1:4] # => ValueError
df.to_pandas().loc[1:4] # empty data frame

If the slice has negative step, then the behaviour also differs, but differently. If both slice endpoints are in the index, then cudf returns an empty dataframe, whereas pandas returns the (reordered) dataframe sliced by the endpoints

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:4:-1] # empty dataframe
df.to_pandas().loc[0:4:-1]
#    a
# 0  3
# 4  2

If at least one of the slice endpoints is not in the index, then cudf produces a ValueError and pandas behaves as for the monotone increasing case.

df = cudf.DataFrame({"a": [1, 2, 3]}, index=[7, 4, 0])
df.loc[0:5:-1] # => ValueError
df.to_pandas().loc[0:5:-1]
#    a
# 0  3
# 4  2

The above issues apply, mutatis mutandis to almost all index types in almost all cases (although I have not exhaustively checked). Amusingly the monotone increasing case with positive step slice with endpoint not in the index works for FloatXXXIndex.

Expected behavior

We should be consistent, both internally, and probably matching pandas where possible (raising errors otherwise).

The text was updated successfully, but these errors were encountered:

When the indices are sorted, we need to a bit more work in get_slice_bound to handle those cases. But having done so correctly we can push the implementation of find_label_range to BaseIndex and just specialise get_slice_bound in specific implementations. - Closes rapidsai#12833

…3625) Much of the index-specific search code for label-based lookup only worked in the case where the index was sorted in ascending order and/or the requested slice had the same step sign as the index. To fix this, handle ascending and descending sorted indices specially, as well as refactoring to remove unused codepaths. The core idea is to push `find_first` and `find_last` to being generic column methods (in doing so we remove the `closest` argument, but that was only used to produce pandas-incompatible index behaviour). Lookup in indices then uses `get_slice_bound` (that can be specialised for index types) that uses first (if applicable) `searchsorted` and then `find_first/last`. While we're here, since we now check the sortedness properties of Index objects, turn them into `@cached_property` (partially addressing the request of #13357). - Closes #8693 - Closes #12833 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #13625

wence- added bug Something isn't working Needs Triage Need team to review and classify labels Feb 23, 2023

wence- self-assigned this Feb 23, 2023

This was referenced Feb 23, 2023

[ENH]: Reworking of iloc and loc indexing #12793

Open

[BUG] can't merge dask-cudf dataframes when index is non-numeric? #12773

Closed

wence- added this to the Pandas API Alignment and Coverage milestone Feb 23, 2023

wence- added improvement Improvement / enhancement to an existing function Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 2, 2023

wence- mentioned this issue Jun 27, 2023

Refactor Index search to simplify code and increase correctness #13625

Merged

3 tasks

rapids-bot bot closed this as completed in #13625 Jun 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `loc`-based indexing with slice ranges inconsistent with pandas #12833

[BUG] `loc`-based indexing with slice ranges inconsistent with pandas #12833

wence- commented Feb 23, 2023

[BUG] loc-based indexing with slice ranges inconsistent with pandas #12833

[BUG] loc-based indexing with slice ranges inconsistent with pandas #12833

Comments

wence- commented Feb 23, 2023

[BUG] `loc`-based indexing with slice ranges inconsistent with pandas #12833

[BUG] `loc`-based indexing with slice ranges inconsistent with pandas #12833