Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Index search to simplify code and increase correctness #13625

Merged
merged 15 commits into from
Jun 30, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Jun 27, 2023

Description

Much of the index-specific search code for label-based lookup only worked in the case where the index was sorted in ascending order and/or the requested slice had the same step sign as the index. To fix this, handle ascending and descending sorted indices specially, as well as refactoring to remove unused codepaths.

The core idea is to push find_first and find_last to being generic column methods (in doing so we remove the closest argument, but that was only used to produce pandas-incompatible index behaviour).

Lookup in indices then uses get_slice_bound (that can be specialised for index types) that uses first (if applicable) searchsorted and then find_first/last.

While we're here, since we now check the sortedness properties of Index objects, turn them into @cached_property (partially addressing the request of #13357).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

wence- added 5 commits June 27, 2023 16:38
The generic ColumnBase class can implement find_first and find_last by
asking for the indices of a given value. It is this latter which must
be implemented in a subclass-specific way.

Also remove the closest argument and implementation of get_slice_bounds
which is an Index-specific function.
Indices are immutable, so this property can be cached.

We will use this to implement faster and more correct searches.
With specialised handling for RangeIndex
@wence- wence- added bug Something isn't working Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function labels Jun 27, 2023
@wence- wence- requested a review from a team as a code owner June 27, 2023 15:53
@wence- wence- requested review from bdice and galipremsagar June 27, 2023 15:53
@wence- wence- added the non-breaking Non-breaking change label Jun 27, 2023
@wence-
Copy link
Contributor Author

wence- commented Jun 27, 2023

This is partially prep work for the followon to #13534 that will implement loc-getitem with the same approach.

wence- added 2 commits June 27, 2023 16:56
When the indices are sorted, we need to a bit more work in
get_slice_bound to handle those cases. But having done so correctly we
can push the implementation of find_label_range to BaseIndex and just
specialise get_slice_bound in specific implementations.

- Closes rapidsai#12833
Now that we have indices_of available on index objects, use that when
looking up scalar values in an index for loc-based indexing.

- Closes rapidsai#8693
@wence- wence- force-pushed the wence/fea/indexes branch from 901b31c to c7b22dd Compare June 27, 2023 15:56
@wence- wence- force-pushed the wence/fea/indexes branch from c7b22dd to 19f4e11 Compare June 27, 2023 16:03
@wence- wence- removed the improvement Improvement / enhancement to an existing function label Jun 28, 2023
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@wence-
Copy link
Contributor Author

wence- commented Jun 30, 2023

/merge

@rapids-bot rapids-bot bot merged commit c872400 into rapidsai:branch-23.08 Jun 30, 2023
@wence- wence- deleted the wence/fea/indexes branch June 30, 2023 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
2 participants