Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor isin implementations #10165

Merged
merged 12 commits into from
Feb 18, 2022
Merged

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Jan 28, 2022

This PR fixes a number of error cases around the implementation of isin, particularly involving categorical dtypes and index alignment when called on a DataFrame. It also makes significant changes to simplify and improve the performance of DataFrame.isin, resulting in a 10-40% speedup when called with a DataFrame or Series as the argument (depending on the data sizes).

@vyasr vyasr added 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 28, 2022
@vyasr vyasr added this to the CuDF Python Refactoring milestone Jan 28, 2022
@vyasr vyasr self-assigned this Jan 28, 2022
@vyasr vyasr requested a review from a team as a code owner January 28, 2022 23:22
@vyasr vyasr requested review from cwharris and shwina January 28, 2022 23:22
@codecov
Copy link

codecov bot commented Jan 29, 2022

Codecov Report

Merging #10165 (f073c00) into branch-22.04 (a7d88cd) will increase coverage by 0.21%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10165      +/-   ##
================================================
+ Coverage         10.42%   10.64%   +0.21%     
================================================
  Files               119      122       +3     
  Lines             20603    20939     +336     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18711     +256     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec614ac...f073c00. Read the comment docs.

@brandon-b-miller
Copy link
Contributor

thanks for this very sorely needed PR @vyasr , I am reviewing this over the next few days. One general thought I had going over this is that I think we should create a file test_isin.py and relocate all the tests there. There are a lot of cases to cover and many of the tests are meant to test what happens with different data structures (Series, DataFrame, etc), and I think it's easier to just have them all in one place.

Co-authored-by: brandon-b-miller <[email protected]>
@vyasr
Copy link
Contributor Author

vyasr commented Feb 4, 2022

thanks for this very sorely needed PR @vyasr , I am reviewing this over the next few days. One general thought I had going over this is that I think we should create a file test_isin.py and relocate all the tests there. There are a lot of cases to cover and many of the tests are meant to test what happens with different data structures (Series, DataFrame, etc), and I think it's easier to just have them all in one place.

I'm not opposed, but I feel like this fits into our broader discussion in #9999 about how tests should be organized. Do we want a separate file for each type of functionality? We shouldn't make this discussion differently for different functions, otherwise it will become very difficult to know where to look for things.

Also, let me know when you've finished reviewing enough for me to come back to this!

@vyasr
Copy link
Contributor Author

vyasr commented Feb 9, 2022

rerun tests

@vyasr
Copy link
Contributor Author

vyasr commented Feb 18, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 858ab83 into rapidsai:branch-22.04 Feb 18, 2022
@vyasr vyasr deleted the refactor/isin branch March 9, 2022 17:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants