Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer to iloc when indexer is an integer-like object and the Index is non-numeric dtype #7897

Merged
merged 17 commits into from
Apr 19, 2021

Conversation

skirui-source
Copy link
Contributor

@skirui-source skirui-source commented Apr 7, 2021

Pandas interprets idx in the expression sr[idx] as an absolute position in the series sr when idx's dtype is different from that of sr's Index.

In Pandas, the indexing takes both an integer and a string as the index:

>>> import pandas as pd
>>> x = pd.Series([1,2,3], index=pd.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
2

Whereas cuDF treats idx as a value to look up in sr's Index, which can lead to different behaviors when indices have non-integral dtypes:

>>> import cudf
>>> x = cudf.Series([1,2,3], index=cudf.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/series.py", line 921, in __getitem__
    return self.loc[arg]
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 120, in __getitem__
    raise KeyError(arg)
KeyError: 1

This PR fixes the mismatch behavior in cuDF by deferring to iloc when a Series has a non-numerical Index and the indexer idx is an integer-like value : int, cudf Scalar, numpy int [np.int8, np.uint32, int64,,,]

Fixes: #7622
Replaces: #7775

@skirui-source skirui-source added bug Something isn't working 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels Apr 7, 2021
@skirui-source skirui-source self-assigned this Apr 7, 2021
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 7, 2021
@skirui-source skirui-source removed CMake CMake build issue conda Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 7, 2021
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Apr 7, 2021
@skirui-source skirui-source changed the base branch from branch-0.19 to branch-0.20 April 7, 2021 20:58
@codecov
Copy link

codecov bot commented Apr 7, 2021

Codecov Report

Merging #7897 (afcf79a) into branch-0.20 (51336df) will increase coverage by 0.02%.
The diff coverage is 88.88%.

❗ Current head afcf79a differs from pull request most recent head c06335f. Consider uploading reports for the commit c06335f to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #7897      +/-   ##
===============================================
+ Coverage        82.88%   82.91%   +0.02%     
===============================================
  Files              103      103              
  Lines            17668    17660       -8     
===============================================
- Hits             14645    14643       -2     
+ Misses            3023     3017       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/utils/cudautils.py 57.75% <25.00%> (ø)
python/cudf/cudf/core/column/column.py 88.64% <71.42%> (ø)
python/cudf/cudf/core/column/numerical.py 94.43% <72.72%> (ø)
python/cudf/cudf/core/dataframe.py 90.87% <83.33%> (+0.01%) ⬆️
python/cudf/cudf/utils/utils.py 89.53% <91.66%> (+0.02%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.91% <100.00%> (ø)
python/cudf/cudf/core/column/timedelta.py 88.66% <100.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 92.28% <100.00%> (+0.83%) ⬆️
python/cudf/cudf/core/index.py 93.07% <100.00%> (ø)
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3327f7b...c06335f. Read the comment docs.

@skirui-source skirui-source changed the base branch from branch-0.20 to branch-0.19 April 8, 2021 18:53
@skirui-source skirui-source changed the base branch from branch-0.19 to branch-0.20 April 8, 2021 21:20
@skirui-source skirui-source marked this pull request as ready for review April 9, 2021 17:45
@skirui-source skirui-source requested a review from a team as a code owner April 9, 2021 17:45
@vyasr
Copy link
Contributor

vyasr commented Apr 14, 2021

Side note: for titles and descriptions, let's be descriptive of what's introduced in the PR. When github squash merges, all that's left for the PR is the code, the title and the description. It's important we keep a trace of what's introduced. Example: 8dc1559

It also doesn't have to be an echo of the issue you are addressing to. Sometimes your PR addresses part of the issue, other times it adds more stuff outside of the scope of the issue.

@isVoid very good point, maybe post this in Slack to make it more visible to everyone? Especially with regard to the PR description, often people write initial descriptions with questions for the reviewers or requests for help, so it would be good to have the final description updated before PRs get merged.

@skirui-source skirui-source added the non-breaking Non-breaking change label Apr 14, 2021
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_indexing.py Outdated Show resolved Hide resolved
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@isVoid isVoid changed the title Defer to iloc when indexer is an integer-like object and the Index is non-integer dtype Defer to iloc when indexer is an integer-like object and the Index is non-numeric dtype Apr 15, 2021
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Apr 17, 2021
@kkraus14
Copy link
Collaborator

rerun tests

@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 3d212b3 into rapidsai:branch-0.20 Apr 19, 2021
@skirui-source skirui-source deleted the idexpress branch May 6, 2021 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Indexing expressions on series produce results different from Pandas'
4 participants