[BUG] Indexing expressions on series produce results different from Pandas' #7622

Closed
magnatelee opened this issue Mar 17, 2021 · 0 comments · Fixed by #7897
Labels: bug (Something isn't working), Python (Affects Python cuDF API)

Describe the bug
Pandas interprets the sub-expression idx of an expression sr[idx] as an absolute position in the series sr when idx's dtype differs from that of sr's index. cuDF, however, always treats a non-slice idx as a value to look up in sr's index, so the two libraries behave differently when the index has a non-integral dtype. The current cuDF Series.__getitem__ is:

def __getitem__(self, arg):
    if isinstance(arg, slice):
        return self.iloc[arg]
    else:
        return self.loc[arg]

Steps/Code to reproduce bug

In Pandas, a Series with a string index can be indexed by either a string label or an integer position:

>>> import pandas as pd
>>> x = pd.Series([1,2,3], index=pd.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
2

For the same example, cuDF raises a KeyError on the second access:

>>> import cudf
>>> x = cudf.Series([1,2,3], index=cudf.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 138, in _loc_to_iloc
    arg, closest=False
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/string.py", line 4938, in find_first_value
    return self._find_first_and_last(value)[0]
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/string.py", line 4933, in _find_first_and_last
    first = column.as_column(found_indices).find_first_value(1)
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 477, in find_first_value
    raise ValueError("value not found")
ValueError: value not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 118, in __getitem__
    arg = self._loc_to_iloc(arg)
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 142, in _loc_to_iloc
    raise KeyError("label scalar is out of bound")
KeyError: 'label scalar is out of bound'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/series.py", line 921, in __getitem__
    return self.loc[arg]
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 120, in __getitem__
    raise KeyError(arg)
KeyError: 1
>>> 
@magnatelee added the bug and Needs Triage labels on Mar 17, 2021
@skirui-source self-assigned this on Mar 23, 2021
@kkraus14 added the Python label and removed the Needs Triage label on Mar 26, 2021
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2021
…` is non-numeric dtype (#7897)

Pandas interprets `idx` in the expression `sr[idx]` as an absolute position in the series `sr` when `idx`'s `dtype` is different from that of `sr`'s `Index`.

In Pandas, a Series with a string index can be indexed by either a string label or an integer position:

```
>>> import pandas as pd
>>> x = pd.Series([1,2,3], index=pd.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
2

```

cuDF, by contrast, treats `idx` as a value to look up in `sr`'s `Index`, which leads to different behavior when the index has a non-integral dtype:

```
>>> import cudf
>>> x = cudf.Series([1,2,3], index=cudf.Index(["a", "b", "c"]))
>>> x["b"]
2
>>> x[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/series.py", line 921, in __getitem__
    return self.loc[arg]
  File "/home/nfs/wonchanl/anaconda3/envs/rapids-tpcx-bb/lib/python3.7/site-packages/cudf/core/indexing.py", line 120, in __getitem__
    raise KeyError(arg)
KeyError: 1

```

This PR fixes the mismatch in cuDF by deferring to `iloc` when a Series has a non-numerical Index and the indexer `idx` is an integer-like value: `int`, cuDF `Scalar`, or a NumPy integer (`np.int8`, `np.uint32`, `np.int64`, ...).
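
The dispatch rule described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the code merged in #7897: `ToySeries` and `is_integer_like` are hypothetical names invented for illustration, the index is modeled as a plain list, and cuDF `Scalar` keys are not modeled at all.

```python
import numbers


def is_integer_like(arg):
    """Hypothetical helper: True for integer keys, excluding bool."""
    return isinstance(arg, numbers.Integral) and not isinstance(arg, bool)


class ToySeries:
    """Hypothetical stand-in for a Series: parallel lists of labels and values."""

    def __init__(self, values, index):
        self.values = list(values)
        self.index = list(index)

    def _index_is_numeric(self):
        # Assumption: an index counts as numeric if every label is a number.
        return all(isinstance(label, numbers.Number) for label in self.index)

    def iloc(self, pos):
        # Positional access.
        return self.values[pos]

    def loc(self, label):
        # Label lookup; raises KeyError when the label is absent.
        try:
            return self.values[self.index.index(label)]
        except ValueError:
            raise KeyError(label) from None

    def __getitem__(self, arg):
        if isinstance(arg, slice):
            return self.values[arg]
        # Integer-like key on a non-numeric index: treat it as a position.
        if is_integer_like(arg) and not self._index_is_numeric():
            return self.iloc(arg)
        # Everything else remains a label lookup.
        return self.loc(arg)


x = ToySeries([1, 2, 3], index=["a", "b", "c"])
print(x["b"])  # label lookup -> 2
print(x[1])    # positional fallback -> 2

y = ToySeries([10, 20, 30], index=[5, 6, 7])
print(y[6])    # numeric index, so 6 is still a label -> 20
```

With a numeric index, an integer key is still treated as a label, so `y[1]` raises `KeyError` in this model rather than returning the element at position 1.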






Fixes: #7622
Replaces: #7775

Authors:
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Keith Kraus (https://github.com/kkraus14)

URL: #7897