Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] iloc indexing with a Categorical not possible #13013

Closed
Tracked by #12793
wence- opened this issue Mar 27, 2023 · 4 comments · Fixed by #13534
Closed
Tracked by #12793

[BUG] iloc indexing with a Categorical not possible #13013

wence- opened this issue Mar 27, 2023 · 4 comments · Fixed by #13534
Assignees
Labels
bug Something isn't working improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Mar 27, 2023

Describe the bug

import cudf
import pandas as pd

f = pd.Series([0, 1, 2])
index1 = pd.Categorical([1])
f.iloc[index1] # => Length 1 series with the second row
index2 = pd.Categorical([0, 2])
f.iloc[index2] # => Length 2 series with first and third rows

cf = cudf.from_pandas(f)
cf.iloc[index1] # => TypeError: int() argument must be a string, a bytes-like object or a real number, not 'Categorical'
cf.iloc[index2] # => NotImplementedError: Unknown indexer <class 'cudf.core.column.categorical.CategoricalColumn'>

The former error is because _is_scalar_or_zero_d_array treats categoricals of length one as scalars (prior to vyasr@6251122 all categoricals were treated as scalars). Which, as observed above, is in conflict with pandas.

The latter is an error because casting into a an integer array doesn't work, and so the indexer ends up not knowing what to do.

Expected behavior

I'm not sure, the implicit casting to integers for iloc-based indexing of an object that might well not contain integers feels like an accident of pandas history. The pandas documentation says (of iloc indexing):

The .iloc attribute is the primary access method. The following are valid inputs:
[...]

  • A list or array of integers [4, 3, 0].

I think a categorical is not a list of integers, so this should be a TypeError, but it just happens to work because np.asarray(whatever).astype(int) works as long as whatever is coercible to int (this is broad e.g. f.iloc[[1.5]] works!?).

@mroeschke do you happen to know if iloc-based indexing with categoricals is "definitely supported" or "accident of history"?

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Mar 27, 2023
@wence- wence- self-assigned this Mar 27, 2023
@mroeschke
Copy link
Contributor

do you happen to know if iloc-based indexing with categoricals is "definitely supported" or "accident of history"?

pd.Categorical is "array-like" so I would expect a pd.Categorical of integers to work in iloc in pandas

@wence-
Copy link
Contributor Author

wence- commented Mar 28, 2023

Thanks!

@wence-
Copy link
Contributor Author

wence- commented Mar 28, 2023

pd.Categorical is "array-like" so I would expect a pd.Categorical of integers to work in iloc in pandas

What about a categorical of floats?

import pandas as pd
s = pd.Series([0, 1, 2])
s.iloc[pd.Categorical([1.5])]
# => 1 1

@mroeschke
Copy link
Contributor

What about a categorical of floats?

Yeah I would expect this to raise an exception ( I think pandas has an existing issue of passing an array-like of floats to iloc with an integer index )

@wence- wence- added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function and removed Needs Triage Need team to review and classify labels May 2, 2023
wence- added a commit to wence-/cudf that referenced this issue May 10, 2023
This is now purely location based, with no handling of multiindices in
the case of dataframes, and correct treatment of tuple arguments.

- Closes rapidsai#13015
- Closes rapidsai#13013
wence- added a commit to wence-/cudf that referenced this issue Jun 8, 2023
To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (a tagged
union using an enum + tuple). At the libcudf level, there are four
styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we
only do inspection once.

- Closes rapidsai#13013
- Closes rapidsai#13267
- Closes rapidsai#13515
wence- added a commit to wence-/cudf that referenced this issue Jun 23, 2023
To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (a tagged
union using an enum + tuple). At the libcudf level, there are four
styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we
only do inspection once.

- Closes rapidsai#13013
- Closes rapidsai#13267
- Closes rapidsai#13515
rapids-bot bot pushed a commit that referenced this issue Jul 14, 2023
To simplify the low-level implementation of iloc-based getitem on both
Series and DataFrames, change the dispatching approach to parse the
user-provided "unstructured" key into structured data (an appropriate
tagged union using new dataclasses). At the libcudf level, there are four
styles of indexing we can do:

1. index by slice
2. index by mask
3. index by map
4. index by scalar

iloc keys are parsed into information that tags them by type and
normalises the key to an appropriate column or other low-level object.

This centralises the business logic for index parsing in a
single place, and ensures that downstream consumers of the validated
and normalised indexer don't need to inspect it again to determine
what to do. Note that we treat index by scalar as composition of index
by map with get_element (since that simplifies the logic when
extracting the single row of a dataframe: we want to keep it on
device), but the scalar "type tag" allows us to determine this
unambiguously without reinspecting the key.

The major benefits will come when updating loc-based getitem (where
the parsing rules are more complicated, but eventually turn into one
of the above four cases). In this latter case, we will no longer
attempt to turn a loc-based key into a "user-facing" key for iloc, but
rather will call directly into the pre-parsed interface.

That said, we already provide some performance improvements since we
only do inspection once.

- Closes #13013
- Closes #13267
- Closes #13515

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Ashwin Srinath (https://github.com/shwina)
  - Bradley Dice (https://github.com/bdice)

URL: #13534
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working improvement Improvement / enhancement to an existing function Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants