[BUG] iloc indexing with a Categorical not possible #13013

wence- · 2023-03-27T10:20:49Z

Describe the bug

import cudf
import pandas as pd

f = pd.Series([0, 1, 2])
index1 = pd.Categorical([1])
f.iloc[index1] # => Length 1 series with the second row
index2 = pd.Categorical([0, 2])
f.iloc[index2] # => Length 2 series with first and third rows

cf = cudf.from_pandas(f)
cf.iloc[index1] # => TypeError: int() argument must be a string, a bytes-like object or a real number, not 'Categorical'
cf.iloc[index2] # => NotImplementedError: Unknown indexer <class 'cudf.core.column.categorical.CategoricalColumn'>

The former error is because _is_scalar_or_zero_d_array treats categoricals of length one as scalars (prior to vyasr@6251122 all categoricals were treated as scalars). Which, as observed above, is in conflict with pandas.

The latter is an error because casting into a an integer array doesn't work, and so the indexer ends up not knowing what to do.

Expected behavior

I'm not sure, the implicit casting to integers for iloc-based indexing of an object that might well not contain integers feels like an accident of pandas history. The pandas documentation says (of iloc indexing):

The .iloc attribute is the primary access method. The following are valid inputs:
[...]

A list or array of integers [4, 3, 0].

I think a categorical is not a list of integers, so this should be a TypeError, but it just happens to work because np.asarray(whatever).astype(int) works as long as whatever is coercible to int (this is broad e.g. f.iloc[[1.5]] works!?).

@mroeschke do you happen to know if iloc-based indexing with categoricals is "definitely supported" or "accident of history"?

The text was updated successfully, but these errors were encountered:

mroeschke · 2023-03-27T22:12:24Z

do you happen to know if iloc-based indexing with categoricals is "definitely supported" or "accident of history"?

pd.Categorical is "array-like" so I would expect a pd.Categorical of integers to work in iloc in pandas

wence- · 2023-03-28T08:44:59Z

Thanks!

wence- · 2023-03-28T14:31:34Z

pd.Categorical is "array-like" so I would expect a pd.Categorical of integers to work in iloc in pandas

What about a categorical of floats?

import pandas as pd
s = pd.Series([0, 1, 2])
s.iloc[pd.Categorical([1.5])]
# => 1 1

mroeschke · 2023-03-28T18:08:31Z

What about a categorical of floats?

Yeah I would expect this to raise an exception ( I think pandas has an existing issue of passing an array-like of floats to iloc with an integer index )

This is now purely location based, with no handling of multiindices in the case of dataframes, and correct treatment of tuple arguments. - Closes rapidsai#13015 - Closes rapidsai#13013

To simplify the low-level implementation of iloc-based getitem on both Series and DataFrames, change the dispatching approach to parse the user-provided "unstructured" key into structured data (a tagged union using an enum + tuple). At the libcudf level, there are four styles of indexing we can do: 1. index by slice 2. index by mask 3. index by map 4. index by scalar iloc keys are parsed into information that tags them by type and normalises the key to an appropriate column or other low-level object. This centralises the business logic for index parsing in a single place, and ensures that downstream consumers of the validated and normalised indexer don't need to inspect it again to determine what to do. Note that we treat index by scalar as composition of index by map with get_element (since that simplifies the logic when extracting the single row of a dataframe: we want to keep it on device), but the scalar "type tag" allows us to determine this unambiguously without reinspecting the key. The major benefits will come when updating loc-based getitem (where the parsing rules are more complicated, but eventually turn into one of the above four cases). In this latter case, we will no longer attempt to turn a loc-based key into a "user-facing" key for iloc, but rather will call directly into the pre-parsed interface. That said, we already provide some performance improvements since we only do inspection once. - Closes rapidsai#13013 - Closes rapidsai#13267 - Closes rapidsai#13515

To simplify the low-level implementation of iloc-based getitem on both Series and DataFrames, change the dispatching approach to parse the user-provided "unstructured" key into structured data (an appropriate tagged union using new dataclasses). At the libcudf level, there are four styles of indexing we can do: 1. index by slice 2. index by mask 3. index by map 4. index by scalar iloc keys are parsed into information that tags them by type and normalises the key to an appropriate column or other low-level object. This centralises the business logic for index parsing in a single place, and ensures that downstream consumers of the validated and normalised indexer don't need to inspect it again to determine what to do. Note that we treat index by scalar as composition of index by map with get_element (since that simplifies the logic when extracting the single row of a dataframe: we want to keep it on device), but the scalar "type tag" allows us to determine this unambiguously without reinspecting the key. The major benefits will come when updating loc-based getitem (where the parsing rules are more complicated, but eventually turn into one of the above four cases). In this latter case, we will no longer attempt to turn a loc-based key into a "user-facing" key for iloc, but rather will call directly into the pre-parsed interface. That said, we already provide some performance improvements since we only do inspection once. - Closes #13013 - Closes #13267 - Closes #13515 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) URL: #13534

wence- added bug Something isn't working Needs Triage Need team to review and classify labels Mar 27, 2023

wence- self-assigned this Mar 27, 2023

This was referenced Mar 27, 2023

[BUG] loc indexing with length-1 Categorical not possible #13014

Open

[ENH]: Reworking of iloc and loc indexing #12793

Open

wence- added this to the Pandas API Alignment and Coverage milestone May 2, 2023

wence- added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function and removed Needs Triage Need team to review and classify labels May 2, 2023

wence- mentioned this issue May 11, 2023

Add tests of currently unsupported indexing #13338

Merged

3 tasks

wence- mentioned this issue Jun 8, 2023

Implement iloc-getitem using parse-don't-validate approach #13534

Merged

3 tasks

rapids-bot bot closed this as completed in #13534 Jul 14, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] iloc indexing with a Categorical not possible #13013

[BUG] iloc indexing with a Categorical not possible #13013

wence- commented Mar 27, 2023

mroeschke commented Mar 27, 2023

wence- commented Mar 28, 2023

wence- commented Mar 28, 2023

mroeschke commented Mar 28, 2023

[BUG] iloc indexing with a Categorical not possible #13013

[BUG] iloc indexing with a Categorical not possible #13013

Comments

wence- commented Mar 27, 2023

mroeschke commented Mar 27, 2023

wence- commented Mar 28, 2023

wence- commented Mar 28, 2023

mroeschke commented Mar 28, 2023