-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse (non-MultiIndex) label-based keys to structured data #13717
base: branch-23.10
Are you sure you want to change the base?
Parse (non-MultiIndex) label-based keys to structured data #13717
Conversation
Explicitly requested the same set of reviewers as #13534. On the one hand, I would have liked to handle multiindex lookups as well in one change, but unfortunately I think the rules are too complicated and I'm still waiting for clarification on a bunch of ambiguities in the way pandas handles things. As a consequence this doesn't remove as much code as one would have hoped. |
15c955e
to
c5f8eb4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some signposts
def _select_by_names(self, names: abc.Sequence) -> Self: | ||
return self.__class__( | ||
{key: self[key] for key in names}, | ||
multiindex=self.multiindex, | ||
level_names=self.level_names, | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced this in #13534, but I realised it's better if the column key parsing returns a ColumnAccessor
(rather than column names) so it's no longer necessary.
if len(set(keys)) != len(keys): | ||
raise NotImplementedError( | ||
"cudf DataFrames do not support repeated column names" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This now loudly raises when getting from a ColumnAccessor
would produce duplicate column names
def __getitem__(self, arg): | ||
if ( | ||
isinstance(self._frame.index, MultiIndex) | ||
or self._frame._data.multiindex | ||
): | ||
# This try/except block allows the use of pandas-like | ||
# tuple arguments into MultiIndex dataframes. | ||
try: | ||
return self._getitem_tuple_arg(arg) | ||
except (TypeError, KeyError, IndexError, ValueError): | ||
return self._getitem_tuple_arg((arg, slice(None))) | ||
else: | ||
if not isinstance(arg, tuple): | ||
arg = (arg, slice(None)) | ||
return self._getitem_tuple_arg(arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer shared between loc
and iloc
cases.
row_key, ( | ||
col_is_scalar, | ||
ca, | ||
) = indexing_utils.destructure_dataframe_loc_indexer( | ||
arg, self._frame | ||
) | ||
row_spec = indexing_utils.parse_row_loc_indexer( | ||
row_key, self._frame.index | ||
) | ||
return self._frame._getitem_preprocessed( | ||
row_spec, col_is_scalar, ca | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the new approach (which doesn't handle multiindex lookups yet).
pos_range = _get_label_range_or_mask( | ||
self._frame.index, key[0].start, key[0].stop, key[0].step | ||
indexer = indexing_utils.find_label_range_or_mask( | ||
key[0], self._frame.index | ||
) | ||
idx = self._frame.index[pos_range] | ||
index = self._frame.index | ||
if isinstance(indexer, indexing_utils.EmptyIndexer): | ||
idx = index[0:0:1] | ||
elif isinstance(indexer, indexing_utils.SliceIndexer): | ||
idx = index[indexer.key] | ||
else: | ||
idx = index[indexer.key.column] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved _get_label_range_or_mask
into indexing_utils
and return structured data (which for now we must pull apart here because I haven't touched setitem yet).
parsed_key = index.find_label_range(key) | ||
if len(range(len(index))[parsed_key]) == 0: | ||
return EmptyIndexer() | ||
else: | ||
return SliceIndexer(parsed_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only bit that is relevant for pandas-2 compat.
# TODO: is this right? | ||
key = key._get_decategorized_column() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is right. If the key is a categorical and the index is not, we should treat the key as "array-like" a decategorise it before looking up the values it encodes.
if isinstance(index, cudf.DatetimeIndex): | ||
# Try to turn strings into datetimes | ||
key = cudf.core.column.as_column(key, dtype=index.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be astype
?
return GatherMap.from_column_unchecked( | ||
rgather, len(haystack), nullify=False | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice here how we don't need to bounds-check the gather map (since the merge guarantees everything is in-bounds).
raise NotImplementedError( | ||
"This code path is not designed for multiindices" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two things that need to happen in this case:
- individual keys can have more cases (e.g.
(a, b):(c, d)
is a valid slice for a multiindex if the multiindex is ordered), a list-like can be not just single-level labels, but tuples of labels) - If we get a tuple of per-level indexers we must combine them to a single output indexer and the semantics of this combination in pandas are ambiguous in many cases.
Raise NotImplementedError if indexing would try and produce a new object with duplicate key names. This avoids silently losing information.
Rather than just returning column names, since we will immediately construct the ColumnAccessor, make that and return it.
If the index is non-numeric, Series.__getitem__ should fall back to positional lookup for integer keys.
Similar to positional indexing, implement parsing to structured IndexingSpec data for label indexing. This does not yet treat MultiIndex lookups for which the rules for combining the multi-level keys are more complicated.
Trying to move all the special-case handling into one place. Additionally, handle step != 1 correctly for datetime indexing.
Perhaps not worth it...
ed805cd
to
546fef5
Compare
/ok to test |
@wence- did this stall out just waiting for reviews? If so, I'm very sorry I lost track of it. Please bring it up to date and I'm happy to review this ASAP (probably for 24.08 not 24.06 at this point though). |
Sorry, it got rather struck on my end, I will try and resurrect for 24.10 |
Description
Following on from #13534, this extends the scheme to handle label-based lookups as long as the index is not a multiindex.
As is the case for positional indexing, all of the different ways one can index a frame with labels eventually boil down to indexing by slice, boolean mask, map, or scalar in libcudf.
loc
-based keys are parsed into information that tags them by type. Since this information is the same as is used foriloc
indexing, we can then dispatch to the same "internal" calls that don't do further bounds-checking or normalisation: rather than converting a label-based lookup to an argument we can pass toiloc
-getitem (which must reinterpret it), we just take the decision straight away.The next stage (which will help to remove a bunch of code) will be to handle multiindex keys, but that will be sufficiently complicated that I'd rather do it separately.
Ellipsis
#13268loc
-based indexing of DataFrames silently discards missing keys if at least one key is present in indexer #13379Checklist