Parse (non-MultiIndex) label-based keys to structured data #13717

wence- · 2023-07-18T15:28:41Z

Description

Following on from #13534, this extends the scheme to handle label-based lookups, as long as the index is not a MultiIndex.

As with positional indexing, every way of indexing a frame with labels eventually boils down to indexing by slice, boolean mask, map, or scalar in libcudf. loc-based keys are therefore parsed into structured data that tags each key with its type. Because this is the same information that iloc indexing uses, we can dispatch to the same "internal" calls, which do no further bounds-checking or normalisation. Rather than converting a label-based lookup into an argument for iloc-getitem (which would then have to reinterpret it), we make the decision straight away.
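To illustrate the idea (this is not the code in this PR), here is a minimal pure-Python sketch of parsing a label-based key into a tagged positional indexer and then dispatching on the tag. The class names mirror the flavour of `indexing_utils`, but `MaskIndexer`, `MapIndexer`, `ScalarIndexer`, `parse_loc_key`, and `take` are hypothetical stand-ins, plain lists stand in for columns, and labels are assumed unique.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical tags: each wraps an already-validated positional key.
@dataclass
class SliceIndexer:
    key: slice


@dataclass
class MaskIndexer:
    key: list  # list of bool, same length as the index


@dataclass
class MapIndexer:
    key: list  # list of integer positions


@dataclass
class ScalarIndexer:
    key: int


Indexer = Union[SliceIndexer, MaskIndexer, MapIndexer, ScalarIndexer]


def _position(label, labels):
    """Return the position of a label, raising KeyError if it is absent."""
    try:
        return labels.index(label)
    except ValueError:
        raise KeyError(label) from None


def parse_loc_key(key, labels) -> Indexer:
    """Parse a label-based key into a tagged positional indexer (sketch)."""
    if isinstance(key, slice):
        # Label slices include both endpoints, unlike positional slices.
        start = 0 if key.start is None else _position(key.start, labels)
        stop = len(labels) if key.stop is None else _position(key.stop, labels) + 1
        return SliceIndexer(slice(start, stop, key.step))
    if isinstance(key, list) and key and all(isinstance(k, bool) for k in key):
        if len(key) != len(labels):
            raise IndexError("boolean mask has wrong length")
        return MaskIndexer(key)
    if isinstance(key, list):
        return MapIndexer([_position(k, labels) for k in key])
    return ScalarIndexer(_position(key, labels))


def take(values, indexer: Indexer):
    """Dispatch on the tag; no further validation is needed at this point."""
    if isinstance(indexer, SliceIndexer):
        return values[indexer.key]
    if isinstance(indexer, MaskIndexer):
        return [v for v, keep in zip(values, indexer.key) if keep]
    if isinstance(indexer, MapIndexer):
        return [values[i] for i in indexer.key]
    return values[indexer.key]
```

Because the tag already records what kind of lookup was requested, `take` never has to re-inspect the original key, which is the point of parsing once up front.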

The next stage (which will help to remove a bunch of code) will be to handle MultiIndex keys, but that will be sufficiently complicated that I'd rather do it separately.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2023-07-18T15:36:25Z

Explicitly requested the same set of reviewers as #13534.

On the one hand, I would have liked to handle multiindex lookups as well in one change, but unfortunately I think the rules are too complicated and I'm still waiting for clarification on a bunch of ambiguities in the way pandas handles things. As a consequence this doesn't remove as much code as one would have hoped.

wence-

Some signposts

wence- · 2023-07-18T15:39:59Z

python/cudf/cudf/core/column_accessor.py

-    def _select_by_names(self, names: abc.Sequence) -> Self:
-        return self.__class__(
-            {key: self[key] for key in names},
-            multiindex=self.multiindex,
-            level_names=self.level_names,
-        )
-


I introduced this in #13534, but I realised it's better if the column key parsing returns a ColumnAccessor (rather than column names) so it's no longer necessary.

wence- · 2023-07-18T15:40:42Z

python/cudf/cudf/core/column_accessor.py

+        if len(set(keys)) != len(keys):
+            raise NotImplementedError(
+                "cudf DataFrames do not support repeated column names"
+            )


This now raises loudly when selecting from a ColumnAccessor would produce duplicate column names.
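As a rough illustration of the effect, here is a self-contained sketch that uses a plain dict in place of a ColumnAccessor. The function name `select_unique` is hypothetical and not part of cudf.

```python
def select_unique(columns: dict, keys) -> dict:
    """Select columns by name, refusing requests that repeat a name (sketch)."""
    keys = list(keys)
    # A set drops duplicates, so a shorter set means some name was repeated.
    if len(set(keys)) != len(keys):
        raise NotImplementedError(
            "cudf DataFrames do not support repeated column names"
        )
    return {key: columns[key] for key in keys}
```

Without the check, a dict-backed container would silently collapse the repeated names into one column rather than report the problem.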

wence- · 2023-07-18T15:41:14Z

python/cudf/cudf/core/dataframe.py

-    def __getitem__(self, arg):
-        if (
-            isinstance(self._frame.index, MultiIndex)
-            or self._frame._data.multiindex
-        ):
-            # This try/except block allows the use of pandas-like
-            # tuple arguments into MultiIndex dataframes.
-            try:
-                return self._getitem_tuple_arg(arg)
-            except (TypeError, KeyError, IndexError, ValueError):
-                return self._getitem_tuple_arg((arg, slice(None)))
-        else:
-            if not isinstance(arg, tuple):
-                arg = (arg, slice(None))
-            return self._getitem_tuple_arg(arg)


No longer shared between loc and iloc cases.

wence- · 2023-07-18T15:42:11Z

python/cudf/cudf/core/dataframe.py

+            row_key, (
+                col_is_scalar,
+                ca,
+            ) = indexing_utils.destructure_dataframe_loc_indexer(
+                arg, self._frame
+            )
+            row_spec = indexing_utils.parse_row_loc_indexer(
+                row_key, self._frame.index
+            )
+            return self._frame._getitem_preprocessed(
+                row_spec, col_is_scalar, ca
+            )


This is the new approach (which doesn't handle multiindex lookups yet).
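For readers unfamiliar with the pattern, the first step amounts to splitting the key into a row part and a column part, resolving the column part immediately. The sketch below is only an assumption about the general shape, not the implementation of `destructure_dataframe_loc_indexer`: the name `destructure_loc_key` is hypothetical, a plain dict stands in for the ColumnAccessor, and only the simplest key forms are handled.

```python
def destructure_loc_key(key, columns: dict):
    """Split a loc key into (row_key, (col_is_scalar, selected_columns)).

    Simplified sketch: supports a bare row key, or a (row, column) pair
    whose column part is a full slice, a list of names, or a single name.
    """
    if isinstance(key, tuple):
        if len(key) != 2:
            raise IndexError("expected a (row, column) pair")
        row_key, col_key = key
    else:
        # A bare key selects rows and keeps every column.
        row_key, col_key = key, slice(None)

    if isinstance(col_key, slice):
        if col_key != slice(None):
            raise NotImplementedError("only full column slices in this sketch")
        return row_key, (False, dict(columns))
    if isinstance(col_key, list):
        return row_key, (False, {name: columns[name] for name in col_key})
    # A single column name: record that the caller asked for a scalar column.
    return row_key, (True, {col_key: columns[col_key]})
```

Carrying the `col_is_scalar` flag alongside the selected columns lets the caller decide later whether to return a frame or a single series, without looking at the original key again.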

wence- · 2023-07-18T15:48:00Z

python/cudf/cudf/core/dataframe.py

-                pos_range = _get_label_range_or_mask(
-                    self._frame.index, key[0].start, key[0].stop, key[0].step
+                indexer = indexing_utils.find_label_range_or_mask(
+                    key[0], self._frame.index
                )
-                idx = self._frame.index[pos_range]
+                index = self._frame.index
+                if isinstance(indexer, indexing_utils.EmptyIndexer):
+                    idx = index[0:0:1]
+                elif isinstance(indexer, indexing_utils.SliceIndexer):
+                    idx = index[indexer.key]
+                else:
+                    idx = index[indexer.key.column]


I've moved _get_label_range_or_mask into indexing_utils, where it now returns structured data (which, for now, we must pull apart here because I haven't touched setitem yet).

wence- · 2023-07-18T16:14:58Z