Add bindings for index_of with column search key #10696

ChrisJar · 2022-04-20T21:48:32Z

This adds bindings for index_of to enable using list.index with a Series of search keys.

codecov · 2022-04-20T23:01:22Z

Codecov Report

Merging #10696 (7341cbd) into branch-22.06 (65b1cbd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.06   #10696      +/-   ##
================================================
+ Coverage         86.35%   86.39%   +0.04%     
================================================
  Files               142      142              
  Lines             22335    22306      -29     
================================================
- Hits              19287    19272      -15     
+ Misses             3048     3034      -14

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/lists.py	`92.91% <100.00%> (+1.39%)`	⬆️
python/cudf/cudf/api/types.py	`89.36% <0.00%> (-0.44%)`	⬇️
python/cudf/cudf/core/column/column.py	`89.43% <0.00%> (-0.02%)`	⬇️
python/cudf/cudf/core/frame.py	`93.41% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`92.31% <0.00%> (ø)`
python/cudf/cudf/core/dtypes.py	`97.30% <0.00%> (ø)`
python/cudf/cudf/testing/dataset_generator.py	`73.25% <0.00%> (ø)`
python/cudf/cudf/core/dataframe.py	`93.75% <0.00%> (+<0.01%)`	⬆️
python/cudf/cudf/core/series.py	`95.16% <0.00%> (+<0.01%)`	⬆️
python/cudf/cudf/utils/utils.py	`90.35% <0.00%> (+0.06%)`	⬆️
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5f6b70a...7341cbd.

bdice

Comments attached.

bdice · 2022-04-21T15:04:43Z

python/cudf/cudf/core/column/lists.py

-        search_key = cudf.Scalar(search_key)
+    def index(self, search_key: Union[ScalarLike, ColumnLike]) -> ParentType:
+        """
+        Return integers representing the index of the search key for each row.


The first line "brief" should be followed by a blank line before the longer summary.

Suggested change

-        Return integers representing the index of the search key for each row.
+        Return integers representing the index of the search key for each row.
+

bdice · 2022-04-21T15:10:13Z

python/cudf/cudf/core/column/lists.py

+        If the search key is not contained in a row, return -1.
+        If either the row or the search key are null, return <NA>.


If the search key is contained multiple times, does this return the smallest matching index?

Suggested change

-        If the search key is not contained in a row, return -1.
-        If either the row or the search key are null, return <NA>.
+        If the search key is not contained in a row, return -1.
+        If either the row or the search key are null, return <NA>.
+        If the search key is contained multiple times, the smallest matching
+        index is returned.
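
The row-wise behavior described in this docstring can be modeled in plain Python. This is only a sketch of the documented semantics, not the cuDF or libcudf implementation; the names `index_in_row` and `list_index` are hypothetical, and `None` stands in for `<NA>`.

```python
from typing import Any, List, Optional, Sequence

NA = None  # assumption: None stands in for cuDF's <NA> in this sketch


def index_in_row(row: Optional[Sequence[Any]], key: Any) -> Optional[int]:
    """Model the documented result for one list row and one search key."""
    if row is NA or key is NA:
        return NA  # a null row or a null search key yields <NA>
    for position, element in enumerate(row):
        if element == key:
            return position  # first hit, i.e. the smallest matching index
    return -1  # the key is not contained in the row


def list_index(rows: Sequence[Optional[Sequence[Any]]], keys: Any) -> List[Optional[int]]:
    """Apply index_in_row with either one scalar key or one key per row."""
    if isinstance(keys, (list, tuple)):
        # Column-like search key: pair each row with its own key.
        return [index_in_row(row, key) for row, key in zip(rows, keys)]
    # Scalar search key: reuse the same key for every row.
    return [index_in_row(row, keys) for row in rows]
```

With `rows = [[1, 2, 2], [3, 4], None, [2]]`, a scalar key of `2` gives `[1, -1, None, 0]`: the duplicate in the first row resolves to the smaller index, the second row has no match, and the null row propagates as null.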

bdice · 2022-04-21T15:36:50Z

python/cudf/cudf/core/column/lists.py

        try:
-            res = self._return_or_inplace(index_of(self._column, search_key))
+            if is_scalar(search_key):
+                res = self._return_or_inplace(


There's no post-processing needed so I would return directly instead of saving a variable res and returning later.

Suggested change

-                res = self._return_or_inplace(
+                return self._return_or_inplace(

bdice · 2022-04-21T15:39:54Z

python/cudf/cudf/tests/test_list.py

    ],
 )
-def test_index_invalid(data, scalar):
+def test_index_invalid(data, search_key):


Can we add a test for the invalid case where the search key is not the right length? e.g. len(sr) != len(search_key)
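
A test along these lines might look like the sketch below. It exercises a pure-Python stand-in (the hypothetical `list_index_with_column_key`) rather than cuDF itself, and the use of `ValueError` is an assumption: this thread does not say which exception cuDF raises when the lengths differ.

```python
from typing import Any, List, Sequence


def list_index_with_column_key(
    rows: Sequence[Sequence[Any]], keys: Sequence[Any]
) -> List[int]:
    """Hypothetical stand-in: one search key per row, so lengths must match."""
    if len(rows) != len(keys):
        # Assumption: mismatched lengths are rejected before any searching.
        raise ValueError(
            f"search key has length {len(keys)}, expected {len(rows)}"
        )
    return [
        list(row).index(key) if key in row else -1
        for row, key in zip(rows, keys)
    ]


def test_index_invalid_length() -> None:
    rows = [[1, 2], [3, 4], [5, 6]]
    keys = [1, 3]  # len(keys) != len(rows)
    try:
        list_index_with_column_key(rows, keys)
    except ValueError:
        return  # the mismatch was rejected, as the test expects
    raise AssertionError("mismatched lengths should be rejected")
```

In the real test suite the same check would presumably be written with `pytest.raises` against `sr.list.index(search_key)`.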

bdice · 2022-04-21T15:42:24Z

python/cudf/cudf/tests/test_list.py

@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar):
        ),
    ],
 )
-def test_index(data, scalar, expect):
+def test_index(data, search_key, expect):


Can we add a test case for multi-level nested data? (if that is supported)

sr = cudf.Series([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
sr.list.index([[1, 2], [7, 8]])  # would return [0, 1] if supported

Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. For example, is_scalar might not be the appropriate check if "list scalars" are provided to check against a list of lists: scalar-like input would have one fewer dimension / nested level than the input column, while column-like input would have an equal number of nested levels.
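
The distinction drawn here can be illustrated by counting nesting levels. The sketch below is purely illustrative and uses plain Python lists; `nesting_depth` and `classify_search_key` are hypothetical helpers, not cuDF functions, and nothing in it implies that multi-level search is supported.

```python
from typing import Any


def nesting_depth(value: Any) -> int:
    """Count list levels: 0 for a plain scalar, 1 for [1, 2], 2 for [[1], [2]]."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        # Descend into the first non-null element; stop if there is none.
        value = next((item for item in value if item is not None), None)
    return depth


def classify_search_key(rows: list, key: Any) -> str:
    """Compare the key's nesting with the nesting inside each row."""
    row_levels = nesting_depth(rows) - 1  # list levels within a single row
    key_levels = nesting_depth(key)
    if key_levels == row_levels - 1:
        return "scalar-like"  # one fewer level than the rows
    if key_levels == row_levels:
        return "column-like"  # same number of levels as the rows
    raise TypeError(
        f"key with {key_levels} list level(s) does not fit rows "
        f"with {row_levels} list level(s)"
    )
```

For rows `[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]`, the key `[1, 2]` classifies as scalar-like even though it is a list, which is why a plain `is_scalar` check would misroute it, while `[[1, 2], [7, 8]]` classifies as column-like.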

bdice

Two questions about supported types.

bdice · 2022-04-22T21:13:13Z

python/cudf/cudf/core/column/lists.py

+        Notes
+        -----
+        ``index`` only supports list search operations on numeric types,
+        decimals, chrono types, and strings.


"Chrono" is a C++ / libcudf term and doesn't appear in the cuDF Python docs. The Python docs discuss datetimes and timedeltas.

To clarify my own understanding, what types are not supported here? It looks like this is exhaustive of the scalar types we support unless there's some catch for categorical or bool?

Suggested change

-        decimals, chrono types, and strings.
+        decimals, datetimes, timedeltas, and strings.

Yep, upon further investigation it looks like bools are supported. But when a list scalar is used as a search key on a multi-level nested list series, index_of throws:
List search operations are only supported on numeric types, decimals, chrono types, and strings.

Do you know how I might go about testing categorical types?

Thank you very much for your thorough testing! I'm not sure if "list of categorical" is a type that cuDF can represent. Given what you found, I think it would be okay to remove this note about limited type support. Essentially all scalar types that can be used in a list type are supported, from what I can tell.

Even though multi-level data is not supported, I am glad to see that multi-index searches fail with a real error message and not a complicated and cryptic traceback. 😄

bdice · 2022-04-22T21:15:51Z

python/cudf/cudf/tests/test_list.py

@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar):
        ),
    ],
 )
-def test_index(data, scalar, expect):
+def test_index(data, search_key, expect):


Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. For example, is_scalar might not be the appropriate check if "list scalars" are provided to check against a list of lists: scalar-like input would have one fewer dimension / nested level than the input column, while column-like input would have an equal number of nested levels.

bdice

Excellent! If you wish, I suggested one minor change:

Given what you found, I think it would be okay to remove this note about limited type support.

bdice · 2022-04-23T17:11:20Z

Test failures are unrelated. We'll fix the tests and merge this next week.

bdice · 2022-04-27T16:09:50Z

rerun tests

bdice · 2022-04-27T18:13:26Z

@gpucibot merge

Add bindings for index_of with column search key

5b3adf3

ChrisJar requested a review from a team as a code owner April 20, 2022 21:48

ChrisJar requested review from bdice and skirui-source April 20, 2022 21:48

github-actions bot added the Python (Affects Python cuDF API.) label Apr 20, 2022

Fix style

70b8200

bdice reviewed Apr 20, 2022

bdice added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Apr 20, 2022

bdice assigned ChrisJar Apr 20, 2022

Add documentation

1495620

bdice reviewed Apr 21, 2022

Address reviews

577294a

shwina approved these changes Apr 22, 2022

bdice requested changes Apr 22, 2022

bdice approved these changes Apr 22, 2022

Remove note

7341cbd

bdice added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Apr 23, 2022

rapids-bot bot merged commit 09995a5 into rapidsai:branch-22.06 Apr 27, 2022
