-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bindings for index_of with column search key #10696
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.06 #10696 +/- ##
================================================
+ Coverage 86.35% 86.39% +0.04%
================================================
Files 142 142
Lines 22335 22306 -29
================================================
- Hits 19287 19272 -15
+ Misses 3048 3034 -14
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments attached.
search_key = cudf.Scalar(search_key) | ||
def index(self, search_key: Union[ScalarLike, ColumnLike]) -> ParentType: | ||
""" | ||
Return integers representing the index of the search key for each row. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first line "brief" should be followed by a blank line before the longer summary.
Return integers representing the index of the search key for each row. | |
Return integers representing the index of the search key for each row. | |
If the search key is not contained in a row, return -1. | ||
If either the row or the search key are null, return <NA>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the search key is contained multiple times, does this return the smallest matching index?
If the search key is not contained in a row, return -1. | |
If either the row or the search key are null, return <NA>. | |
If the search key is not contained in a row, return -1. | |
If either the row or the search key are null, return <NA>. | |
If the search key is contained multiple times, the smallest matching | |
index is returned. |
try: | ||
res = self._return_or_inplace(index_of(self._column, search_key)) | ||
if is_scalar(search_key): | ||
res = self._return_or_inplace( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no post-processing needed so I would return directly instead of saving a variable res
and returning later.
res = self._return_or_inplace( | |
return self._return_or_inplace( |
python/cudf/cudf/tests/test_list.py
Outdated
], | ||
) | ||
def test_index_invalid(data, scalar): | ||
def test_index_invalid(data, search_key): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test for the invalid case where the search key is not the right length? e.g. len(sr) != len(search_key)
@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar): | |||
), | |||
], | |||
) | |||
def test_index(data, scalar, expect): | |||
def test_index(data, search_key, expect): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test case for multi-level nested data? (if that is supported)
sr = cudf.Series({"a": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]})
sr.list.index([[1, 2], [7, 8]]) # returns [0, 1]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. e.g. is_scalar
might not be the appropriate check if "list scalars" are provided to check against a list of lists -- scalar-like input would have one fewer dimension / nested level that the input column, while column-like input would have an equal number of nested levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions about supported types.
Notes | ||
----- | ||
``index`` only supports list search operations on numeric types, | ||
decimals, chrono types, and strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Chrono" is a C++ / libcudf term and doesn't appear in the cuDF Python docs. The Python docs discuss datetimes and timedeltas.
To clarify my own understanding, what types are not supported here? It looks like this is exhaustive of the scalar types we support unless there's some catch for categorical or bool?
decimals, chrono types, and strings. | |
decimals, datetimes, timedeltas, and strings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, upon further investigation it looks like bools are supported. But when using a list scalar as a search key on a multi-level nested list series index_of
throws:
List search operations are only supported on numeric types, decimals, chrono types, and strings.
Do you know how I might go about testing categorical types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your thorough testing! I'm not sure if "list of categorical" is a type that cuDF can represent. Given what you found, I think it would be okay to remove this note about limited type support. Essentially all scalar types that can be used in a list type are supported, from what I can tell.
Even though multi-level data is not supported, I am glad to see that multi-index searches fail with a real error message and not a complicated and cryptic traceback. 😄
@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar): | |||
), | |||
], | |||
) | |||
def test_index(data, scalar, expect): | |||
def test_index(data, search_key, expect): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. e.g. is_scalar
might not be the appropriate check if "list scalars" are provided to check against a list of lists -- scalar-like input would have one fewer dimension / nested level that the input column, while column-like input would have an equal number of nested levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent! If you wish, I suggested one minor change:
Given what you found, I think it would be okay to remove this note about limited type support.
Test failures are unrelated. We'll fix the tests and merge this next week. |
rerun tests |
@gpucibot merge |
This adds bindings for
index_of
to enable usinglist.index
with a Series of search keys.Closes #10692
cc: @randerzander