Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bindings for index_of with column search key #10696

Merged
merged 5 commits into from
Apr 27, 2022

Conversation

ChrisJar
Copy link
Contributor

@ChrisJar ChrisJar commented Apr 20, 2022

This adds bindings for index_of to enable using list.index with a Series of search keys.

Closes #10692

cc: @randerzander

@ChrisJar ChrisJar requested a review from a team as a code owner April 20, 2022 21:48
@ChrisJar ChrisJar requested review from bdice and skirui-source April 20, 2022 21:48
@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 20, 2022
@bdice bdice added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 20, 2022
@codecov
Copy link

codecov bot commented Apr 20, 2022

Codecov Report

Merging #10696 (7341cbd) into branch-22.06 (65b1cbd) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.06   #10696      +/-   ##
================================================
+ Coverage         86.35%   86.39%   +0.04%     
================================================
  Files               142      142              
  Lines             22335    22306      -29     
================================================
- Hits              19287    19272      -15     
+ Misses             3048     3034      -14     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/lists.py 92.91% <100.00%> (+1.39%) ⬆️
python/cudf/cudf/api/types.py 89.36% <0.00%> (-0.44%) ⬇️
python/cudf/cudf/core/column/column.py 89.43% <0.00%> (-0.02%) ⬇️
python/cudf/cudf/core/frame.py 93.41% <0.00%> (ø)
python/cudf/cudf/core/index.py 92.31% <0.00%> (ø)
python/cudf/cudf/core/dtypes.py 97.30% <0.00%> (ø)
python/cudf/cudf/testing/dataset_generator.py 73.25% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.75% <0.00%> (+<0.01%) ⬆️
python/cudf/cudf/core/series.py 95.16% <0.00%> (+<0.01%) ⬆️
python/cudf/cudf/utils/utils.py 90.35% <0.00%> (+0.06%) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f6b70a...7341cbd. Read the comment docs.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments attached.

search_key = cudf.Scalar(search_key)
def index(self, search_key: Union[ScalarLike, ColumnLike]) -> ParentType:
"""
Return integers representing the index of the search key for each row.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first line "brief" should be followed by a blank line before the longer summary.

Suggested change
Return integers representing the index of the search key for each row.
Return integers representing the index of the search key for each row.

Comment on lines 464 to 465
If the search key is not contained in a row, return -1.
If either the row or the search key are null, return <NA>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the search key is contained multiple times, does this return the smallest matching index?

Suggested change
If the search key is not contained in a row, return -1.
If either the row or the search key are null, return <NA>.
If the search key is not contained in a row, return -1.
If either the row or the search key are null, return <NA>.
If the search key is contained multiple times, the smallest matching
index is returned.

try:
res = self._return_or_inplace(index_of(self._column, search_key))
if is_scalar(search_key):
res = self._return_or_inplace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no post-processing needed so I would return directly instead of saving a variable res and returning later.

Suggested change
res = self._return_or_inplace(
return self._return_or_inplace(

],
)
def test_index_invalid(data, scalar):
def test_index_invalid(data, search_key):
Copy link
Contributor

@bdice bdice Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for the invalid case where the search key is not the right length? e.g. len(sr) != len(search_key)

@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar):
),
],
)
def test_index(data, scalar, expect):
def test_index(data, search_key, expect):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case for multi-level nested data? (if that is supported)

sr = cudf.Series({"a": [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]})
sr.list.index([[1, 2], [7, 8]])  # returns [0, 1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. e.g. is_scalar might not be the appropriate check if "list scalars" are provided to check against a list of lists -- scalar-like input would have one fewer dimension / nested level that the input column, while column-like input would have an equal number of nested levels.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions about supported types.

Notes
-----
``index`` only supports list search operations on numeric types,
decimals, chrono types, and strings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Chrono" is a C++ / libcudf term and doesn't appear in the cuDF Python docs. The Python docs discuss datetimes and timedeltas.

To clarify my own understanding, what types are not supported here? It looks like this is exhaustive of the scalar types we support unless there's some catch for categorical or bool?

Suggested change
decimals, chrono types, and strings.
decimals, datetimes, timedeltas, and strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, upon further investigation it looks like bools are supported. But when using a list scalar as a search key on a multi-level nested list series index_of throws:
List search operations are only supported on numeric types, decimals, chrono types, and strings.

Do you know how I might go about testing categorical types?

Copy link
Contributor

@bdice bdice Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your thorough testing! I'm not sure if "list of categorical" is a type that cuDF can represent. Given what you found, I think it would be okay to remove this note about limited type support. Essentially all scalar types that can be used in a list type are supported, from what I can tell.

Even though multi-level data is not supported, I am glad to see that multi-index searches fail with a real error message and not a complicated and cryptic traceback. 😄

@@ -460,15 +471,21 @@ def test_contains_invalid(data, scalar):
),
],
)
def test_index(data, scalar, expect):
def test_index(data, search_key, expect):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still curious about multi-level nesting here. If multi-level nesting is supported, we'll need to revise a few other items as well. e.g. is_scalar might not be the appropriate check if "list scalars" are provided to check against a list of lists -- scalar-like input would have one fewer dimension / nested level that the input column, while column-like input would have an equal number of nested levels.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! If you wish, I suggested one minor change:

Given what you found, I think it would be okay to remove this note about limited type support.

@bdice
Copy link
Contributor

bdice commented Apr 23, 2022

Test failures are unrelated. We'll fix the tests and merge this next week.

@bdice bdice added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Apr 23, 2022
@bdice
Copy link
Contributor

bdice commented Apr 27, 2022

rerun tests

@bdice
Copy link
Contributor

bdice commented Apr 27, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 09995a5 into rapidsai:branch-22.06 Apr 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Support passing a Series to list.index()
3 participants