Add Python bindings for lists::contains
#7547
Codecov Report
@@              Coverage Diff               @@
##           branch-0.19    #7547     +/-   ##
===============================================
+ Coverage      81.86%   82.38%   +0.52%
===============================================
  Files            101      101
  Lines          16884    17353    +469
===============================================
+ Hits           13822    14297    +475
+ Misses          3062     3056      -6
python/cudf/cudf/tests/test_list.py
@@ -112,3 +112,32 @@ def test_len(data):
    got = gsr.list.len()

    assert_eq(expect, got, check_dtype=False)


@pytest.mark.parametrize(
May need more test cases to cover:
cudf/cpp/include/cudf/lists/contains.hpp
Lines 37 to 41 in c1c60ba
 * Output `column[i]` is set to null if one or more of the following are true:
 * 1. The search key `search_key` is null
 * 2. The list row `lists[i]` is null
 * 3. The list row `lists[i]` does not contain the search key, and contains at least
 *    one null.
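To make the extra test cases concrete, the three rules quoted above can be modelled in pure Python. This is only a sketch of the documented semantics, not the libcudf implementation: `contains_with_nulls` is a hypothetical helper, Python lists stand in for list rows, and `None` stands in for null.

```python
def contains_with_nulls(lists, search_key):
    """Model the documented null semantics of lists::contains.

    Hypothetical helper for illustration: Python lists stand in for
    list rows and None stands in for null.
    """
    out = []
    for row in lists:
        if search_key is None or row is None:
            # Rules 1 and 2: null search key or null list row.
            out.append(None)
        elif search_key in row:
            out.append(True)
        elif any(elem is None for elem in row):
            # Rule 3: key not found, but the row holds at least one null.
            out.append(None)
        else:
            out.append(False)
    return out


rows = [[1, 2, 3], None, [4, None], [2, None], []]
result = contains_with_nulls(rows, 2)
all_null = contains_with_nulls(rows, None)
```

Each of the null-producing branches would correspond to one parametrized case in the test.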
For 1 and 2, what is the correct dtype for the null search_key when constructing the Scalar, i.e. search_key = cudf.Scalar(value=None, dtype=?)?
I couldn't find it in the cudf docs: https://docs.rapids.ai/api/cudf/stable/basics.html
Sorry about the late reply. I think that for libcudf to work at all, the search key has to be:
- Of the same type as your list column's elements
cudf/cpp/src/lists/contains.cu
Lines 147 to 148 in a568432
CUDF_EXPECTS(lists.child().type() == search_key.type(), "Type/Scale of search key does not match list column element type.");
- One of these types
cudf/cpp/src/lists/contains.cu
Lines 57 to 58 in a568432
cudf::is_numeric<ElementType>() || cudf::is_chrono<ElementType>() || cudf::is_fixed_point<ElementType>() || std::is_same<ElementType, cudf::string_view>::value;
But then another question is: what error should be raised if the user passes in a scalar of a different type?
Reading @mythrocks's PR thread, it looks like these behaviors are meant to align with SQL's array_contain. Should cudf follow the same semantics? @kkraus14
I think requiring the types to match in the Cython function is absolutely reasonable. If we want to handle typecasting automatically, we can do so from Python in a follow-up.
I figure a TypeError would be most appropriate to raise here, but since libcudf already does the check for us, could we do so via a try / except?
Hmm, I would like to discuss the use of try/except here. My argument against it is that checking the dtype of a column is almost trivial compared to something like null_counts, so a double check is acceptable here. The downside of try/except is that Python starts to depend on the error strings that libcudf throws. Matching on them is rather arbitrary at this stage, and libcudf makes no guarantee about the actual string being thrown (beyond the cudf::logic_error type).
So I suggest we do an explicit Python check where possible and resort to try/except when the check is non-trivial. I'm open to suggestions. cc @shwina
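An explicit check along these lines could be as small as the following sketch. The helper name `check_search_key_dtype` is hypothetical and dtypes are represented as plain strings for illustration; the real binding would compare the list column's element dtype with the scalar's dtype.

```python
def check_search_key_dtype(element_dtype, key_dtype):
    """Raise TypeError when the search key dtype differs from the element dtype.

    Hypothetical helper; dtypes are plain strings here for illustration.
    """
    if element_dtype != key_dtype:
        raise TypeError(
            f"Search key dtype {key_dtype!r} does not match "
            f"list element dtype {element_dtype!r}."
        )


# Matching dtypes pass silently.
check_search_key_dtype("int64", "int64")

# A mismatch raises TypeError before libcudf is ever called.
try:
    check_search_key_dtype("int64", "float64")
except TypeError as err:
    message = str(err)
```

The error text is owned by the Python layer here, so it does not depend on whatever string libcudf happens to throw.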
try...except would be Pythonic if we could catch anything more specific than a generic RuntimeError. It seems like the only mechanism libcudf has to communicate any more information than "something went wrong" is the error message itself. On the other hand, relying on string matching seems fragile, especially given that error messages aren't really guaranteed to remain the same.
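To illustrate the fragility, here is a minimal sketch of translation by string matching. `fake_libcudf_contains` is a stand-in that simply raises RuntimeError with a given message, and the "reworded" message is invented for the example; neither is real libcudf behavior.

```python
def fake_libcudf_contains(message):
    # Stand-in for a failing libcudf call surfaced to Python as RuntimeError.
    raise RuntimeError(message)


def contains_with_translation(message):
    """Translate a type-mismatch failure to TypeError by matching message text."""
    try:
        fake_libcudf_contains(message)
    except RuntimeError as err:
        if "does not match list column element type" in str(err):
            raise TypeError(str(err)) from err
        raise


def raised_type(message):
    # Report which exception type finally reaches the caller.
    try:
        contains_with_translation(message)
    except Exception as err:
        return type(err)


current = "Type/Scale of search key does not match list column element type."
reworded = "Search key type differs from the list element type."  # hypothetical

type_for_current = raised_type(current)
type_for_reworded = raised_type(reworded)
```

With the current message the caller sees a TypeError, but once the wording changes the same failure falls through as a plain RuntimeError without any Python code having been touched.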
(Dup of PM) It would be great if libcudf had a set of error classes that inherit from std::runtime_error, which Cython could catch when they are thrown. These error classes could cover common error categories such as wrong types, out-of-bounds access, and unsupported operations.
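As a rough picture of what that could buy the Python layer, the sketch below models such a hierarchy in Python. All class names and the `translate` helper are hypothetical; the proposal itself concerns C++ classes, and nothing like this is confirmed to exist in libcudf.

```python
class LibcudfError(RuntimeError):
    """Hypothetical base class mirroring a C++ error derived from std::runtime_error."""


class DataTypeMismatch(LibcudfError):
    """Hypothetical: operand types do not match."""


class OutOfBounds(LibcudfError):
    """Hypothetical: index outside the valid range."""


class UnsupportedOperation(LibcudfError):
    """Hypothetical: operation not supported for this type."""


# Map each specific error class to the Python exception a user would expect.
_PYTHON_ERRORS = {
    DataTypeMismatch: TypeError,
    OutOfBounds: IndexError,
    UnsupportedOperation: NotImplementedError,
}


def translate(err):
    """Return the Python-facing exception for a LibcudfError, by class, not by message."""
    for libcudf_cls, python_cls in _PYTHON_ERRORS.items():
        if isinstance(err, libcudf_cls):
            return python_cls(str(err))
    return err  # unknown errors pass through unchanged


translated = translate(DataTypeMismatch("search key type mismatch"))
untouched = translate(LibcudfError("something went wrong"))
```

Dispatching on the class means the error message can be reworded freely without changing which Python exception is raised.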
LGTM!
rerun tests
Looks like there's a few failures:
I just fixed it, you can re-run the tests now 👍
@gpucibot merge