Add Python bindings for `lists::contains` #7547

skirui-source · 2021-03-10T06:18:11Z

No description provided.

codecov · 2021-03-13T05:04:51Z

Codecov Report

Merging #7547 (961761e) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 90.47%.

❗ Current head 961761e differs from pull request most recent head c784295. Consider uploading reports for the commit c784295 to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7547      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17353     +469     
===============================================
+ Hits             13822    14297     +475     
+ Misses            3062     3056       -6

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/column/column.py | `87.83% <75.00%> (+0.07%)` | ⬆️ |
| python/cudf/cudf/core/column/decimal.py | `92.75% <90.00%> (-2.12%)` | ⬇️ |
| python/cudf/cudf/core/column/lists.py | `92.07% <100.00%> (+0.68%)` | ⬆️ |
| python/cudf/cudf/core/column/string.py | `86.76% <100.00%> (+0.26%)` | ⬆️ |
| python/cudf/cudf/utils/gpu_utils.py | `53.65% <0.00%> (-4.88%)` | ⬇️ |
| python/cudf/cudf/core/abc.py | `87.23% <0.00%> (-1.14%)` | ⬇️ |
| python/cudf/cudf/core/column/numerical.py | `94.85% <0.00%> (-0.17%)` | ⬇️ |
| python/cudf/cudf/io/feather.py | `100.00% <0.00%> (ø)` | |
| python/cudf/cudf/comm/serialize.py | `0.00% <0.00%> (ø)` | |

... and 48 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3136124...c784295.

isVoid · 2021-03-16T20:40:03Z

python/cudf/cudf/tests/test_list.py

@@ -112,3 +112,32 @@ def test_len(data):
    got = gsr.list.len()

    assert_eq(expect, got, check_dtype=False)
+
+
+@pytest.mark.parametrize(


May need more test cases to cover:

cudf/cpp/include/cudf/lists/contains.hpp, lines 37 to 41 in c1c60ba:

     * Output `column[i]` is set to null if one or more of the following are true:
     * 1. The search key `search_key` is null
     * 2. The list row `lists[i]` is null
     * 3. The list row `lists[i]` does not contain the search key, and contains at least
     *    one null.
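The three null rules quoted above can be modelled on the host as a rough reference for writing expected test values. This is only a sketch in plain Python, not the libcudf implementation: `contains_reference` is a hypothetical helper name, and `None` stands in for a null row, a null element, or a null search key.

```python
from typing import Any, List, Optional, Sequence


def contains_reference(
    lists: Sequence[Optional[Sequence[Any]]], search_key: Any
) -> List[Optional[bool]]:
    """Host-side model of the documented null rules (hypothetical helper)."""
    result: List[Optional[bool]] = []
    for row in lists:
        if search_key is None or row is None:
            # Rules 1 and 2: null search key or null list row gives null.
            result.append(None)
        elif any(elem is not None and elem == search_key for elem in row):
            result.append(True)
        elif any(elem is None for elem in row):
            # Rule 3: key not found, but the row holds at least one null.
            result.append(None)
        else:
            result.append(False)
    return result
```

For example, `contains_reference([[None, 2, 3], [], None], 2)` gives `[True, False, None]`, while a null key yields null for every row.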

For cases 1 and 2, what is the correct dtype for the null search_key when constructing the Scalar, i.e. `search_key = cudf.Scalar(value=None, dtype=?)`?

I couldn't find it in the cudf docs: https://docs.rapids.ai/api/cudf/stable/basics.html

Sorry about the late reply. I think for libcudf to work at all, the search key has to be:

- Of the same type as your list column elements

cudf/cpp/src/lists/contains.cu, lines 147 to 148 in a568432:

    CUDF_EXPECTS(lists.child().type() == search_key.type(),
                 "Type/Scale of search key does not match list column element type.");

- One of these types

cudf/cpp/src/lists/contains.cu, lines 57 to 58 in a568432:

    cudf::is_numeric<ElementType>() || cudf::is_chrono<ElementType>() ||
    cudf::is_fixed_point<ElementType>() || std::is_same<ElementType, cudf::string_view>::value;

But then another question to ask is: what error should be raised if the user passes in a scalar of a different type?

Reading @mythrocks's PR thread, it looks like these behaviors are meant to align with SQL's array_contains. Should cudf follow the same semantics? @kkraus14

I think requiring the types to match in the Cython function is absolutely reasonable. If we want to handle automatic typecasting, we can do so from Python in a follow-up.

I figure a TypeError would be most appropriate to raise here, but since libcudf already performs the check for us, we could raise it via a try/except?

Hmm, I would like to discuss the use of try/except here. My argument against it is that checking the datatype of a column is almost trivial compared to something like null_counts, so a double check is acceptable here. The downside of try/except is that Python starts to depend on the error strings that libcudf throws, which is a rather arbitrary match at this stage, since libcudf makes no guarantee about the actual string being thrown (only the cudf::logic_error type).

So I suggest we do an explicit Python check where possible and resort to try/except when the check is non-trivial. I'm open to suggestions. cc @shwina
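As a rough illustration of the explicit check suggested here, the sketch below validates the search key type in Python before any compiled call would be made. Everything in it is an assumption for illustration: `ElementType`, `SUPPORTED_KINDS`, and `validate_search_key` are hypothetical names, and the type categories simply mirror the four predicates quoted from contains.cu rather than any real cudf dtype API.

```python
from dataclasses import dataclass

# Categories mirroring the quoted libcudf predicates (assumed labels).
SUPPORTED_KINDS = frozenset({"numeric", "chrono", "fixed_point", "string"})


@dataclass(frozen=True)
class ElementType:
    """Hypothetical stand-in for a column or scalar dtype."""

    name: str
    kind: str


def validate_search_key(element_type: ElementType, key_type: ElementType) -> None:
    """Raise TypeError up front instead of parsing a backend error string."""
    if element_type.kind not in SUPPORTED_KINDS:
        raise TypeError(
            f"contains is not supported for list elements of type {element_type.name}"
        )
    if key_type != element_type:
        raise TypeError(
            f"search key type {key_type.name} does not match "
            f"list element type {element_type.name}"
        )
```

The check is cheap because it only compares type metadata, which is the argument made above for accepting a duplicate of the check libcudf already performs.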

try...except would be Pythonic if we could catch anything more specific than a generic RuntimeError. It seems like the only mechanism libcudf has to communicate any more information than "something went wrong" is the error message itself. On the other hand, relying on string matching seems fragile, especially given that error messages aren't really guaranteed to remain the same.
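The fragility of string matching can be seen in a small sketch. It assumes a hypothetical backend stub (`make_backend`) that raises a plain `RuntimeError`, as described in this thread, and a wrapper (`call_with_translation`) that converts it to `TypeError` only when the message contains a known fragment. None of these names come from cudf.

```python
from typing import Callable

# Fragment of the message quoted earlier in this thread.
MATCH_FRAGMENT = "does not match list column element type"


def make_backend(message: str) -> Callable[[str, str], bool]:
    """Build a hypothetical stand-in for the compiled call."""

    def backend(element_type: str, key_type: str) -> bool:
        if element_type != key_type:
            raise RuntimeError(message)
        return True

    return backend


def call_with_translation(
    backend: Callable[[str, str], bool], element_type: str, key_type: str
) -> bool:
    """Translate a RuntimeError to TypeError by matching its message text."""
    try:
        return backend(element_type, key_type)
    except RuntimeError as err:
        if MATCH_FRAGMENT in str(err):
            raise TypeError(str(err)) from err
        raise  # unrecognised wording leaks through as RuntimeError
```

With the message as quoted, a type mismatch surfaces as `TypeError`. If the backend message were reworded, the same mismatch would surface as a generic `RuntimeError` with no change on the Python side.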

(Dup of PM) It would be great if libcudf had a set of error classes that inherit from std::runtime_error, which Cython could catch when thrown. These error classes could cover common error categories such as wrong types, out-of-bounds access, unsupported operations, etc.
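A Python-side analogue of that idea might look like the sketch below. It is purely illustrative: the class names and the `translate` helper are hypothetical, and no such hierarchy is implied to exist in libcudf. The point is that dispatching on the exception class removes any dependence on message text.

```python
class BackendError(RuntimeError):
    """Hypothetical base class for errors coming from the compiled layer."""


class TypeMismatchError(BackendError):
    """Wrong or mismatched types."""


class OutOfBoundsError(BackendError):
    """Out-of-bounds access."""


class UnsupportedOperationError(BackendError):
    """Operation not supported for the given input."""


# Map each specific backend error to the Python exception users expect.
_TRANSLATION = {
    TypeMismatchError: TypeError,
    OutOfBoundsError: IndexError,
    UnsupportedOperationError: NotImplementedError,
}


def translate(err: Exception) -> Exception:
    """Return the user-facing exception for a backend error, by class only."""
    for backend_cls, python_cls in _TRANSLATION.items():
        if isinstance(err, backend_cls):
            return python_cls(str(err))
    return err  # unknown errors pass through unchanged
```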

kkraus14

LGTM!

kkraus14 · 2021-03-25T13:33:49Z

rerun tests

dillon-cullinan · 2021-03-25T15:32:54Z

rerun tests

kkraus14 · 2021-03-25T17:33:20Z

Looks like there's a few failures:

data = [[1, 2, 3], []], expect = 0    <NA>
1    <NA>
dtype: float64

    @pytest.mark.parametrize(
        "data, expect",
        [
            ([[1, 2, 3], []], [None, None],),
            ([[1.0, 2.0, 3.0], None, []], [None, None, None],),
            ([[None, 2, 3], [], None], [None, None, None],),
            ([[1, 2, 3], [3, 4, 5]], [None, None],),
            ([[], [], []], [None, None, None],),
        ],
    )
    def test_contains_null_search_key(data, expect):
        sr = cudf.Series(data)
        expect = cudf.Series(expect)
        got = sr.list.contains(cudf.Scalar(cudf.NA, sr.dtype.element_type))
>       assert_eq(expect, got)

cudf/tests/test_list.py:283: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = 0   NaN
1   NaN
dtype: float64, right = 0    None
1    None
dtype: object
kwargs = {}

    def assert_eq(left, right, **kwargs):
        """ Assert that two cudf-like things are equivalent
    
        This equality test works for pandas/cudf dataframes/series/indexes/scalars
        in the same way, and so makes it easier to perform parametrized testing
        without switching between assert_frame_equal/assert_series_equal/...
        functions.
        """
        if hasattr(left, "to_pandas"):
            left = left.to_pandas()
        if hasattr(right, "to_pandas"):
            right = right.to_pandas()
        if isinstance(left, cupy.ndarray):
            left = cupy.asnumpy(left)
        if isinstance(right, cupy.ndarray):
            right = cupy.asnumpy(right)
    
        if isinstance(left, pd.DataFrame):
            tm.assert_frame_equal(left, right, **kwargs)
        elif isinstance(left, pd.Series):
>           tm.assert_series_equal(left, right, **kwargs)
E           AssertionError: Attributes of Series are different
E           
E           Attribute "dtype" are different
E           [left]:  float64
E           [right]: object

cudf/tests/utils.py:89: AssertionError

skirui-source · 2021-03-25T17:37:22Z

I just fixed it, you can re-run the tests now 👍

kkraus14 · 2021-03-25T17:46:25Z

@gpucibot merge

Added contains.pxd file

ffec377

skirui-source added the `feature request`, `Python`, `Cython`, and `non-breaking` labels Mar 10, 2021

skirui-source requested review from shwina and isVoid March 10, 2021 06:18

skirui-source self-assigned this Mar 10, 2021

skirui-source added 3 commits March 9, 2021 23:00

working on the python API interface for contains()

5498698

added contains_elements() in lists.pyx

82113fe

added python API and tests for scalar

961761e

skirui-source added the `3 - Ready for Review` label Mar 13, 2021

skirui-source marked this pull request as ready for review March 16, 2021 19:52

skirui-source requested a review from a team as a code owner March 16, 2021 19:52

isVoid reviewed Mar 16, 2021

skirui-source added 2 commits March 16, 2021 22:07

added tests for null key and null rows- cases

348d2eb

fix merge conflict

a3214c3

isVoid reviewed Mar 20, 2021

python/cudf/cudf/tests/test_list.py (outdated)

kkraus14 reviewed Mar 22, 2021

python/cudf/cudf/_lib/lists.pyx (outdated)

kkraus14 reviewed Mar 22, 2021

python/cudf/cudf/core/column/lists.py (outdated)

skirui-source added 2 commits March 22, 2021 17:46

address review comments. dtype assertion failing for two tests

831820b

.

8165cad

skirui-source requested a review from isVoid March 23, 2021 00:51

kkraus14 reviewed Mar 23, 2021

python/cudf/cudf/core/column/lists.py (outdated)

skirui-source and others added 5 commits March 22, 2021 19:55

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into c…

898cf82

…ontains

fix merge conflict

47621b4

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into c…

6b10220

…ontains

added try-except to handle typing

026450f

Merge branch 'branch-0.19' into contains

9d2fd1e

skirui-source requested a review from kkraus14 March 24, 2021 04:09

kkraus14 approved these changes Mar 24, 2021

isVoid approved these changes Mar 24, 2021

python/cudf/cudf/tests/test_list.py (outdated)

fix style issues

b18bc62

kkraus14 added the `5 - Ready to Merge` label and removed the `3 - Ready for Review` label Mar 24, 2021

skirui-source added 2 commits March 24, 2021 22:03

Added couple more tests

e9bf1df

Merge branch 'contains' of github.com:skirui-source/cudf into contains

b28602a

kkraus14 added the `0 - Waiting on Author` label and removed the `5 - Ready to Merge` label Mar 25, 2021

added bool dtype

c784295

skirui-source added the `3 - Ready for Review` label and removed the `0 - Waiting on Author` label Mar 25, 2021

kkraus14 approved these changes Mar 25, 2021

kkraus14 added the `5 - Ready to Merge` label and removed the `3 - Ready for Review` label Mar 25, 2021

skirui-source linked an issue Mar 25, 2021 that may be closed by this pull request

[FEA] Python bindings for lists::contains #7462

Closed

rapids-bot bot merged commit 000978e into rapidsai:branch-0.19 Mar 25, 2021

skirui-source deleted the contains branch March 25, 2021 20:31

skirui-source restored the contains branch March 25, 2021 20:36

skirui-source deleted the contains branch May 6, 2021 17:43

Review comment authors, in order: isVoid (Mar 16, 2021), skirui-source (Mar 17, 2021, edited), isVoid (Mar 18, 2021), isVoid (Mar 20, 2021), kkraus14 (Mar 20, 2021), isVoid (Mar 23, 2021), shwina (Mar 23, 2021), isVoid (Mar 24, 2021)
