Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python bindings for lists::contains #7547

Merged
merged 17 commits into from
Mar 25, 2021

Conversation

skirui-source
Copy link
Contributor

No description provided.

@skirui-source skirui-source added feature request New feature or request Python Affects Python cuDF API. Cython non-breaking Non-breaking change labels Mar 10, 2021
@skirui-source skirui-source requested review from shwina and isVoid March 10, 2021 06:18
@skirui-source skirui-source self-assigned this Mar 10, 2021
@skirui-source skirui-source added the 3 - Ready for Review Ready for review by team label Mar 13, 2021
@codecov
Copy link

codecov bot commented Mar 13, 2021

Codecov Report

Merging #7547 (961761e) into branch-0.19 (7871e7a) will increase coverage by 0.52%.
The diff coverage is 90.47%.

❗ Current head 961761e differs from pull request most recent head c784295. Consider uploading reports for the commit c784295 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7547      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.52%     
===============================================
  Files              101      101              
  Lines            16884    17353     +469     
===============================================
+ Hits             13822    14297     +475     
+ Misses            3062     3056       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 87.83% <75.00%> (+0.07%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <90.00%> (-2.12%) ⬇️
python/cudf/cudf/core/column/lists.py 92.07% <100.00%> (+0.68%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/core/abc.py 87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
... and 48 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3136124...c784295. Read the comment docs.

@skirui-source skirui-source marked this pull request as ready for review March 16, 2021 19:52
@skirui-source skirui-source requested a review from a team as a code owner March 16, 2021 19:52
@@ -112,3 +112,32 @@ def test_len(data):
got = gsr.list.len()

assert_eq(expect, got, check_dtype=False)


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May need more test cases to cover:

* Output `column[i]` is set to null if one or more of the following are true:
* 1. The search key `search_key` is null
* 2. The list row `lists[i]` is null
* 3. The list row `lists[i]` does not contain the search key, and contains at least
* one null.

Copy link
Contributor Author

@skirui-source skirui-source Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1 and 2, what is the correct dtype for the null search_key when constructing the Scalar?
i.e search_key= cudf.Scalar(value=None, dtype=?) .

Couldn't find it on cudf docs --> https://docs.rapids.ai/api/cudf/stable/basics.html

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about late reply. I think for libcudf even to work, they have to be:

  1. Of the same type of your list column elements
    CUDF_EXPECTS(lists.child().type() == search_key.type(),
    "Type/Scale of search key does not match list column element type.");
  2. Is one these types
    cudf::is_numeric<ElementType>() || cudf::is_chrono<ElementType>() ||
    cudf::is_fixed_point<ElementType>() || std::is_same<ElementType, cudf::string_view>::value;

But then another question to ask is, what error should be raised if the user passes in different types of scalars?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading @mythrocks 's PR thread, it looks like these behaviors are to align with SQL's array_contain. Should cudf follow the same semantics? @kkraus14

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think requiring the types to match in the Cython function is absolutely reasonable. If we want to handle automatically typecasting we can do so from the Python and do it in a follow up.

I figure a TypeError would be most appropriate to raise here, but since libcudf already does it for us, we could do so via a try / except?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I would like to discuss about the use of try/except here. My argument against it is that checking the datatype of a column is almost trivial, as compared to one like null_counts. So a double check is acceptable here. The downside of try/except is that python starts to depend on the error strings that libcudf throws - which is rather a random match at this stage and libcudf has no guarantee of the actual string being thrown (besides cudf::logic_error type).

So I suggest we do explicit python check if possible and resort to try/except when the check is non-trivial. I'm open to suggestions. cc @shwina

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try...except would be Pythonic if we could catch anything more specific than a generic RuntimeError. It seems like the only mechanism libcudf has to communicate any more information than "something went wrong" is the error message itself. On the other hand, relying on string matching seems fragile, especially given that error messages aren't really guaranteed to remain the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Dup of PM) It would be great if libcudf has a set error classes that inherits std::runtime_error, which can be captured by Cython if thrown. These error classes can maintain some common error interfaces like wrong types; out of bounds access; unsupported operation etc.

python/cudf/cudf/core/column/lists.py Outdated Show resolved Hide resolved
@skirui-source skirui-source requested a review from isVoid March 23, 2021 00:51
@skirui-source skirui-source requested a review from kkraus14 March 24, 2021 04:09
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

python/cudf/cudf/tests/test_list.py Outdated Show resolved Hide resolved
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 24, 2021
@kkraus14
Copy link
Collaborator

rerun tests

1 similar comment
@dillon-cullinan
Copy link
Contributor

rerun tests

@kkraus14
Copy link
Collaborator

Looks like there's a few failures:

data = [[1, 2, 3], []], expect = 0    <NA>
1    <NA>
dtype: float64

    @pytest.mark.parametrize(
        "data, expect",
        [
            ([[1, 2, 3], []], [None, None],),
            ([[1.0, 2.0, 3.0], None, []], [None, None, None],),
            ([[None, 2, 3], [], None], [None, None, None],),
            ([[1, 2, 3], [3, 4, 5]], [None, None],),
            ([[], [], []], [None, None, None],),
        ],
    )
    def test_contains_null_search_key(data, expect):
        sr = cudf.Series(data)
        expect = cudf.Series(expect)
        got = sr.list.contains(cudf.Scalar(cudf.NA, sr.dtype.element_type))
>       assert_eq(expect, got)

cudf/tests/test_list.py:283: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = 0   NaN
1   NaN
dtype: float64, right = 0    None
1    None
dtype: object
kwargs = {}

    def assert_eq(left, right, **kwargs):
        """ Assert that two cudf-like things are equivalent
    
        This equality test works for pandas/cudf dataframes/series/indexes/scalars
        in the same way, and so makes it easier to perform parametrized testing
        without switching between assert_frame_equal/assert_series_equal/...
        functions.
        """
        if hasattr(left, "to_pandas"):
            left = left.to_pandas()
        if hasattr(right, "to_pandas"):
            right = right.to_pandas()
        if isinstance(left, cupy.ndarray):
            left = cupy.asnumpy(left)
        if isinstance(right, cupy.ndarray):
            right = cupy.asnumpy(right)
    
        if isinstance(left, pd.DataFrame):
            tm.assert_frame_equal(left, right, **kwargs)
        elif isinstance(left, pd.Series):
>           tm.assert_series_equal(left, right, **kwargs)
E           AssertionError: Attributes of Series are different
E           
E           Attribute "dtype" are different
E           [left]:  float64
E           [right]: object

cudf/tests/utils.py:89: AssertionError

@kkraus14 kkraus14 added 0 - Waiting on Author Waiting for author to respond to review and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Mar 25, 2021
@skirui-source
Copy link
Contributor Author

skirui-source commented Mar 25, 2021

Looks like there's a few failures:

data = [[1, 2, 3], []], expect = 0    <NA>
1    <NA>
dtype: float64

    @pytest.mark.parametrize(
        "data, expect",
        [
            ([[1, 2, 3], []], [None, None],),
            ([[1.0, 2.0, 3.0], None, []], [None, None, None],),
            ([[None, 2, 3], [], None], [None, None, None],),
            ([[1, 2, 3], [3, 4, 5]], [None, None],),
            ([[], [], []], [None, None, None],),
        ],
    )
    def test_contains_null_search_key(data, expect):
        sr = cudf.Series(data)
        expect = cudf.Series(expect)
        got = sr.list.contains(cudf.Scalar(cudf.NA, sr.dtype.element_type))
>       assert_eq(expect, got)

cudf/tests/test_list.py:283: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = 0   NaN
1   NaN
dtype: float64, right = 0    None
1    None
dtype: object
kwargs = {}

    def assert_eq(left, right, **kwargs):
        """ Assert that two cudf-like things are equivalent
    
        This equality test works for pandas/cudf dataframes/series/indexes/scalars
        in the same way, and so makes it easier to perform parametrized testing
        without switching between assert_frame_equal/assert_series_equal/...
        functions.
        """
        if hasattr(left, "to_pandas"):
            left = left.to_pandas()
        if hasattr(right, "to_pandas"):
            right = right.to_pandas()
        if isinstance(left, cupy.ndarray):
            left = cupy.asnumpy(left)
        if isinstance(right, cupy.ndarray):
            right = cupy.asnumpy(right)
    
        if isinstance(left, pd.DataFrame):
            tm.assert_frame_equal(left, right, **kwargs)
        elif isinstance(left, pd.Series):
>           tm.assert_series_equal(left, right, **kwargs)
E           AssertionError: Attributes of Series are different
E           
E           Attribute "dtype" are different
E           [left]:  float64
E           [right]: object

cudf/tests/utils.py:89: AssertionError

I just fixed it, you can re-run the tests now 👍

@skirui-source skirui-source added 3 - Ready for Review Ready for review by team and removed 0 - Waiting on Author Waiting for author to respond to review labels Mar 25, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 25, 2021
@skirui-source skirui-source linked an issue Mar 25, 2021 that may be closed by this pull request
@rapids-bot rapids-bot bot merged commit 000978e into rapidsai:branch-0.19 Mar 25, 2021
@skirui-source skirui-source deleted the contains branch March 25, 2021 20:31
@skirui-source skirui-source restored the contains branch March 25, 2021 20:36
@skirui-source skirui-source deleted the contains branch May 6, 2021 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Python bindings for lists::contains
5 participants