Fix `Series` comparison vs scalars #12519

brandon-b-miller · 2023-01-10T16:19:40Z

Fixes an issue where this happens:

import cudf
cudf.Series(['a','b','c']) == 1

  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/mixins/mixin_factory.py", line 11, in wrapper
    return method(self, *args1, *args2, **kwargs1, **kwargs2)
  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3278, in _binaryop
    ColumnAccessor(type(self)._colwise_binop(operands, op)),
  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/column_accessor.py", line 124, in __init__
    column_length = len(data[next(iter(data))])
TypeError: object of type 'bool' has no len()

It turns out this happens because StringColumn's normalize_binop_value method returns NotImplemented for scalars that are not of dtype object. This eventually causes python to dispatch to the python scalar class' __eq__ which returns the scalar False when encountering a cuDF object. cuDF expects a column object at this point but has a scalar.

This in turn causes cuDF to try and construct a ColumnAccessor around a dict that looks like {'name', False} ultimately throwing the error.

This PR proposes to earlystop this behavior according to the rules for comparing python string scalars with other objects:

Always return False for __eq__ even if the character in the string is equivalent to whatever is being compared
Always return True for __ne__ ditto above.
Copy the input mask

This should align us with pandas behavior for this case:

>>> pd.Series(['a','b', 'c'], dtype='string') == 1
0    False
1    False
2    False
dtype: boolean
>>> pd.Series(['a','b', 'c'], dtype='string') != 1
0    True
1    True
2    True
dtype: boolean

EDIT:
Updating this PR to handle a similar issue resulting in the same error when comparing datetime series to strings that contain valid datetimes, such as 20110101.

bdice · 2023-01-10T16:53:49Z

@brandon-b-miller This seems closely related to my (long-stalled due to GHA migration) work on #11609. We may have some broader fixes to make for equality and inequality operators in particular that do not affect other operators, because those operators are originally defined by object and are inherited, making overrides act in slightly atypical ways. I’ll investigate the overlap when I review this!

brandon-b-miller · 2023-01-10T16:58:59Z

@brandon-b-miller This seems closely related to my (long-stalled due to GHA migration) work on #11609. We may have some broader fixes to make for equality and inequality operators in particular. I’ll investigate the overlap when I review this!

Ah you're right! Indeed, I have not added test cases for ops that should raise. I like the approach in that PR for placing the logic for those error cases up in the mixin. My initial reaction was going to be to throw in _binaryop around where this PR is already touching code, but that's not nearly as general/reusable. It might make sense to leave the logic for the cases that should return something here though.

codecov · 2023-01-10T17:34:55Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@f0a17ca). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 7172f04 differs from pull request most recent head 06cf9af. Consider uploading reports for the commit 06cf9af to get more accurate results

Additional details and impacted files

@@               Coverage Diff               @@
##             branch-23.04   #12519   +/-   ##
===============================================
  Coverage                ?   85.81%           
===============================================
  Files                   ?      158           
  Lines                   ?    25162           
  Branches                ?        0           
===============================================
  Hits                    ?    21592           
  Misses                  ?     3570           
  Partials                ?        0

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

brandon-b-miller · 2023-01-11T15:20:56Z

Discovered a similar issue comparing series of datetime dtype to strings such as 20110101, rolling the change into this PR. This piece of the change still needs tests.

galipremsagar · 2023-02-02T19:23:42Z

python/cudf/cudf/core/column/string.py

+                elif op == "__ne__":
+                    val = True
+                return column.full(len(self), val, dtype="bool").set_mask(
+                    self.mask
+                )


Suggested change

elif op == "__ne__":

val = True

return column.full(len(self), val, dtype="bool").set_mask(

self.mask

)

elif op == "__ne__":

val = True

else:

raise TypeError(f"{op} between {self.dtype} and {other.dtype}") not supported")

return column.full(len(self), val, dtype="bool").set_mask(

self.mask

)

We might want to have an else block here to throw an error for rest of the ops like +, -, etc. Else we run into unbounded variable errors:

In [2]: s = cudf.Series([1, 2, 1, 2], dtype='str') In [5]: s + '1' Out[5]: 0 11 1 21 2 11 3 21 dtype: object In [6]: s + 1 --------------------------------------------------------------------------- UnboundLocalError Traceback (most recent call last) Cell In[6], line 1 ----> 1 s + 1 File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2) 10 def wrapper(self, *args2, **kwargs2): ---> 11 return method(self, *args1, *args2, **kwargs1, **kwargs2) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/indexed_frame.py:3282, in IndexedFrame._binaryop(self, other, op, fill_value, can_reindex, *args, **kwargs) 3278 if operands is NotImplemented: 3279 return NotImplemented 3281 return self._from_data( -> 3282 ColumnAccessor(type(self)._colwise_binop(operands, op)), 3283 index=out_index, 3284 ) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds) 76 @wraps(func) 77 def inner(*args, **kwds): 78 with self._recreate_cm(): ---> 79 return func(*args, **kwds) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/frame.py:1728, in Frame._colwise_binop(cls, operands, fn) 1720 assert False, "At least one operand must be a column." 1722 # TODO: Disable logical and binary operators between columns that 1723 # are not numerical using the new binops mixin. 1725 outcol = ( 1726 getattr(operator, fn)(right_column, left_column) 1727 if reflect -> 1728 else getattr(operator, fn)(left_column, right_column) 1729 ) 1731 if output_mask is not None: 1732 outcol = outcol.set_mask(output_mask) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2) 10 def wrapper(self, *args2, **kwargs2): ---> 11 return method(self, *args1, *args2, **kwargs1, **kwargs2) File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/string.py:5709, in StringColumn._binaryop(self, other, op) 5707 elif op == "__ne__": 5708 val = True -> 5709 return column.full(len(self), val, dtype="bool").set_mask( 5710 self.mask 5711 ) 5713 if op == "__add__": 5714 if isinstance(other, cudf.Scalar): UnboundLocalError: local variable 'val' referenced before assignment

Adjusted this

wence-

Sorry, I failed to hit go on my review. I think I have one query around the handling of ordering comparisons between string and non-string that this introduces. Along with a request to refact to the test fixtures a little.

wence- · 2023-02-03T11:18:33Z

python/cudf/cudf/core/column/string.py

+                    "__lt__",
+                    "__le__",
+                    "__gt__",
+                    "__ge__",


This seems wrong. Python, and pandas raise TypeError for ordering between str and not-str:

In [5]: s = pd.Series(["1", "2", "3"]) In [6]: s < 1 --------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[6], line 1 ----> 1 s < 1 File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 68 return NotImplemented 70 other = item_from_zerodim(other) ---> 72 return method(self, other) File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/arraylike.py:50, in OpsMixin.__lt__(self, other) 48 @unpack_zerodim_and_defer("__lt__") 49 def __lt__(self, other): ---> 50 return self._cmp_method(other, operator.lt) File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/series.py:6243, in Series._cmp_method(self, other, op) 6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True) 6242 with np.errstate(all="ignore"): -> 6243 res_values = ops.comparison_op(lvalues, rvalues, op) 6245 return self._construct_result(res_values, name=res_name) File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:287, in comparison_op(left, right, op) 284 return invalid_comparison(lvalues, rvalues, op) 286 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str): --> 287 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues) 289 else: 290 res_values = _na_arithmetic_op(lvalues, rvalues, op, is_cmp=True) File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:75, in comp_method_OBJECT_ARRAY(op, x, y) 73 result = libops.vec_compare(x.ravel(), y.ravel(), op) 74 else: ---> 75 result = libops.scalar_compare(x.ravel(), y, op) 76 return result.reshape(x.shape) File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/_libs/ops.pyx:107, in pandas._libs.ops.scalar_compare() TypeError: '<' not supported between instances of 'str' and 'int'

Which I think is achievable by return NotImplemented instead?

wence- · 2023-02-03T11:19:33Z

python/cudf/cudf/core/column/string.py

+                    val = False
+                    if op == "__ne__":
+                        val = True
+                    return column.full(len(self), val, dtype="bool").set_mask(
+                        self.mask
+                    )


Suggested change

val = False

if op == "__ne__":

val = True

return column.full(len(self), val, dtype="bool").set_mask(

self.mask

)

return column.full(len(self), op == "__ne__", dtype="bool").set_mask(

self.mask

)

wence- · 2023-02-03T11:31:18Z

python/cudf/cudf/tests/test_binops.py

+    for case in eq_neq_cases:
+        cases += [(*case, operator.eq), (*case, operator.ne)]
+
+    return cases


Please use a fixture to generate these cases, something like:

@pytest.fixture def string_series_compare(): return pd.Series(...) @pytest.fixture(params=[False, True], ids=["normal", "reflected"]) def string_series_compare_reflect(request): return request.param @pytest.fixture(params=[1, 1.5, True], ids=["int", "float", "bool"]) def string_series_compare_other(request): return request.param @pytest.fixture(params=_cmpops): def cmpop(request): return request.param def test_string_series_compare(string_series_compare, string_series_compare_reflect, string_series_compare_other, cmpop): assert whatever

Could you help me understand a bit more why this approach is better? I've worked it up locally a few different ways and it seems like this forces some kind of programmatic skipping inside the test body since we're not dealing with a cartesian product of the parameterization, whereas the current approach just generates the exact tests we need. Is it the test output that you're concerned about here? (I think this would show up as test_string_series_compare[data0] on the command line for instance as opposed to an expanded output listing the actual parameters)

We're trying to move away from a setup where test collection creates heavyweight objects (e.g. the series here).

I see that you don't actually need the cartesian product. Does it work to separate the eq/neq cases from the other?

You can then have two test functions that handle the general case and then the eq/neq case taking different parameters. There's some discussion here https://docs.rapids.ai/api/cudf/nightly/developer_guide/testing.html

Got it, thanks for the context. An attempt at 7172f04

wence- · 2023-02-09T10:13:46Z

Can you merge trunk again (to pick up #12723), I think that will fix the java test fails.

wence-

Thanks!

brandon-b-miller · 2023-02-10T17:36:19Z

/merge

brandon-b-miller added 4 commits January 10, 2023 07:52

add tests, pass through non-object type scalars

a56cc27

Merge branch 'branch-23.02' into fix-series-compare-vs-int

9b25ba1

style

a694b2b

small update

8fb8c11

brandon-b-miller added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Jan 10, 2023

brandon-b-miller requested a review from a team as a code owner January 10, 2023 16:19

brandon-b-miller self-assigned this Jan 10, 2023

brandon-b-miller requested review from bdice and isVoid January 10, 2023 16:19

pass string scalars through datetime binops if possible

671e7e5

brandon-b-miller changed the title ~~Fix str dtype Series comparison vs scalars~~ Fix Series comparison vs scalars Jan 11, 2023

brandon-b-miller added 3 commits January 12, 2023 06:28

add tests for passing through strings to datetime binops

8c30b11

Merge branch 'branch-23.02' into fix-series-compare-vs-int

1783753

Merge branch 'branch-23.04' into fix-series-compare-vs-int

2394d84

brandon-b-miller requested review from a team as code owners February 2, 2023 15:14

brandon-b-miller requested review from karthikeyann and removed request for a team February 2, 2023 15:14

brandon-b-miller changed the base branch from branch-23.02 to branch-23.04 February 2, 2023 15:14

brandon-b-miller removed the request for review from a team February 2, 2023 15:15

brandon-b-miller removed request for a team and karthikeyann February 2, 2023 15:15

galipremsagar reviewed Feb 2, 2023

View reviewed changes

adjust logic

12bdb88

galipremsagar approved these changes Feb 2, 2023

View reviewed changes

wence- requested changes Feb 6, 2023

View reviewed changes

brandon-b-miller added 3 commits February 6, 2023 07:48

Merge branch 'branch-23.04' into fix-series-compare-vs-int

cfee35b

Adjust logic

aa1c6a5

refactor to fixtures

7172f04

Merge branch 'branch-23.04' into fix-series-compare-vs-int

06cf9af

wence- approved these changes Feb 9, 2023

View reviewed changes

rapids-bot bot merged commit c4a1389 into rapidsai:branch-23.04 Feb 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix `Series` comparison vs scalars #12519

Fix `Series` comparison vs scalars #12519

brandon-b-miller commented Jan 10, 2023 •

edited

Loading

bdice commented Jan 10, 2023 •

edited

Loading

brandon-b-miller commented Jan 10, 2023

codecov bot commented Jan 10, 2023 •

edited

Loading

brandon-b-miller commented Jan 11, 2023

galipremsagar Feb 2, 2023

brandon-b-miller Feb 2, 2023

wence- left a comment

wence- Feb 3, 2023

wence- Feb 3, 2023

wence- Feb 3, 2023

brandon-b-miller Feb 8, 2023

wence- Feb 8, 2023

brandon-b-miller Feb 8, 2023

wence- commented Feb 9, 2023

wence- left a comment

brandon-b-miller commented Feb 10, 2023

Fix Series comparison vs scalars #12519

Fix Series comparison vs scalars #12519

Conversation

brandon-b-miller commented Jan 10, 2023 • edited Loading

bdice commented Jan 10, 2023 • edited Loading

brandon-b-miller commented Jan 10, 2023

codecov bot commented Jan 10, 2023 • edited Loading

Codecov Report

brandon-b-miller commented Jan 11, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wence- left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wence- commented Feb 9, 2023

wence- left a comment

Choose a reason for hiding this comment

brandon-b-miller commented Feb 10, 2023

Fix `Series` comparison vs scalars #12519

Fix `Series` comparison vs scalars #12519

brandon-b-miller commented Jan 10, 2023 •

edited

Loading

bdice commented Jan 10, 2023 •

edited

Loading

codecov bot commented Jan 10, 2023 •

edited

Loading