Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Series comparison vs scalars #12519

Merged

Conversation

brandon-b-miller
Copy link
Contributor

@brandon-b-miller brandon-b-miller commented Jan 10, 2023

Fixes an issue where this happens:

import cudf
cudf.Series(['a','b','c']) == 1
  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/mixins/mixin_factory.py", line 11, in wrapper
    return method(self, *args1, *args2, **kwargs1, **kwargs2)
  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 3278, in _binaryop
    ColumnAccessor(type(self)._colwise_binop(operands, op)),
  File "/raid/brmiller/anaconda/envs/cudf_dev/lib/python3.9/site-packages/cudf/core/column_accessor.py", line 124, in __init__
    column_length = len(data[next(iter(data))])
TypeError: object of type 'bool' has no len()

It turns out this happens because StringColumn's normalize_binop_value method returns NotImplemented for scalars that are not of dtype object. This eventually causes python to dispatch to the python scalar class' __eq__ which returns the scalar False when encountering a cuDF object. cuDF expects a column object at this point but has a scalar.

This in turn causes cuDF to try and construct a ColumnAccessor around a dict that looks like {'name', False} ultimately throwing the error.

This PR proposes to earlystop this behavior according to the rules for comparing python string scalars with other objects:

  • Always return False for __eq__ even if the character in the string is equivalent to whatever is being compared
  • Always return True for __ne__ ditto above.
  • Copy the input mask

This should align us with pandas behavior for this case:

>>> pd.Series(['a','b', 'c'], dtype='string') == 1
0    False
1    False
2    False
dtype: boolean
>>> pd.Series(['a','b', 'c'], dtype='string') != 1
0    True
1    True
2    True
dtype: boolean

EDIT:
Updating this PR to handle a similar issue resulting in the same error when comparing datetime series to strings that contain valid datetimes, such as 20110101.

@brandon-b-miller brandon-b-miller added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Jan 10, 2023
@brandon-b-miller brandon-b-miller requested a review from a team as a code owner January 10, 2023 16:19
@brandon-b-miller brandon-b-miller self-assigned this Jan 10, 2023
@bdice
Copy link
Contributor

bdice commented Jan 10, 2023

@brandon-b-miller This seems closely related to my (long-stalled due to GHA migration) work on #11609. We may have some broader fixes to make for equality and inequality operators in particular that do not affect other operators, because those operators are originally defined by object and are inherited, making overrides act in slightly atypical ways. I’ll investigate the overlap when I review this!

@brandon-b-miller
Copy link
Contributor Author

@brandon-b-miller This seems closely related to my (long-stalled due to GHA migration) work on #11609. We may have some broader fixes to make for equality and inequality operators in particular. I’ll investigate the overlap when I review this!

Ah you're right! Indeed, I have not added test cases for ops that should raise. I like the approach in that PR for placing the logic for those error cases up in the mixin. My initial reaction was going to be to throw in _binaryop around where this PR is already touching code, but that's not nearly as general/reusable. It might make sense to leave the logic for the cases that should return something here though.

@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

❗ No coverage uploaded for pull request base (branch-23.04@f0a17ca). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 7172f04 differs from pull request most recent head 06cf9af. Consider uploading reports for the commit 06cf9af to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-23.04   #12519   +/-   ##
===============================================
  Coverage                ?   85.81%           
===============================================
  Files                   ?      158           
  Lines                   ?    25162           
  Branches                ?        0           
===============================================
  Hits                    ?    21592           
  Misses                  ?     3570           
  Partials                ?        0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@brandon-b-miller brandon-b-miller changed the title Fix str dtype Series comparison vs scalars Fix Series comparison vs scalars Jan 11, 2023
@brandon-b-miller
Copy link
Contributor Author

Discovered a similar issue comparing series of datetime dtype to strings such as 20110101, rolling the change into this PR. This piece of the change still needs tests.

@brandon-b-miller brandon-b-miller requested review from a team as code owners February 2, 2023 15:14
@brandon-b-miller brandon-b-miller requested review from karthikeyann and removed request for a team February 2, 2023 15:14
@brandon-b-miller brandon-b-miller changed the base branch from branch-23.02 to branch-23.04 February 2, 2023 15:14
@brandon-b-miller brandon-b-miller removed the request for review from a team February 2, 2023 15:15
@brandon-b-miller brandon-b-miller removed request for a team and karthikeyann February 2, 2023 15:15
Comment on lines 5707 to 5711
elif op == "__ne__":
val = True
return column.full(len(self), val, dtype="bool").set_mask(
self.mask
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif op == "__ne__":
val = True
return column.full(len(self), val, dtype="bool").set_mask(
self.mask
)
elif op == "__ne__":
val = True
else:
raise TypeError(f"{op} between {self.dtype} and {other.dtype}") not supported")
return column.full(len(self), val, dtype="bool").set_mask(
self.mask
)

We might want to have an else block here to throw an error for rest of the ops like +, -, etc. Else we run into unbounded variable errors:

In [2]: s = cudf.Series([1, 2, 1, 2], dtype='str')

In [5]: s + '1'
Out[5]: 
0    11
1    21
2    11
3    21
dtype: object

In [6]: s + 1
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
Cell In[6], line 1
----> 1 s + 1

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2)
     10 def wrapper(self, *args2, **kwargs2):
---> 11     return method(self, *args1, *args2, **kwargs1, **kwargs2)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/indexed_frame.py:3282, in IndexedFrame._binaryop(self, other, op, fill_value, can_reindex, *args, **kwargs)
   3278 if operands is NotImplemented:
   3279     return NotImplemented
   3281 return self._from_data(
-> 3282     ColumnAccessor(type(self)._colwise_binop(operands, op)),
   3283     index=out_index,
   3284 )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/frame.py:1728, in Frame._colwise_binop(cls, operands, fn)
   1720         assert False, "At least one operand must be a column."
   1722 # TODO: Disable logical and binary operators between columns that
   1723 # are not numerical using the new binops mixin.
   1725 outcol = (
   1726     getattr(operator, fn)(right_column, left_column)
   1727     if reflect
-> 1728     else getattr(operator, fn)(left_column, right_column)
   1729 )
   1731 if output_mask is not None:
   1732     outcol = outcol.set_mask(output_mask)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2)
     10 def wrapper(self, *args2, **kwargs2):
---> 11     return method(self, *args1, *args2, **kwargs1, **kwargs2)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/string.py:5709, in StringColumn._binaryop(self, other, op)
   5707     elif op == "__ne__":
   5708         val = True
-> 5709     return column.full(len(self), val, dtype="bool").set_mask(
   5710         self.mask
   5711     )
   5713 if op == "__add__":
   5714     if isinstance(other, cudf.Scalar):

UnboundLocalError: local variable 'val' referenced before assignment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted this

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I failed to hit go on my review. I think I have one query around the handling of ordering comparisons between string and non-string that this introduces. Along with a request to refact to the test fixtures a little.

Comment on lines 5707 to 5710
"__lt__",
"__le__",
"__gt__",
"__ge__",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems wrong. Python, and pandas raise TypeError for ordering between str and not-str:

In [5]: s = pd.Series(["1", "2", "3"])

In [6]: s < 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 s < 1

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     68             return NotImplemented
     70 other = item_from_zerodim(other)
---> 72 return method(self, other)

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/arraylike.py:50, in OpsMixin.__lt__(self, other)
     48 @unpack_zerodim_and_defer("__lt__")
     49 def __lt__(self, other):
---> 50     return self._cmp_method(other, operator.lt)

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/series.py:6243, in Series._cmp_method(self, other, op)
   6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
   6242 with np.errstate(all="ignore"):
-> 6243     res_values = ops.comparison_op(lvalues, rvalues, op)
   6245 return self._construct_result(res_values, name=res_name)

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:287, in comparison_op(left, right, op)
    284     return invalid_comparison(lvalues, rvalues, op)
    286 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str):
--> 287     res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
    289 else:
    290     res_values = _na_arithmetic_op(lvalues, rvalues, op, is_cmp=True)

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:75, in comp_method_OBJECT_ARRAY(op, x, y)
     73     result = libops.vec_compare(x.ravel(), y.ravel(), op)
     74 else:
---> 75     result = libops.scalar_compare(x.ravel(), y, op)
     76 return result.reshape(x.shape)

File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/_libs/ops.pyx:107, in pandas._libs.ops.scalar_compare()

TypeError: '<' not supported between instances of 'str' and 'int'

Which I think is achievable by return NotImplemented instead?

Comment on lines 5713 to 5718
val = False
if op == "__ne__":
val = True
return column.full(len(self), val, dtype="bool").set_mask(
self.mask
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
val = False
if op == "__ne__":
val = True
return column.full(len(self), val, dtype="bool").set_mask(
self.mask
)
return column.full(len(self), op == "__ne__", dtype="bool").set_mask(
self.mask
)

for case in eq_neq_cases:
cases += [(*case, operator.eq), (*case, operator.ne)]

return cases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a fixture to generate these cases, something like:

@pytest.fixture
def string_series_compare():
    return pd.Series(...)

@pytest.fixture(params=[False, True], ids=["normal", "reflected"])
def string_series_compare_reflect(request):
    return request.param

@pytest.fixture(params=[1, 1.5, True], ids=["int", "float", "bool"])
def string_series_compare_other(request):
    return request.param

@pytest.fixture(params=_cmpops):
def cmpop(request):
    return request.param

def test_string_series_compare(string_series_compare, string_series_compare_reflect, string_series_compare_other, cmpop):
    assert whatever

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help me understand a bit more why this approach is better? I've worked it up locally a few different ways and it seems like this forces some kind of programmatic skipping inside the test body since we're not dealing with a cartesian product of the parameterization, whereas the current approach just generates the exact tests we need. Is it the test output that you're concerned about here? (I think this would show up as test_string_series_compare[data0] on the command line for instance as opposed to an expanded output listing the actual parameters)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're trying to move away from a setup where test collection creates heavyweight objects (e.g. the series here).

I see that you don't actually need the cartesian product. Does it work to separate the eq/neq cases from the other?

You can then have two test functions that handle the general case and then the eq/neq case taking different parameters. There's some discussion here https://docs.rapids.ai/api/cudf/nightly/developer_guide/testing.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the context. An attempt at 7172f04

@wence-
Copy link
Contributor

wence- commented Feb 9, 2023

Can you merge trunk again (to pick up #12723), I think that will fix the java test fails.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@brandon-b-miller
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c4a1389 into rapidsai:branch-23.04 Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants