-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Series
comparison vs scalars
#12519
Fix Series
comparison vs scalars
#12519
Conversation
@brandon-b-miller This seems closely related to my (long-stalled due to GHA migration) work on #11609. We may have some broader fixes to make for equality and inequality operators in particular that do not affect other operators, because those operators are originally defined by |
Ah you're right! Indeed, I have not added test cases for ops that should raise. I like the approach in that PR for placing the logic for those error cases up in the mixin. My initial reaction was going to be to throw in |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## branch-23.04 #12519 +/- ##
===============================================
Coverage ? 85.81%
===============================================
Files ? 158
Lines ? 25162
Branches ? 0
===============================================
Hits ? 21592
Misses ? 3570
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
str
dtype Series
comparison vs scalarsSeries
comparison vs scalars
Discovered a similar issue comparing series of datetime dtype to strings such as |
elif op == "__ne__": | ||
val = True | ||
return column.full(len(self), val, dtype="bool").set_mask( | ||
self.mask | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif op == "__ne__": | |
val = True | |
return column.full(len(self), val, dtype="bool").set_mask( | |
self.mask | |
) | |
elif op == "__ne__": | |
val = True | |
else: | |
raise TypeError(f"{op} between {self.dtype} and {other.dtype}") not supported") | |
return column.full(len(self), val, dtype="bool").set_mask( | |
self.mask | |
) |
We might want to have an else block here to throw an error for rest of the ops like +
, -
, etc. Else we run into unbounded variable errors:
In [2]: s = cudf.Series([1, 2, 1, 2], dtype='str')
In [5]: s + '1'
Out[5]:
0 11
1 21
2 11
3 21
dtype: object
In [6]: s + 1
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
Cell In[6], line 1
----> 1 s + 1
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2)
10 def wrapper(self, *args2, **kwargs2):
---> 11 return method(self, *args1, *args2, **kwargs1, **kwargs2)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/indexed_frame.py:3282, in IndexedFrame._binaryop(self, other, op, fill_value, can_reindex, *args, **kwargs)
3278 if operands is NotImplemented:
3279 return NotImplemented
3281 return self._from_data(
-> 3282 ColumnAccessor(type(self)._colwise_binop(operands, op)),
3283 index=out_index,
3284 )
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
76 @wraps(func)
77 def inner(*args, **kwds):
78 with self._recreate_cm():
---> 79 return func(*args, **kwds)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/frame.py:1728, in Frame._colwise_binop(cls, operands, fn)
1720 assert False, "At least one operand must be a column."
1722 # TODO: Disable logical and binary operators between columns that
1723 # are not numerical using the new binops mixin.
1725 outcol = (
1726 getattr(operator, fn)(right_column, left_column)
1727 if reflect
-> 1728 else getattr(operator, fn)(left_column, right_column)
1729 )
1731 if output_mask is not None:
1732 outcol = outcol.set_mask(output_mask)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/mixins/mixin_factory.py:11, in _partialmethod.<locals>.wrapper(self, *args2, **kwargs2)
10 def wrapper(self, *args2, **kwargs2):
---> 11 return method(self, *args1, *args2, **kwargs1, **kwargs2)
File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/core/column/string.py:5709, in StringColumn._binaryop(self, other, op)
5707 elif op == "__ne__":
5708 val = True
-> 5709 return column.full(len(self), val, dtype="bool").set_mask(
5710 self.mask
5711 )
5713 if op == "__add__":
5714 if isinstance(other, cudf.Scalar):
UnboundLocalError: local variable 'val' referenced before assignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I failed to hit go on my review. I think I have one query around the handling of ordering comparisons between string and non-string that this introduces. Along with a request to refact to the test fixtures a little.
"__lt__", | ||
"__le__", | ||
"__gt__", | ||
"__ge__", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems wrong. Python, and pandas raise TypeError
for ordering between str and not-str:
In [5]: s = pd.Series(["1", "2", "3"])
In [6]: s < 1
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 s < 1
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/common.py:72, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
68 return NotImplemented
70 other = item_from_zerodim(other)
---> 72 return method(self, other)
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/arraylike.py:50, in OpsMixin.__lt__(self, other)
48 @unpack_zerodim_and_defer("__lt__")
49 def __lt__(self, other):
---> 50 return self._cmp_method(other, operator.lt)
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/series.py:6243, in Series._cmp_method(self, other, op)
6240 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
6242 with np.errstate(all="ignore"):
-> 6243 res_values = ops.comparison_op(lvalues, rvalues, op)
6245 return self._construct_result(res_values, name=res_name)
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:287, in comparison_op(left, right, op)
284 return invalid_comparison(lvalues, rvalues, op)
286 elif is_object_dtype(lvalues.dtype) or isinstance(rvalues, str):
--> 287 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
289 else:
290 res_values = _na_arithmetic_op(lvalues, rvalues, op, is_cmp=True)
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/core/ops/array_ops.py:75, in comp_method_OBJECT_ARRAY(op, x, y)
73 result = libops.vec_compare(x.ravel(), y.ravel(), op)
74 else:
---> 75 result = libops.scalar_compare(x.ravel(), y, op)
76 return result.reshape(x.shape)
File ~/compose/etc/conda/cuda_11.8/envs/rapids/lib/python3.8/site-packages/pandas/_libs/ops.pyx:107, in pandas._libs.ops.scalar_compare()
TypeError: '<' not supported between instances of 'str' and 'int'
Which I think is achievable by return NotImplemented
instead?
val = False | ||
if op == "__ne__": | ||
val = True | ||
return column.full(len(self), val, dtype="bool").set_mask( | ||
self.mask | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val = False | |
if op == "__ne__": | |
val = True | |
return column.full(len(self), val, dtype="bool").set_mask( | |
self.mask | |
) | |
return column.full(len(self), op == "__ne__", dtype="bool").set_mask( | |
self.mask | |
) |
for case in eq_neq_cases: | ||
cases += [(*case, operator.eq), (*case, operator.ne)] | ||
|
||
return cases |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use a fixture to generate these cases, something like:
@pytest.fixture
def string_series_compare():
return pd.Series(...)
@pytest.fixture(params=[False, True], ids=["normal", "reflected"])
def string_series_compare_reflect(request):
return request.param
@pytest.fixture(params=[1, 1.5, True], ids=["int", "float", "bool"])
def string_series_compare_other(request):
return request.param
@pytest.fixture(params=_cmpops):
def cmpop(request):
return request.param
def test_string_series_compare(string_series_compare, string_series_compare_reflect, string_series_compare_other, cmpop):
assert whatever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help me understand a bit more why this approach is better? I've worked it up locally a few different ways and it seems like this forces some kind of programmatic skipping inside the test body since we're not dealing with a cartesian product of the parameterization, whereas the current approach just generates the exact tests we need. Is it the test output that you're concerned about here? (I think this would show up as test_string_series_compare[data0]
on the command line for instance as opposed to an expanded output listing the actual parameters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to move away from a setup where test collection creates heavyweight objects (e.g. the series here).
I see that you don't actually need the cartesian product. Does it work to separate the eq/neq cases from the other?
You can then have two test functions that handle the general case and then the eq/neq case taking different parameters. There's some discussion here https://docs.rapids.ai/api/cudf/nightly/developer_guide/testing.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the context. An attempt at 7172f04
Can you merge trunk again (to pick up #12723), I think that will fix the java test fails. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
/merge |
Fixes an issue where this happens:
It turns out this happens because
StringColumn
'snormalize_binop_value
method returnsNotImplemented
for scalars that are not of dtypeobject
. This eventually causes python to dispatch to the python scalar class'__eq__
which returns the scalarFalse
when encountering a cuDF object. cuDF expects a column object at this point but has a scalar.This in turn causes cuDF to try and construct a
ColumnAccessor
around a dict that looks like{'name', False}
ultimately throwing the error.This PR proposes to earlystop this behavior according to the rules for comparing python string scalars with other objects:
False
for__eq__
even if the character in the string is equivalent to whatever is being comparedTrue
for__ne__
ditto above.This should align us with pandas behavior for this case:
EDIT:
Updating this PR to handle a similar issue resulting in the same error when comparing datetime series to strings that contain valid datetimes, such as
20110101
.