You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
The test test_groupby_diff_row_mixed_numerics failed randomly in CI. The randomly generated values resulted in overflow during subtraction in a datetime column, which caused pandas to fail (cudf may also fail, but that line had not yet run).
Traceback here
Error Message
OverflowError: Overflow in int64 addition
Stacktrace
nelem = 10, shift_perc = 0.5, direction = 1
@pytest.mark.parametrize("nelem", [10, 50, 100, 1000])
@pytest.mark.parametrize("shift_perc", [0.5, 1.0, 1.5])
@pytest.mark.parametrize("direction", [1, -1])
def test_groupby_diff_row_mixed_numerics(nelem, shift_perc, direction):
t = rand_dataframe(
dtypes_meta=[
{"dtype": "int64", "null_frequency": 0, "cardinality": 10},
{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10},
{"dtype": "float32", "null_frequency": 0.4, "cardinality": 10},
{"dtype": "decimal64", "null_frequency": 0.4, "cardinality": 10},
{
"dtype": "datetime64[ns]",
"null_frequency": 0.4,
"cardinality": 10,
},
{
"dtype": "timedelta64[ns]",
"null_frequency": 0.4,
"cardinality": 10,
},
],
rows=nelem,
use_threads=False,
)
pdf = t.to_pandas()
gdf = cudf.from_pandas(pdf)
n_shift = int(nelem * shift_perc) * direction
> expected = pdf.groupby(["0"]).diff(periods=n_shift)
cudf/tests/test_groupby.py:1960:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:948: in wrapper
return self._python_apply_general(curried, self._obj_with_exclusions)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:1309: in _python_apply_general
keys, values, mutated = self.grouper.apply(f, data, self.axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/ops.py:852: in apply
res = f(group)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:937: in curried
return f(x, *args, **kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/frame.py:8435: in diff
new_data = self._mgr.diff(n=periods, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:374: in diff
return self.apply("diff", n=n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:327: in apply
applied = getattr(b, f)(**kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/blocks.py:1754: in diff
new_values = values - values.shift(n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/ops/common.py:69: in new_method
return method(self, other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimelike.py:1336: in __sub__
result = self._sub_datetime_arraylike(other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:721: in _sub_datetime_arraylike
new_values = checked_add_with_arr(self_i8, -other_i8, arr_mask=arr_mask)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arr = array([[ 8466425224569324595, -9223372036854775808, 1593875653891806295,
-9223372036854775808, -9223372036854775808, -8738661699754510466]])
b = array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
-9223372036854775808, -9223372036854775808, -8466425224569324595]])
arr_mask = array([[ True, True, True, True, True, False]]), b_mask = None
def checked_add_with_arr(
arr: np.ndarray,
b,
arr_mask: np.ndarray | None = None,
b_mask: np.ndarray | None = None,
) -> np.ndarray:
"""
Perform array addition that checks for underflow and overflow.
Performs the addition of an int64 array and an int64 integer (or array)
but checks that they do not result in overflow first. For elements that
are indicated to be NaN, whether or not there is overflow for that element
is automatically ignored.
Parameters
----------
arr : array addend.
b : array or scalar addend.
arr_mask : np.ndarray[bool] or None, default None
array indicating which elements to exclude from checking
b_mask : np.ndarray[bool] or None, default None
array or scalar indicating which element(s) to exclude from checking
Returns
-------
sum : An array for elements x + b for each element x in arr if b is
a scalar or an array for elements x + y for each element pair
(x, y) in (arr, b).
Raises
------
OverflowError if any x + y exceeds the maximum or minimum int64 value.
"""
# For performance reasons, we broadcast 'b' to the new array 'b2'
# so that it has the same size as 'arr'.
b2 = np.broadcast_to(b, arr.shape)
if b_mask is not None:
# We do the same broadcasting for b_mask as well.
b2_mask = np.broadcast_to(b_mask, arr.shape)
else:
b2_mask = None
# For elements that are NaN, regardless of their value, we should
# ignore whether they overflow or not when doing the checked add.
if arr_mask is not None and b2_mask is not None:
not_nan = np.logical_not(arr_mask | b2_mask)
elif arr_mask is not None:
not_nan = np.logical_not(arr_mask)
elif b_mask is not None:
not_nan = np.logical_not(b2_mask)
else:
not_nan = np.empty(arr.shape, dtype=bool)
not_nan.fill(True)
# gh-14324: For each element in 'arr' and its corresponding element
# in 'b2', we check the sign of the element in 'b2'. If it is positive,
# we then check whether its sum with the element in 'arr' exceeds
# np.iinfo(np.int64).max. If so, we have an overflow error. If it
# it is negative, we then check whether its sum with the element in
# 'arr' exceeds np.iinfo(np.int64).min. If so, we have an overflow
# error as well.
i8max = lib.i8max
i8min = iNaT
mask1 = b2 > 0
mask2 = b2 < 0
if not mask1.any():
to_raise = ((i8min - b2 > arr) & not_nan).any()
elif not mask2.any():
to_raise = ((i8max - b2 < arr) & not_nan).any()
else:
to_raise = ((i8max - b2[mask1] < arr[mask1]) & not_nan[mask1]).any() or (
(i8min - b2[mask2] > arr[mask2]) & not_nan[mask2]
).any()
if to_raise:
> raise OverflowError("Overflow in int64 addition")
E OverflowError: Overflow in int64 addition
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/algorithms.py:1112: OverflowError
Steps/Code to reproduce bug
Failures are random, based on the seed.
Here is a minimal reproduction in pure pandas:
importnumpyasnpimportpandasaspdmin_timedelta=np.timedelta64(-(2**63) +1)
max_timedelta=np.timedelta64(2**63-1)
df=pd.DataFrame({"id": [0, 0], "data": [min_timedelta, max_timedelta]})
# This fails:df.diff()
# [traceback...] OverflowError: Overflow in int64 addition# This also fails:df.groupby("id").diff()
# [traceback...] OverflowError: Overflow in int64 addition# Weirdly, the diff succeeds on just the Series:df["data"].diff()
# 0 NaT# 1 -1 days +23:59:59.999999998# Name: data, dtype: timedelta64[ns]
Here's how cudf currently handles those cases:
>>>gdf=cudf.from_pandas(df)
>>>gdf.diff() # Implemented in #9817, but not yet available in branch-22.04
...
AttributeError: DataFrameobjecthasnoattributediff>>>gdf.groupby("id").diff() # Silently overflows, but represents time differently than pandasdata0<NA>1-0days00:00:00.000000002>>>gdf["data"].diff() # Calls through to numba?
...
ValueError: CannotdetermineNumba type of<class'cudf.core.column.timedelta.TimeDeltaColumn'>
...
NotImplementedError: dtypetimedelta64[ns] isnotyetsupportedvia`__cuda_array_interface__`>>>
Expected behavior
This test should not randomly fail. We should fix a seed and/or choose bounds on the data that prevent overflow.
Additional context
Found while reviewing PR #10143.
The text was updated successfully, but these errors were encountered:
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.
This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
Describe the bug
The test
test_groupby_diff_row_mixed_numerics
failed randomly in CI. The randomly generated values resulted in overflow during subtraction in a datetime column, which caused pandas to fail (cudf may also fail, but that line had not yet run).Traceback here
CI log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.9/6540/testReport/junit/cudf.tests/test_groupby/test_groupby_diff_row_mixed_numerics_1_0_5_10_/
Failing line:
cudf/python/cudf/cudf/tests/test_groupby.py
Line 1960 in 0581975
Steps/Code to reproduce bug
Failures are random, based on the seed.
Here is a minimal reproduction in pure pandas:
Here's how cudf currently handles those cases:
Expected behavior
This test should not randomly fail. We should fix a seed and/or choose bounds on the data that prevent overflow.
Additional context
Found while reviewing PR #10143.
The text was updated successfully, but these errors were encountered: