
[BUG] test_groupby_diff_row_mixed_numerics failed randomly with overflow in datetime subtraction #10214

Open
bdice opened this issue Feb 3, 2022 · 3 comments
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

bdice (Contributor) commented Feb 3, 2022

Describe the bug
The test test_groupby_diff_row_mixed_numerics failed randomly in CI. The randomly generated values resulted in overflow during subtraction in a datetime column, which caused pandas to fail (cudf may also fail, but that line had not yet run).

Traceback here
Error Message

OverflowError: Overflow in int64 addition

Stacktrace

nelem = 10, shift_perc = 0.5, direction = 1

    @pytest.mark.parametrize("nelem", [10, 50, 100, 1000])
    @pytest.mark.parametrize("shift_perc", [0.5, 1.0, 1.5])
    @pytest.mark.parametrize("direction", [1, -1])
    def test_groupby_diff_row_mixed_numerics(nelem, shift_perc, direction):
        t = rand_dataframe(
            dtypes_meta=[
                {"dtype": "int64", "null_frequency": 0, "cardinality": 10},
                {"dtype": "int64", "null_frequency": 0.4, "cardinality": 10},
                {"dtype": "float32", "null_frequency": 0.4, "cardinality": 10},
                {"dtype": "decimal64", "null_frequency": 0.4, "cardinality": 10},
                {
                    "dtype": "datetime64[ns]",
                    "null_frequency": 0.4,
                    "cardinality": 10,
                },
                {
                    "dtype": "timedelta64[ns]",
                    "null_frequency": 0.4,
                    "cardinality": 10,
                },
            ],
            rows=nelem,
            use_threads=False,
        )
        pdf = t.to_pandas()
        gdf = cudf.from_pandas(pdf)
        n_shift = int(nelem * shift_perc) * direction
    
>       expected = pdf.groupby(["0"]).diff(periods=n_shift)

cudf/tests/test_groupby.py:1960: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:948: in wrapper
    return self._python_apply_general(curried, self._obj_with_exclusions)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:1309: in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/ops.py:852: in apply
    res = f(group)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:937: in curried
    return f(x, *args, **kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/frame.py:8435: in diff
    new_data = self._mgr.diff(n=periods, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:374: in diff
    return self.apply("diff", n=n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:327: in apply
    applied = getattr(b, f)(**kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/blocks.py:1754: in diff
    new_values = values - values.shift(n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/ops/common.py:69: in new_method
    return method(self, other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimelike.py:1336: in __sub__
    result = self._sub_datetime_arraylike(other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:721: in _sub_datetime_arraylike
    new_values = checked_add_with_arr(self_i8, -other_i8, arr_mask=arr_mask)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = array([[ 8466425224569324595, -9223372036854775808,  1593875653891806295,
        -9223372036854775808, -9223372036854775808, -8738661699754510466]])
b = array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
        -9223372036854775808, -9223372036854775808, -8466425224569324595]])
arr_mask = array([[ True,  True,  True,  True,  True, False]]), b_mask = None

    def checked_add_with_arr(
        arr: np.ndarray,
        b,
        arr_mask: np.ndarray | None = None,
        b_mask: np.ndarray | None = None,
    ) -> np.ndarray:
        """
        Perform array addition that checks for underflow and overflow.
    
        Performs the addition of an int64 array and an int64 integer (or array)
        but checks that they do not result in overflow first. For elements that
        are indicated to be NaN, whether or not there is overflow for that element
        is automatically ignored.
    
        Parameters
        ----------
        arr : array addend.
        b : array or scalar addend.
        arr_mask : np.ndarray[bool] or None, default None
            array indicating which elements to exclude from checking
        b_mask : np.ndarray[bool] or None, default None
            array or scalar indicating which element(s) to exclude from checking
    
        Returns
        -------
        sum : An array for elements x + b for each element x in arr if b is
              a scalar or an array for elements x + y for each element pair
              (x, y) in (arr, b).
    
        Raises
        ------
        OverflowError if any x + y exceeds the maximum or minimum int64 value.
        """
        # For performance reasons, we broadcast 'b' to the new array 'b2'
        # so that it has the same size as 'arr'.
        b2 = np.broadcast_to(b, arr.shape)
        if b_mask is not None:
            # We do the same broadcasting for b_mask as well.
            b2_mask = np.broadcast_to(b_mask, arr.shape)
        else:
            b2_mask = None
    
        # For elements that are NaN, regardless of their value, we should
        # ignore whether they overflow or not when doing the checked add.
        if arr_mask is not None and b2_mask is not None:
            not_nan = np.logical_not(arr_mask | b2_mask)
        elif arr_mask is not None:
            not_nan = np.logical_not(arr_mask)
        elif b_mask is not None:
            not_nan = np.logical_not(b2_mask)
        else:
            not_nan = np.empty(arr.shape, dtype=bool)
            not_nan.fill(True)
    
        # gh-14324: For each element in 'arr' and its corresponding element
        # in 'b2', we check the sign of the element in 'b2'. If it is positive,
        # we then check whether its sum with the element in 'arr' exceeds
        # np.iinfo(np.int64).max. If so, we have an overflow error. If it
        # it is negative, we then check whether its sum with the element in
        # 'arr' exceeds np.iinfo(np.int64).min. If so, we have an overflow
        # error as well.
        i8max = lib.i8max
        i8min = iNaT
    
        mask1 = b2 > 0
        mask2 = b2 < 0
    
        if not mask1.any():
            to_raise = ((i8min - b2 > arr) & not_nan).any()
        elif not mask2.any():
            to_raise = ((i8max - b2 < arr) & not_nan).any()
        else:
            to_raise = ((i8max - b2[mask1] < arr[mask1]) & not_nan[mask1]).any() or (
                (i8min - b2[mask2] > arr[mask2]) & not_nan[mask2]
            ).any()
    
        if to_raise:
>           raise OverflowError("Overflow in int64 addition")
E           OverflowError: Overflow in int64 addition

/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/algorithms.py:1112: OverflowError

CI log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.9/6540/testReport/junit/cudf.tests/test_groupby/test_groupby_diff_row_mixed_numerics_1_0_5_10_/

Failing line:

expected = pdf.groupby(["0"]).diff(periods=n_shift)

Steps/Code to reproduce bug
Failures are random, based on the seed.

Here is a minimal reproduction in pure pandas:

import numpy as np
import pandas as pd
min_timedelta = np.timedelta64(-(2**63) + 1)
max_timedelta = np.timedelta64(2**63 - 1)

df = pd.DataFrame({"id": [0, 0], "data": [min_timedelta, max_timedelta]})

# This fails:
df.diff()
# [traceback...] OverflowError: Overflow in int64 addition

# This also fails:
df.groupby("id").diff()
# [traceback...] OverflowError: Overflow in int64 addition

# Weirdly, the diff succeeds on just the Series:
df["data"].diff()
# 0                           NaT
# 1   -1 days +23:59:59.999999998
# Name: data, dtype: timedelta64[ns]
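
For context, here is the arithmetic behind those two results (a rough illustration I'm adding here, not output from the failing test): the DataFrame path goes through the checked_add_with_arr guard shown in the traceback above and raises, while the bare Series.diff path appears to wrap silently.

import numpy as np
import pandas as pd

# The true difference between the two rows is 2**64 - 2 nanoseconds,
# which cannot be represented in int64.
span = (2**63 - 1) - (-(2**63) + 1)         # plain Python ints, no wraparound
print(span, span > np.iinfo(np.int64).max)  # 18446744073709551614 True

# Reinterpreted as int64 the difference wraps to -2 ns, which matches the
# value the plain Series.diff call returns above.
wrapped = np.array([span], dtype=np.uint64).astype(np.int64)[0]
print(pd.Timedelta(int(wrapped), "ns"))     # -1 days +23:59:59.999999998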

Here's how cudf currently handles those cases:

>>> gdf = cudf.from_pandas(df)
>>> gdf.diff()  # Implemented in #9817, but not yet available in branch-22.04
...
AttributeError: DataFrame object has no attribute diff
>>> gdf.groupby("id").diff()  # Silently overflows, but represents time differently than pandas
                         data
0                        <NA>
1  -0 days 00:00:00.000000002
>>> gdf["data"].diff()  # Calls through to numba?
...
ValueError: Cannot determine Numba type of <class 'cudf.core.column.timedelta.TimeDeltaColumn'>
...
NotImplementedError: dtype timedelta64[ns] is not yet supported via `__cuda_array_interface__`
>>> 

Expected behavior
This test should not randomly fail. We should fix a seed and/or choose bounds on the data that prevent overflow.
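
As a sketch of the bounding approach (illustration only, written in plain numpy/pandas rather than through rand_dataframe, whose knobs for this aren't shown here): keeping generated nanosecond values within +/- (2**62 - 1) guarantees that any pairwise difference fits in int64, so the groupby diff cannot overflow.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed -> reproducible data

# Bound generated nanosecond values to +/- (2**62 - 1) so that any pairwise
# difference is at most 2**63 - 2, which still fits in int64.
bound = 2**62 - 1
ns = rng.integers(-bound, bound, size=10, endpoint=True)
df = pd.DataFrame({
    "id": rng.integers(0, 3, size=10),
    "data": ns.astype("timedelta64[ns]"),
})

# With bounded inputs the groupby diff no longer overflows.
print(df.groupby("id").diff())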

Additional context
Found while reviewing PR #10143.

bdice added the bug (Something isn't working) and Python (Affects Python cuDF API.) labels on Feb 3, 2022
github-actions bot commented Mar 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr (Contributor) commented May 17, 2024

The underlying issue here is that rand_dataframe seeds itself by producing a random number. That should be fixed to either require a user-provided seed, or use a sensible default like 0 to produce reproducible results.
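
A sketch of that seeding change (hypothetical, simplified signature; the real rand_dataframe lives in cudf's testing utilities and takes more arguments than shown here):

import numpy as np
import pandas as pd

def rand_dataframe(rows, seed=0):
    # Hypothetical simplified signature: an explicit, defaulted seed makes
    # every call reproducible; callers could still pass seed=None for entropy.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "0": rng.integers(0, 10, size=rows),
        "1": rng.integers(0, 10, size=rows).astype("timedelta64[ns]"),
    })

# Two calls with the same (default) seed produce identical frames, so a CI
# failure can be re-run locally with the same data.
assert rand_dataframe(10).equals(rand_dataframe(10))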
