
[BUG] test_groupby_diff_row_mixed_numerics failed randomly with overflow in datetime subtraction #10214

Open
bdice opened this issue Feb 3, 2022 · 3 comments
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

bdice (Contributor) commented Feb 3, 2022

Describe the bug
The test test_groupby_diff_row_mixed_numerics failed randomly in CI. The randomly generated values resulted in overflow during subtraction in a datetime column, which caused pandas to fail (cudf may also fail, but that line had not yet run).

Traceback here
Error Message

OverflowError: Overflow in int64 addition

Stacktrace

nelem = 10, shift_perc = 0.5, direction = 1

    @pytest.mark.parametrize("nelem", [10, 50, 100, 1000])
    @pytest.mark.parametrize("shift_perc", [0.5, 1.0, 1.5])
    @pytest.mark.parametrize("direction", [1, -1])
    def test_groupby_diff_row_mixed_numerics(nelem, shift_perc, direction):
        t = rand_dataframe(
            dtypes_meta=[
                {"dtype": "int64", "null_frequency": 0, "cardinality": 10},
                {"dtype": "int64", "null_frequency": 0.4, "cardinality": 10},
                {"dtype": "float32", "null_frequency": 0.4, "cardinality": 10},
                {"dtype": "decimal64", "null_frequency": 0.4, "cardinality": 10},
                {
                    "dtype": "datetime64[ns]",
                    "null_frequency": 0.4,
                    "cardinality": 10,
                },
                {
                    "dtype": "timedelta64[ns]",
                    "null_frequency": 0.4,
                    "cardinality": 10,
                },
            ],
            rows=nelem,
            use_threads=False,
        )
        pdf = t.to_pandas()
        gdf = cudf.from_pandas(pdf)
        n_shift = int(nelem * shift_perc) * direction
    
>       expected = pdf.groupby(["0"]).diff(periods=n_shift)

cudf/tests/test_groupby.py:1960: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:948: in wrapper
    return self._python_apply_general(curried, self._obj_with_exclusions)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:1309: in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/ops.py:852: in apply
    res = f(group)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/groupby/groupby.py:937: in curried
    return f(x, *args, **kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/frame.py:8435: in diff
    new_data = self._mgr.diff(n=periods, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:374: in diff
    return self.apply("diff", n=n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/managers.py:327: in apply
    applied = getattr(b, f)(**kwargs)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/internals/blocks.py:1754: in diff
    new_values = values - values.shift(n, axis=axis)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/ops/common.py:69: in new_method
    return method(self, other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimelike.py:1336: in __sub__
    result = self._sub_datetime_arraylike(other)
/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/arrays/datetimes.py:721: in _sub_datetime_arraylike
    new_values = checked_add_with_arr(self_i8, -other_i8, arr_mask=arr_mask)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = array([[ 8466425224569324595, -9223372036854775808,  1593875653891806295,
        -9223372036854775808, -9223372036854775808, -8738661699754510466]])
b = array([[-9223372036854775808, -9223372036854775808, -9223372036854775808,
        -9223372036854775808, -9223372036854775808, -8466425224569324595]])
arr_mask = array([[ True,  True,  True,  True,  True, False]]), b_mask = None

    def checked_add_with_arr(
        arr: np.ndarray,
        b,
        arr_mask: np.ndarray | None = None,
        b_mask: np.ndarray | None = None,
    ) -> np.ndarray:
        """
        Perform array addition that checks for underflow and overflow.
    
        Performs the addition of an int64 array and an int64 integer (or array)
        but checks that they do not result in overflow first. For elements that
        are indicated to be NaN, whether or not there is overflow for that element
        is automatically ignored.
    
        Parameters
        ----------
        arr : array addend.
        b : array or scalar addend.
        arr_mask : np.ndarray[bool] or None, default None
            array indicating which elements to exclude from checking
        b_mask : np.ndarray[bool] or None, default None
            array or scalar indicating which element(s) to exclude from checking
    
        Returns
        -------
        sum : An array for elements x + b for each element x in arr if b is
              a scalar or an array for elements x + y for each element pair
              (x, y) in (arr, b).
    
        Raises
        ------
        OverflowError if any x + y exceeds the maximum or minimum int64 value.
        """
        # For performance reasons, we broadcast 'b' to the new array 'b2'
        # so that it has the same size as 'arr'.
        b2 = np.broadcast_to(b, arr.shape)
        if b_mask is not None:
            # We do the same broadcasting for b_mask as well.
            b2_mask = np.broadcast_to(b_mask, arr.shape)
        else:
            b2_mask = None
    
        # For elements that are NaN, regardless of their value, we should
        # ignore whether they overflow or not when doing the checked add.
        if arr_mask is not None and b2_mask is not None:
            not_nan = np.logical_not(arr_mask | b2_mask)
        elif arr_mask is not None:
            not_nan = np.logical_not(arr_mask)
        elif b_mask is not None:
            not_nan = np.logical_not(b2_mask)
        else:
            not_nan = np.empty(arr.shape, dtype=bool)
            not_nan.fill(True)
    
        # gh-14324: For each element in 'arr' and its corresponding element
        # in 'b2', we check the sign of the element in 'b2'. If it is positive,
        # we then check whether its sum with the element in 'arr' exceeds
        # np.iinfo(np.int64).max. If so, we have an overflow error. If it
        # it is negative, we then check whether its sum with the element in
        # 'arr' exceeds np.iinfo(np.int64).min. If so, we have an overflow
        # error as well.
        i8max = lib.i8max
        i8min = iNaT
    
        mask1 = b2 > 0
        mask2 = b2 < 0
    
        if not mask1.any():
            to_raise = ((i8min - b2 > arr) & not_nan).any()
        elif not mask2.any():
            to_raise = ((i8max - b2 < arr) & not_nan).any()
        else:
            to_raise = ((i8max - b2[mask1] < arr[mask1]) & not_nan[mask1]).any() or (
                (i8min - b2[mask2] > arr[mask2]) & not_nan[mask2]
            ).any()
    
        if to_raise:
>           raise OverflowError("Overflow in int64 addition")
E           OverflowError: Overflow in int64 addition

/opt/conda/envs/rapids/lib/python3.9/site-packages/pandas/core/algorithms.py:1112: OverflowError

CI log: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.9/6540/testReport/junit/cudf.tests/test_groupby/test_groupby_diff_row_mixed_numerics_1_0_5_10_/

Failing line:

expected = pdf.groupby(["0"]).diff(periods=n_shift)

Steps/Code to reproduce bug
Failures are random, based on the seed.

Here is a minimal reproduction in pure pandas:

import numpy as np
import pandas as pd
min_timedelta = np.timedelta64(-(2**63) + 1)
max_timedelta = np.timedelta64(2**63 - 1)

df = pd.DataFrame({"id": [0, 0], "data": [min_timedelta, max_timedelta]})

# This fails:
df.diff()
# [traceback...] OverflowError: Overflow in int64 addition

# This also fails:
df.groupby("id").diff()
# [traceback...] OverflowError: Overflow in int64 addition

# Weirdly, the diff succeeds on just the Series:
df["data"].diff()
# 0                           NaT
# 1   -1 days +23:59:59.999999998
# Name: data, dtype: timedelta64[ns]
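
For context, here is the arithmetic behind those two results (a rough illustration I'm adding here, not output from the failing test): the DataFrame path goes through the checked_add_with_arr guard shown in the traceback above and raises, while the bare Series.diff path appears to wrap silently.

import numpy as np
import pandas as pd

# The true difference between the two rows is 2**64 - 2 nanoseconds,
# which cannot be represented in int64.
span = (2**63 - 1) - (-(2**63) + 1)         # plain Python ints, no wraparound
print(span, span > np.iinfo(np.int64).max)  # 18446744073709551614 True

# Reinterpreted as int64 the difference wraps to -2 ns, which matches the
# value the plain Series.diff call returns above.
wrapped = np.array([span], dtype=np.uint64).astype(np.int64)[0]
print(pd.Timedelta(int(wrapped), "ns"))     # -1 days +23:59:59.999999998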

Here's how cudf currently handles those cases:

>>> gdf = cudf.from_pandas(df)
>>> gdf.diff()  # Implemented in #9817, but not yet available in branch-22.04
...
AttributeError: DataFrame object has no attribute diff
>>> gdf.groupby("id").diff()  # Silently overflows, but represents time differently than pandas
                         data
0                        <NA>
1  -0 days 00:00:00.000000002
>>> gdf["data"].diff()  # Calls through to numba?
...
ValueError: Cannot determine Numba type of <class 'cudf.core.column.timedelta.TimeDeltaColumn'>
...
NotImplementedError: dtype timedelta64[ns] is not yet supported via `__cuda_array_interface__`
>>> 

Expected behavior
This test should not randomly fail. We should fix a seed and/or choose bounds on the data that prevent overflow.
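
As a sketch of the bounding approach (illustration only, written in plain numpy/pandas rather than through rand_dataframe, whose knobs for this aren't shown here): keeping generated nanosecond values within +/- (2**62 - 1) guarantees that any pairwise difference fits in int64, so the groupby diff cannot overflow.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed -> reproducible data

# Bound generated nanosecond values to +/- (2**62 - 1) so that any pairwise
# difference is at most 2**63 - 2, which still fits in int64.
bound = 2**62 - 1
ns = rng.integers(-bound, bound, size=10, endpoint=True)
df = pd.DataFrame({
    "id": rng.integers(0, 3, size=10),
    "data": ns.astype("timedelta64[ns]"),
})

# With bounded inputs the groupby diff no longer overflows.
print(df.groupby("id").diff())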

Additional context
Found while reviewing PR #10143.

bdice added the bug (Something isn't working) and Python (Affects Python cuDF API.) labels on Feb 3, 2022
github-actions bot commented Mar 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

vyasr (Contributor) commented May 17, 2024

The underlying issue here is that rand_dataframe seeds itself by producing a random number. That should be fixed to either require a user-provided seed, or use a sensible default like 0 to produce reproducible results.
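
A sketch of that seeding change (hypothetical, simplified signature; the real rand_dataframe lives in cudf's testing utilities and takes more arguments than shown here):

import numpy as np
import pandas as pd

def rand_dataframe(rows, seed=0):
    # Hypothetical simplified signature: an explicit, defaulted seed makes
    # every call reproducible; callers could still pass seed=None for entropy.
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "0": rng.integers(0, 10, size=rows),
        "1": rng.integers(0, 10, size=rows).astype("timedelta64[ns]"),
    })

# Two calls with the same (default) seed produce identical frames, so a CI
# failure can be re-run locally with the same data.
assert rand_dataframe(10).equals(rand_dataframe(10))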
