[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

wence- · 2022-11-04T15:13:13Z

Describe the bug

Equality of a NumericalColumn with large int64 values against another large int64 value can wrongly advertise false positives.

Steps/Code to reproduce bug

import cudf
import numpy as np

s = cudf.Series([2**63 - 10, 2**63 - 100], dtype=np.int64)
print(s == np.int64(2**63 - 1))
# 0    True
# 1    True
# dtype: bool

Expected behavior

Equality for int types should be correct.

Why does this happen?

Before heading into libcudf we need to turn the scalar host value into a device scalar, for which a dtype must be deduced. This is done in NumericalColumn.normalize_binop_value.

The relevant code is:

        other_dtype = np.min_scalar_type(other)
        if other_dtype.kind in {"b", "i", "u", "f"}:
            if isinstance(other, cudf.Scalar):
                return other
            other_dtype = np.promote_types(self.dtype, other_dtype)

So first we inspect the value and ask numpy to come up with a dtype that can represent this value, then we promote the two types of the column and the deduced dtype to a common type, cast our target value to that and continue.

What goes wrong?

val = np.int64(2**63 - 10)
mintype = np.min_scalar_type(val) # => np.uint64 (oh no!)
otype = np.promote_types(np.int64, np.uint64) # => float64 (oh no!)

Numpy prefers unsigned over signed types min_scalar_type which is problematic, because if the column type is signed, then we think we can no longer represent the RHS value in the column type. So we upcast to float (lossy) and call into libcudf with binaryop(column, device_scalar_as_double(val)) and now we get standard C++ type promotion on the pointwise equality comparison.

Environment overview (please complete the following information)

cudf.__version__ => '22.12.00a+193.g991c86b13a'

The text was updated successfully, but these errors were encountered:

The type normalisation applied before heading into libcudf previously had slightly unexpected consequences for large int64 values. If not providing a `cudf.Scalar`, a bare `int64` scalar would be cast to `uint64` and then normal numpy type promotion would unify to `float64`. This is lossy, since int64 to float64 is neither surjective nor injective. To avoid this, try very hard to maintain the dtype of the object coming in, and match pandas behaviour by applying numpy type promotion rules via `numpy.result_type`. - Closes #5938 - Closes #7389 - Closes #12072 - Closes #12092 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #12074

wence- added bug Something isn't working Needs Triage Need team to review and classify labels Nov 4, 2022

wence- self-assigned this Nov 4, 2022

wence- added the Python Affects Python cuDF API. label Nov 4, 2022

wence- mentioned this issue Nov 4, 2022

[ENH/QST]: Behaviour of type promotion in __setitem__ #12039

Open

6 tasks

wence- mentioned this issue Nov 4, 2022

Fix type promotion edge cases in numerical binops #12074

Merged

3 tasks

GregoryKimball added 2 - In Progress Currently a work in progress and removed Needs Triage Need team to review and classify labels Nov 14, 2022

rapids-bot bot closed this as completed in #12074 Nov 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

wence- commented Nov 4, 2022 •

edited

Loading

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

Comments

wence- commented Nov 4, 2022 • edited Loading

wence- commented Nov 4, 2022 •

edited

Loading