Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

Closed
wence- opened this issue Nov 4, 2022 · 0 comments · Fixed by #12074
Closed

[BUG] Incorrect broadcasted equality for scalar comparison to int64 columns #12072

wence- opened this issue Nov 4, 2022 · 0 comments · Fixed by #12074
Assignees
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.

Comments

@wence-
Copy link
Contributor

wence- commented Nov 4, 2022

Describe the bug

Equality of a NumericalColumn with large int64 values against another large int64 value can wrongly advertise false positives.

Steps/Code to reproduce bug

import cudf
import numpy as np

s = cudf.Series([2**63 - 10, 2**63 - 100], dtype=np.int64)
print(s == np.int64(2**63 - 1))
# 0    True
# 1    True
# dtype: bool

Expected behavior

Equality for int types should be correct.

Why does this happen?

Before heading into libcudf we need to turn the scalar host value into a device scalar, for which a dtype must be deduced. This is done in NumericalColumn.normalize_binop_value.

The relevant code is:

        other_dtype = np.min_scalar_type(other)
        if other_dtype.kind in {"b", "i", "u", "f"}:
            if isinstance(other, cudf.Scalar):
                return other
            other_dtype = np.promote_types(self.dtype, other_dtype)

So first we inspect the value and ask numpy to come up with a dtype that can represent this value, then we promote the two types of the column and the deduced dtype to a common type, cast our target value to that and continue.

What goes wrong?

val = np.int64(2**63 - 10)
mintype = np.min_scalar_type(val) # => np.uint64 (oh no!)
otype = np.promote_types(np.int64, np.uint64) # => float64 (oh no!)

Numpy prefers unsigned over signed types min_scalar_type which is problematic, because if the column type is signed, then we think we can no longer represent the RHS value in the column type. So we upcast to float (lossy) and call into libcudf with binaryop(column, device_scalar_as_double(val)) and now we get standard C++ type promotion on the pointwise equality comparison.

Environment overview (please complete the following information)

cudf.__version__ => '22.12.00a+193.g991c86b13a'

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Nov 4, 2022
@wence- wence- self-assigned this Nov 4, 2022
@wence- wence- added the Python Affects Python cuDF API. label Nov 4, 2022
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress and removed Needs Triage Need team to review and classify labels Nov 14, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 16, 2022
The type normalisation applied before heading into libcudf previously had slightly unexpected consequences for large int64 values. If not providing a `cudf.Scalar`, a bare `int64` scalar would be cast to `uint64` and then normal numpy type promotion would unify to `float64`. This is lossy, since int64 to float64 is neither surjective nor injective.

To avoid this, try very hard to maintain the dtype of the object coming in, and match pandas behaviour by applying numpy type promotion rules via `numpy.result_type`. 

- Closes #5938
- Closes #7389
- Closes #12072
- Closes #12092

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #12074
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants