-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP,API: Enable full Numpy2 compat #15897
Conversation
This aligns with NumPy, which deprecated this since a while and raises an error now, for example for `Scalar(-1, dtype=np.uint8)`. Necessary to ensure we give the right errors when promotion doesn't up-cast based on values.
Note that pandas `where` seems to promote the Series based on the value even with NumPy 2. This was never copied by cudf (i.e. an outstanding issue)
Pandas keeps using weak promotion even for strongly typed "scalars" (i.e. 0-d objects). This tries to (mostly) match that, but there may be better ways to do it. I am having difficulty to think of the best way though.
raise TypeError( | ||
f"Cannot safely cast non-equivalent " | ||
f"{type(other).__name__} to {source_dtype.name}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type(other)
can fail now e.g. for int8(200)
, so need to avoid it.
if ( | ||
_is_non_decimal_numeric_dtype(source_dtype) | ||
and not other_is_scalar # can-cast fails for Python scalars | ||
and _can_cast(other, source_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemed safe/correct to me to always take the other path for scalars?
# Go via NumPy to get the value | ||
other = np.array(other) | ||
if other.dtype.kind in "ifc": | ||
other = other.item() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what to do best here. Pandas has a lot of isinstance(x, np.integer)
checks, I think. But to match pandas we need to accept 0-D arrays and probably even 0-D tensors.
However, it seemed like this path is only taken for arrays?
if (self.min() >= min_) and (self.max() <= max_): | ||
# Use Python floats, which have precise comparison for float64. | ||
# NOTE(seberg): it would make sense to limit to the mantissa range. | ||
if (float(self.min()) >= min_) and (float(self.max()) <= max_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to explain, NumPy 2 will compare this as float32, but the min_
/max_
values can overflow the float32 range.
with pytest.raises(OverflowError): | ||
func(random_series) | ||
|
||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas raises now on NumPy 2 for these.
@@ -1280,6 +1280,9 @@ def test_concat_join_empty_dataframes( | |||
@pytest.mark.parametrize("sort", [True, False]) | |||
@pytest.mark.parametrize("join", ["inner", "outer"]) | |||
@pytest.mark.parametrize("axis", [1]) | |||
@pytest.mark.filterwarnings( | |||
"ignore:No codes are cached because compilation:UserWarning" | |||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about these, I see them locally with the new cupy, but maybe they are not due to the new cupy version?
@@ -341,7 +341,6 @@ def test_dtype(in_dtype, expect): | |||
np.complex128, | |||
complex, | |||
"S", | |||
"a", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoids NumPy deprecation warning.
# TODO: Remove again when NumPy is fixed: | ||
np_ver = np.lib.NumpyVersion(np.__version__) | ||
if dtype == "uint64" and (np_ver == "2.0.0rc1" or np_ver == "2.0.0rc2"): | ||
pytest.skip("test fails due to a NumPy bug on 2.0 rc versions") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy in1d
is used, but was broken for some uint64/int64 mixes. I have already fixed it in NumPy.
# The scalar may be out of bounds, so go via array force-cast | ||
# NOTE: This is a change in behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# The scalar may be out of bounds, so go via array force-cast | |
# NOTE: This is a change in behavior |
Maybe the comment isn't needed anymore (or at least not the note).
else: | ||
val = _maybe_convert_to_default_type( | ||
cudf.api.types.pandas_dtype(type(val)) | ||
).type(val) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This re-arranging enforces that Scalar(-1, dtype=np.uint8)
and others out-of-bound versions raise (at least on NumPy 2 or greater).
Splitting out the non API changes from gh-15897, the Scalar API change is required for the tests to pass with NumPy 2, but almost all changes should be relatively straight forward here on their own. (I will add inline comments.) --- This PR does not fix integer comparisons, there are currently no tests that run into these. xref: rapidsai/build-planning#38 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16141
…#16140) This aligns with NumPy, which deprecated this since a while and raises an error now on NumPy 2, for example for `Scalar(-1, dtype=np.uint8)`. Since it aligns with NumPy, the DeprecationWarning of earlier NumPy versions is inherited for those. This (or similar handling) is required to be compatible with NumPy 2/pandas, since the default needs to be to reject operation when values are out of bounds for e.g. `uint8_series + 1000`, the 1000 should not be silently cast to a `uint8`. --- Split from gh-15897 xref: rapidsai/build-planning#38 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16140
Closing, opened gh-16282 to track the last fix still needed. |
This PR has a few more changes with which tests seem to pass (except the new comparison tests) on NumPy 2 (there is a chance of issues on NumPy 1 for now).
I won't have much time next week to update/work on this but could continue after that, please don't hesitate to push/take over.
It's not a big PR huge now, but it probably makes sense to split out some more changes that seem obvious OK.
I'll add some inline comments after opening the PR. It may make sense to split out more of the changes.
Most important checks:
Scalar(-1, dtype=np.uint8)
to fail, rather than wrap around. I suspect that is fine, but it is a breaking change. (Necessary/helpful because it matches pandas behavior e.g. foruint8_series + 1000
with NumPy 2.libcudf
to compareuint64
andint64
.if other > np.iinfo(self.dtype).max: return full_series_like(self, dtype=bool, value=False)
.pandas.Series.where()
.Ping @vyasr, this is to get the ball rolling and simplify discussion.
Tested via (hack but sharing) creating a custom env, since we need a patched cupy version (that artifact is from the conda-forge PR):
xref: rapidsai/build-planning#38