WIP,API: Enable full Numpy2 compat #15897

seberg · 2024-05-31T12:53:47Z

This PR has a few more changes with which tests seem to pass (except the new comparison tests) on NumPy 2 (there is a chance of issues on NumPy 1 for now).
I won't have much time next week to update/work on this but could continue after that, please don't hesitate to push/take over.
It's not a big PR huge now, but it probably makes sense to split out some more changes that seem obvious OK.

I'll add some inline comments after opening the PR. It may make sense to split out more of the changes.

Most important checks:

This does include the API break for e.g. Scalar(-1, dtype=np.uint8) to fail, rather than wrap around. I suspect that is fine, but it is a breaking change. (Necessary/helpful because it matches pandas behavior e.g. for uint8_series + 1000 with NumPy 2.
The comparison paths need fixing. I have ommitted them, because I wasn't sure how to best deal with them:
- Getting them perfectly right, requires new comparison loops in libcudf to compare uint64 and int64.
- Getting them perfectly right for scalar integers could be done via logic like if other > np.iinfo(self.dtype).max: return full_series_like(self, dtype=bool, value=False).
- I guess cudf currently used the NumPy promotion, so one could use custom value-based logic. That logic may exist elsewhere or be stolen from pandas. Such logic would be useful to exactly match e.g. pandas.Series.where().

Ping @vyasr, this is to get the ball rolling and simplify discussion.

Tested via (hack but sharing) creating a custom env, since we need a patched cupy version (that artifact is from the conda-forge PR):

set -eu

echo "fetching artifact"
cd conda
curl https://artprodeus21.artifacts.visualstudio.com/A910fa339-c7c2-46e8-a579-7ea247548706/84710dde-1620-425b-80d0-4cf5baca359d/_apis/artifact/cGlwZWxpbmVhcnRpZmFjdDovL2NvbmRhLWZvcmdlL3Byb2plY3RJZC84NDcxMGRkZS0xNjIwLTQyNWItODBkMC00Y2Y1YmFjYTM1OWQvYnVpbGRJZC85NDE1NDgvYXJ0aWZhY3ROYW1lL2NvbmRhX2FydGlmYWN0c18yMDI0MDUyMy4xMC4xX2xpbnV4XzY0X2NfY29tcGlsZXJfdmVyc2lvbjEyY3VkYV9jX2gxMzE0MjdkMmRh0/content?format=zip --output artifact.zip
unzip artifact.zip
rm artifact.zip
mv conda_artifacts_* conda_artifact
cd conda_artifact
unzip *.zip
rm *.zip
cd ../..

ARTIFACT_PATH = `pwd`/conda/conda_artifact/build_artifacts
echo "Modifying channels"
sed -i "s$  - conda-forge$  - file://$ARTIFACT_PATH\n  - conda-forge$" dependencies.yaml
rapids-dependency-file-generator

echo "To finalize, please create a new dev environment, for example with:"
echo ""
echo "conda env create --name cudf_dev --file conda/environments/all_cuda-122_arch-x86_64.yaml"
echo "pip install --force-reinstall --pre pyarrow numpy numba"

xref: rapidsai/build-planning#38

This aligns with NumPy, which deprecated this since a while and raises an error now, for example for `Scalar(-1, dtype=np.uint8)`. Necessary to ensure we give the right errors when promotion doesn't up-cast based on values.

Note that pandas `where` seems to promote the Series based on the value even with NumPy 2. This was never copied by cudf (i.e. an outstanding issue)

Pandas keeps using weak promotion even for strongly typed "scalars" (i.e. 0-d objects). This tries to (mostly) match that, but there may be better ways to do it. I am having difficulty to think of the best way though.

copy-pr-bot · 2024-05-31T12:53:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

seberg · 2024-05-31T12:54:46Z

python/cudf/cudf/core/_internals/where.py

+                raise TypeError(
+                    f"Cannot safely cast non-equivalent "
+                    f"{type(other).__name__} to {source_dtype.name}"
+                )


type(other) can fail now e.g. for int8(200), so need to avoid it.

seberg · 2024-05-31T12:55:17Z

python/cudf/cudf/core/_internals/where.py

+    if (
+        _is_non_decimal_numeric_dtype(source_dtype)
+        and not other_is_scalar  # can-cast fails for Python scalars
+        and _can_cast(other, source_dtype)


It seemed safe/correct to me to always take the other path for scalars?

seberg · 2024-05-31T12:56:59Z

python/cudf/cudf/core/column/numerical.py

+            # Go via NumPy to get the value
+            other = np.array(other)
+            if other.dtype.kind in "ifc":
+                other = other.item()


I am not sure what to do best here. Pandas has a lot of isinstance(x, np.integer) checks, I think. But to match pandas we need to accept 0-D arrays and probably even 0-D tensors.
However, it seemed like this path is only taken for arrays?

seberg · 2024-05-31T12:57:42Z

python/cudf/cudf/core/column/numerical.py

-            if (self.min() >= min_) and (self.max() <= max_):
+            # Use Python floats, which have precise comparison for float64.
+            # NOTE(seberg): it would make sense to limit to the mantissa range.
+            if (float(self.min()) >= min_) and (float(self.max()) <= max_):


Just to explain, NumPy 2 will compare this as float32, but the min_/max_ values can overflow the float32 range.

seberg · 2024-05-31T12:58:19Z

python/cudf/cudf/tests/test_binops.py

+        with pytest.raises(OverflowError):
+            func(random_series)
+
+        return


Pandas raises now on NumPy 2 for these.

seberg · 2024-05-31T12:59:12Z

python/cudf/cudf/tests/test_concat.py

@@ -1280,6 +1280,9 @@ def test_concat_join_empty_dataframes(
 @pytest.mark.parametrize("sort", [True, False])
 @pytest.mark.parametrize("join", ["inner", "outer"])
 @pytest.mark.parametrize("axis", [1])
+@pytest.mark.filterwarnings(
+    "ignore:No codes are cached because compilation:UserWarning"
+)


I am not sure about these, I see them locally with the new cupy, but maybe they are not due to the new cupy version?

seberg · 2024-05-31T12:59:33Z

python/cudf/cudf/tests/test_dtypes.py

@@ -341,7 +341,6 @@ def test_dtype(in_dtype, expect):
        np.complex128,
        complex,
        "S",
-        "a",


Avoids NumPy deprecation warning.

seberg · 2024-05-31T13:02:04Z

python/cudf/cudf/tests/test_transform.py

+    # TODO: Remove again when NumPy is fixed:
+    np_ver = np.lib.NumpyVersion(np.__version__)
+    if dtype == "uint64" and (np_ver == "2.0.0rc1" or np_ver == "2.0.0rc2"):
+        pytest.skip("test fails due to a NumPy bug on 2.0 rc versions")


NumPy in1d is used, but was broken for some uint64/int64 mixes. I have already fixed it in NumPy.

seberg · 2024-05-31T13:02:44Z

python/cudf/cudf/tests/test_unaops.py

+    # The scalar may be out of bounds, so go via array force-cast
+    # NOTE: This is a change in behavior


Suggested change

# The scalar may be out of bounds, so go via array force-cast

# NOTE: This is a change in behavior

Maybe the comment isn't needed anymore (or at least not the note).

seberg · 2024-05-31T13:03:22Z

python/cudf/cudf/utils/dtypes.py

+    else:
+        val = _maybe_convert_to_default_type(
+            cudf.api.types.pandas_dtype(type(val))
+        ).type(val)


This re-arranging enforces that Scalar(-1, dtype=np.uint8) and others out-of-bound versions raise (at least on NumPy 2 or greater).

Splitting out the non API changes from gh-15897, the Scalar API change is required for the tests to pass with NumPy 2, but almost all changes should be relatively straight forward here on their own. (I will add inline comments.) --- This PR does not fix integer comparisons, there are currently no tests that run into these. xref: rapidsai/build-planning#38 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16141

…#16140) This aligns with NumPy, which deprecated this since a while and raises an error now on NumPy 2, for example for `Scalar(-1, dtype=np.uint8)`. Since it aligns with NumPy, the DeprecationWarning of earlier NumPy versions is inherited for those. This (or similar handling) is required to be compatible with NumPy 2/pandas, since the default needs to be to reject operation when values are out of bounds for e.g. `uint8_series + 1000`, the 1000 should not be silently cast to a `uint8`. --- Split from gh-15897 xref: rapidsai/build-planning#38 Authors: - Sebastian Berg (https://github.com/seberg) Approvers: - Matthew Roeschke (https://github.com/mroeschke) URL: #16140

seberg · 2024-07-15T19:05:38Z

Closing, opened gh-16282 to track the last fix still needed.

seberg added 10 commits May 31, 2024 14:52

API: Change Scalar to raise for invalid Python values

704771d

This aligns with NumPy, which deprecated this since a while and raises an error now, for example for `Scalar(-1, dtype=np.uint8)`. Necessary to ensure we give the right errors when promotion doesn't up-cast based on values.

TST: Skip test when running under buggy NumPy versions

fd27aa0

TST: NumPy now prints np.int8(1), keep using old printing in tests

ecf1558

TST: Silence cupy warnings (TODO: not sure this is needed)

a5a52ba

TST: NumPy is deprecating "a" as a dtype, just skip it

89162d1

MAINT: Use int8(-1) as default categorical as np.uint8(-1) fails

21d25aa

MAINT: Adapt to e.g. uint8(-1) failing now. Mostly in where.

b17af57

Note that pandas `where` seems to promote the Series based on the value even with NumPy 2. This was never copied by cudf (i.e. an outstanding issue)

MAINT: avoid can_cast(pyscalar, dtype) NumPy 2 refuses it (for now)

9e5d611

MAINT: Adapt numerical promotion to NumPy 2 and Pandas 2.2

314d047

Pandas keeps using weak promotion even for strongly typed "scalars" (i.e. 0-d objects). This tries to (mostly) match that, but there may be better ways to do it. I am having difficulty to think of the best way though.

MAINT: Adapt to NumPy weak-promotion in comparisons

d1c7224

github-actions bot added the Python Affects Python cuDF API. label May 31, 2024

seberg commented May 31, 2024

View reviewed changes

TST: Add test for equality, this test fails as of now!

d69d9cb

seberg force-pushed the numpy2-compat branch from b3c376d to d69d9cb Compare May 31, 2024 13:03

mroeschke added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change breaking Breaking change and removed non-breaking Non-breaking change labels Jun 3, 2024

This was referenced Jul 1, 2024

API: Check for integer overflows when creating scalar form python int #16140

Merged

MAINT: Adapt to NumPy 2 promotion changes #16141

Merged

seberg closed this Jul 15, 2024

seberg deleted the numpy2-compat branch July 15, 2024 19:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP,API: Enable full Numpy2 compat #15897

WIP,API: Enable full Numpy2 compat #15897

seberg commented May 31, 2024 •

edited by jakirkham

Loading

copy-pr-bot bot commented May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg May 31, 2024

seberg commented Jul 15, 2024

		# The scalar may be out of bounds, so go via array force-cast
		# NOTE: This is a change in behavior

WIP,API: Enable full Numpy2 compat #15897

WIP,API: Enable full Numpy2 compat #15897

Conversation

seberg commented May 31, 2024 • edited by jakirkham Loading

copy-pr-bot bot commented May 31, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg commented Jul 15, 2024

seberg commented May 31, 2024 •

edited by jakirkham

Loading