Changes to support Numpy >= 1.24 #13325

Merged
merged 22 commits into from
May 25, 2023
Changes from 7 commits
2 changes: 2 additions & 0 deletions python/cudf/cudf/__init__.py
@@ -85,6 +85,7 @@

try:
from cubinlinker.patch import patch_numba_linker_if_needed
from ptxcompiler.patch import patch_numba_codegen_if_needed
except ImportError:
pass
else:
@@ -96,6 +97,7 @@

_setup_numba_linker(_PTX_FILE)

patch_numba_codegen_if_needed()
Contributor Author

This hack will go away

Contributor

I think it can be removed from this PR itself. Now that we've verified numpy 1.24 support, I would recommend removing the numba-related changes in this PR that you're using in order to allow running numba 0.57 (which is necessary to use numpy 1.24). We'll still get it tested because our CUDA 12 wheel builds will patch us to use 0.57 anyway (but with CUDA 12 we don't use cubinlinker/ptxcompiler so we don't need any edits for those). Then when we bump our numba to 0.57 tests should pass thanks to this PR.

Contributor

I have one other question on this PR. I would wait to make changes here until everything else is resolved, in case you need to run more tests locally.

del patch_numba_linker_if_needed

cuda.set_memory_manager(RMMNumbaManager)
10 changes: 6 additions & 4 deletions python/cudf/cudf/core/column/numerical.py
@@ -766,10 +766,12 @@ def _normalize_find_and_replace_input(
if len(col_to_normalize) == 1:
if cudf._lib.scalar._is_null_host_scalar(col_to_normalize[0]):
return normalized_column.astype(input_column_dtype)
else:
col_to_normalize_casted = input_column_dtype.type(
col_to_normalize[0]
)
if np.isinf(col_to_normalize[0]):
return normalized_column
col_to_normalize_casted = np.array(col_to_normalize[0]).astype(
input_column_dtype
)

if not np.isnan(col_to_normalize_casted) and (
col_to_normalize_casted != col_to_normalize[0]
):
4 changes: 2 additions & 2 deletions python/cudf/cudf/tests/test_column.py
@@ -398,8 +398,8 @@ def test_column_view_string_slice(slc):
cudf.core.column.as_column([], dtype="uint8"),
),
(
cp.array([453], dtype="uint8"),
cudf.core.column.as_column([453], dtype="uint8"),
cp.array([255], dtype="uint8"),
Contributor

I guess it doesn't matter what value we choose here? Just wondering if it's important to use 453-256.

Contributor Author

I don't think it matters. I use 255 just because it's `np.iinfo(np.uint8).max`.
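
For context on the two values discussed here: 453 does not fit in `uint8`, while 255 is the largest value the type can hold. A minimal sketch, with illustrative variable names that are not part of the PR:

```python
import numpy as np

# Largest value a uint8 can hold.
uint8_max = np.iinfo(np.uint8).max

# 453 is out of range for uint8. Casting an existing integer array with
# astype wraps modulo 256, which gives the 453 - 256 mentioned above.
wrapped = int(np.array([453]).astype(np.uint8)[0])

print(uint8_max, wrapped, 453 - 256)
```

Using an in-range value such as 255 keeps the test independent of how out-of-range Python integers are handled.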

cudf.core.column.as_column([255], dtype="uint8"),
),
],
)
6 changes: 3 additions & 3 deletions python/cudf/cudf/tests/test_csv.py
@@ -150,8 +150,8 @@ def make_all_numeric_extremes_dataframe():
np_type = pdf_dtypes[gdf_dtype]
if np.issubdtype(np_type, np.integer):
itype = np.iinfo(np_type)
extremes = [0, +1, -1, itype.min, itype.max]
df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
extremes = [itype.min, itype.max]
Contributor Author

Need to change the comments at the beginning of these tests

Contributor

Could you elaborate? Is this a task you want to accomplish in this PR?

Contributor

I know that 0, +1, and -1 aren't extrema for integer types, but is there a reason you remove them from these tests? I suppose perhaps that np.uint8(-1) now raises OverflowError or something?

Contributor Author

np.uint8(-1) in particular raises a deprecation warning. But the reason I gave up on this test was because I couldn't figure out why this was happening:

In [2]: np.array([-1]).astype("uint64")
Out[2]: array([18446744073709551615], dtype=uint64)

In [3]: np.array([18446744073709551615]).astype("uint64")
Out[3]: array([18446744073709551615], dtype=uint64)

In [4]: np.array([-1, 18446744073709551615]).astype("uint64")
<ipython-input-4-03014ed268fc>:1: RuntimeWarning: invalid value encountered in cast
  np.array([-1, 18446744073709551615]).astype("uint64")
Out[4]: array([18446744073709551615,                    0], dtype=uint64)

I've gone ahead and filtered out that warning from this test.

Contributor

> But the reason I gave up on this test was because I couldn't figure out why this was happening:

np.array([-1, 2**64 - 1]).dtype == "float64"

which is lossy.
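
A minimal sketch of why the mixed list misbehaves, assuming NumPy's usual promotion rules: no integer dtype holds both -1 and 2**64 - 1, so the list becomes float64, where 2**64 - 1 rounds up to 2**64 and no longer fits in uint64. The variable names are illustrative.

```python
import numpy as np

# int64 and uint64 have no common integer type, so NumPy promotes to float64.
common = np.promote_types(np.int64, np.uint64)

# The same promotion applies when a list mixes -1 with 2**64 - 1.
mixed = np.array([-1, 2**64 - 1])

# float64 has a 53-bit significand, so 2**64 - 1 is rounded up to 2**64,
# which is one past the largest uint64 and cannot be cast back exactly.
rounded = float(2**64 - 1)

print(common, mixed.dtype, rounded == 2.0**64)
```

This is consistent with the session above: each value alone keeps an integer dtype, and only the mixed list passes through float64 before the cast.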

df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
Contributor

Any reason to change from 4 pairs of extrema to 10?

else:
ftype = np.finfo(np_type)
extremes = [
@@ -1433,7 +1433,7 @@ def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):

gdf = read_csv(StringIO(buffer), dtype=[gdf_dtype], names=["hex_int"])

expected = np.array(values, dtype=np_dtype)
expected = np.array(values).astype(np_dtype)
actual = gdf["hex_int"].to_numpy()
np.testing.assert_array_equal(expected, actual)

9 changes: 1 addition & 8 deletions python/cudf/cudf/tests/test_feather.py
@@ -15,23 +15,16 @@
@pytest.fixture(params=[0, 1, 10, 100])
def pdf(request):
types = NUMERIC_TYPES + ["bool"]
typer = {"col_" + val: val for val in types}
ncols = len(types)
nrows = request.param

# Create a pandas dataframe with random data of mixed types
test_pdf = pd.DataFrame(
[list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
{f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
Contributor

All of the columns in this dataframe now have type int64, no? Since they are never downcast with astype.
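
One way to address this, sketched with plain NumPy arrays rather than the fixture's pandas DataFrame: cast each randomly generated column to the dtype named in its key. The `types` list below is an illustrative subset, not the `NUMERIC_TYPES` constant from the test suite.

```python
import numpy as np

# Illustrative subset of dtypes; the real fixture uses NUMERIC_TYPES + ["bool"].
types = ["int8", "uint16", "float32", "bool"]
nrows = 10

np.random.seed(0)

# Without the astype call every column keeps the default integer dtype
# returned by randint; with it, each column matches the dtype in its name.
columns = {
    f"col_{typ}": np.random.randint(0, nrows, nrows).astype(typ)
    for typ in types
}

for name, values in columns.items():
    print(name, values.dtype)
```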

)
# Delete the name of the column index, and rename the row index
test_pdf.columns.name = None
test_pdf.index.name = "index"

# Cast all the column dtypes to objects, rename them, and then cast to
# appropriate types
test_pdf = test_pdf.astype("object").astype(typer)

# Create non-numeric categorical data otherwise may get typecasted
data = [ascii_letters[np.random.randint(0, 52)] for i in range(nrows)]
test_pdf["col_category"] = pd.Series(data, dtype="category")
4 changes: 1 addition & 3 deletions python/cudf/cudf/tests/test_json.py
@@ -32,13 +32,11 @@ def make_numeric_dataframe(nrows, dtype):
def pdf(request):
types = NUMERIC_TYPES + DATETIME_TYPES + ["bool"]
typer = {"col_" + val: val for val in types}
ncols = len(types)
nrows = request.param

# Create a pandas dataframe with random data of mixed types
test_pdf = pd.DataFrame(
[list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
{f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
Contributor

This one in contrast is cast to the appropriate type.

)
# Delete the name of the column index, and rename the row index
test_pdf.columns.name = None
6 changes: 5 additions & 1 deletion python/cudf/cudf/tests/test_numerical.py
@@ -1,4 +1,4 @@
# Copyright (c) 2021-2022, NVIDIA CORPORATION.
# Copyright (c) 2021-2023, NVIDIA CORPORATION.

import numpy as np
import pandas as pd
@@ -194,6 +194,7 @@ def test_to_numeric_downcast_int(data, downcast):
assert_eq(expected, got)


@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
Contributor

Instead of applying this to the whole test, can we just wrap the pd.to_numeric call? This doesn't affect the cudf.to_numeric call, does it?

Also, should we be handling the warning conditionally? That is, I assume this happens when trying to downcast a signed to an unsigned type or something?
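
A sketch of the narrower scoping suggested here, using only the standard library. `noisy_cast` is a hypothetical stand-in for the `pd.to_numeric` call that triggers the warning; the point is that the filter applies only inside the `with` block.

```python
import warnings


def noisy_cast(value):
    # Hypothetical stand-in for a call that emits NumPy's cast warning.
    warnings.warn("invalid value encountered in cast", RuntimeWarning)
    return value


# Suppress the warning only around the call that is expected to emit it.
with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="invalid value encountered in cast",
        category=RuntimeWarning,
    )
    expected = noisy_cast(1)

# Outside that block the warning is reported again, so other calls in the
# same test are still checked.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_cast(2)

print(expected, len(caught))
```

Compared with a test-wide `filterwarnings` mark, this keeps an unexpected warning from the cudf side visible.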

@pytest.mark.parametrize(
"data",
[
@@ -223,6 +224,7 @@ def test_to_numeric_downcast_float(data, downcast):
assert_eq(expected, got)


@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
@pytest.mark.parametrize(
"data",
[
@@ -245,6 +247,7 @@ def test_to_numeric_downcast_large_float(data, downcast):
assert_eq(expected, got)


@pytest.mark.filterwarnings("ignore:overflow encountered in cast")
@pytest.mark.parametrize(
"data",
[
@@ -325,6 +328,7 @@ def test_to_numeric_downcast_string_float(data, downcast):
assert_eq(expected, got)


@pytest.mark.filterwarnings("ignore:overflow encountered in cast")
@pytest.mark.parametrize(
"data",
[
17 changes: 2 additions & 15 deletions python/cudf/cudf/tests/test_parquet.py
@@ -69,14 +69,11 @@ def simple_pdf(request):
"float32",
"float64",
]
typer = {"col_" + val: val for val in types}
ncols = len(types)
nrows = request.param

# Create a pandas dataframe with random data of mixed types
test_pdf = pd.DataFrame(
[list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
{f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types},
# Need to ensure that this index is not a RangeIndex to get the
# expected round-tripping behavior from Parquet reader/writer.
index=pd.Index(list(range(nrows))),
@@ -85,10 +82,6 @@ def simple_pdf(request):
test_pdf.columns.name = None
test_pdf.index.name = "test_index"

# Cast all the column dtypes to objects, rename them, and then cast to
# appropriate types
test_pdf = test_pdf.astype("object").astype(typer)
wence- marked this conversation as resolved.

return test_pdf


@@ -115,13 +108,11 @@ def build_pdf(num_columns, day_resolution_timestamps):
"datetime64[us]",
"str",
]
typer = {"col_" + val: val for val in types}
ncols = len(types)
nrows = num_columns.param

# Create a pandas dataframe with random data of mixed types
test_pdf = pd.DataFrame(
[list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
{f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types},
columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
# Need to ensure that this index is not a RangeIndex to get the
# expected round-tripping behavior from Parquet reader/writer.
@@ -131,10 +122,6 @@ def build_pdf(num_columns, day_resolution_timestamps):
test_pdf.columns.name = None
test_pdf.index.name = "test_index"

# Cast all the column dtypes to objects, rename them, and then cast to
# appropriate types
test_pdf = test_pdf.astype(typer)
wence- marked this conversation as resolved.

# make datetime64's a little more interesting by increasing the range of
# dates note that pandas will convert these to ns timestamps, so care is
# taken to avoid overflowing a ns timestamp. There is also the ability to
13 changes: 6 additions & 7 deletions python/cudf/cudf/tests/test_rank.py
@@ -1,4 +1,4 @@
# Copyright (c) 2020-2022, NVIDIA CORPORATION.
# Copyright (c) 2020-2023, NVIDIA CORPORATION.

from itertools import chain, combinations_with_replacement, product

@@ -125,7 +125,7 @@ def test_rank_error_arguments(pdf):
np.full((3,), np.inf),
np.full((3,), -np.inf),
]
sort_dtype_args = [np.int32, np.int64, np.float32, np.float64]
sort_dtype_args = [np.float32, np.float64]
Contributor

This means we now don't run some tests with integer dtypes. Is it that they don't make sense any more?



@pytest.mark.parametrize(
@@ -139,13 +139,12 @@ def test_rank_error_arguments(pdf):
)
def test_series_rank_combinations(elem, dtype):
np.random.seed(0)
aa = np.fromiter(chain.from_iterable(elem), dtype=dtype)
gdf = DataFrame()
gdf["a"] = aa = np.fromiter(chain.from_iterable(elem), np.float64).astype(
dtype
)
ranked_gs = gdf["a"].rank(method="first")
df = pd.DataFrame()
gdf["a"] = aa
df["a"] = aa
ranked_gs = gdf["a"].rank(method="first")
ranked_ps = df["a"].rank(method="first")
# Check
assert_eq(ranked_ps, ranked_gs.to_pandas())
assert_eq(ranked_ps, ranked_gs)
13 changes: 10 additions & 3 deletions python/cudf/cudf/tests/test_replace.py
@@ -944,8 +944,15 @@ def test_numeric_series_replace_dtype(series_dtype, replacement):
psr = pd.Series([0, 1, 2, 3, 4, 5], dtype=series_dtype)
sr = cudf.from_pandas(psr)

if sr.dtype.kind in "ui":
can_replace = np.array([replacement])[0].is_integer() and np.can_cast(
int(replacement), sr.dtype
)
else:
can_replace = np.can_cast(replacement, sr.dtype)

# Both Scalar
if sr.dtype.type(replacement) != replacement:
if not can_replace:
with pytest.raises(TypeError):
sr.replace(1, replacement)
else:
@@ -954,7 +961,7 @@ def test_numeric_series_replace_dtype(series_dtype, replacement):
assert_eq(expect, got)

# to_replace is a list, replacement is a scalar
if sr.dtype.type(replacement) != replacement:
if not can_replace:
with pytest.raises(TypeError):

sr.replace([2, 3], replacement)
@@ -974,7 +981,7 @@ def test_numeric_series_replace_dtype(series_dtype, replacement):
# Both lists of equal length
if (
np.dtype(type(replacement)).kind == "f" and sr.dtype.kind in {"i", "u"}
) or (sr.dtype.type(replacement) != replacement):
) or (not can_replace):
with pytest.raises(TypeError):
sr.replace([2, 3], [replacement, replacement])
else:
4 changes: 2 additions & 2 deletions python/cudf/cudf/tests/test_sparse_df.py
@@ -1,4 +1,4 @@
# Copyright (c) 2018-2022, NVIDIA CORPORATION.
# Copyright (c) 2018-2023, NVIDIA CORPORATION.

import numpy as np

@@ -7,7 +7,7 @@

def test_to_dense_array():
data = np.random.random(8)
mask = np.asarray([0b11010110], dtype=np.byte)
mask = np.asarray([0b11010110]).astype(np.byte)

sr = Series.from_masked_array(data=data, mask=mask, null_count=3)
assert sr.has_nulls
8 changes: 6 additions & 2 deletions python/cudf/cudf/tests/test_unaops.py
@@ -1,4 +1,4 @@
# Copyright (c) 2019-2022, NVIDIA CORPORATION.
# Copyright (c) 2019-2023, NVIDIA CORPORATION.

import itertools
import operator
@@ -79,9 +79,13 @@ def generate_valid_scalar_unaop_combos():

@pytest.mark.parametrize("slr,dtype,op", generate_valid_scalar_unaop_combos())
def test_scalar_unary_operations(slr, dtype, op):
slr_host = cudf.dtype(dtype).type(slr)
slr_host = np.array([slr])[0].astype(cudf.dtype(dtype))
slr_device = cudf.Scalar(slr, dtype=dtype)

if op.__name__ == "neg" and np.dtype(dtype).kind == "u":
# TODO: what do we want to do here?
return
Contributor

Numpy is fine with this right? Right? Negation of unsigned integers is totally well-defined.

Contributor Author

Well:

In [2]: -np.uint16(1)
<ipython-input-2-5156426c8f88>:1: RuntimeWarning: overflow encountered in scalar negative
  -np.uint16(1)
Out[2]: 65535

Should we just go ahead and ignore that warning in this test? (I've resorted to doing that in most other cases)
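
For reference, negating an unsigned integer is well defined as arithmetic modulo 2**bits, and that is the value shown in the session above; the warning only flags the wraparound. A minimal sketch that uses a one-element array instead of the scalar form:

```python
import numpy as np

bits = 16

# Negation modulo 2**16 in plain Python.
expected = (-1) % 2**bits

# The same operation on a one-element uint16 array wraps to the same value.
negated = int((-np.array([1], dtype=np.uint16))[0])

print(expected, negated)
```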


expect = op(slr_host)
got = op(slr_device)

2 changes: 1 addition & 1 deletion python/cudf/cudf/utils/queryutils.py
@@ -137,7 +137,7 @@ def query_compile(expr):
key "args" is a sequence of name of the arguments.
"""

funcid = f"queryexpr_{np.uintp(hash(expr)):x}"
funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
Contributor

This is incorrect, hash returns in the semi-open interval [-2**63, 2**63), but abs folds this to the closed interval [0, 2**63] (so you alias just shy of 50% of the values). Instead, you want to shift, I suspect, and then you don't need numpy in the loop at all:

Suggested change
-funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
+funcid = f"queryexpr_{hash(expr) + 2**63:x}"
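
To make the aliasing concrete, here is a small sketch that replaces `hash(expr)` with explicit integers. It assumes the 64-bit hash range stated above; `abs_key` and `shift_key` are illustrative names, not functions from cudf.

```python
# Assumes hash values lie in [-2**63, 2**63), as on 64-bit CPython.
OFFSET = 2**63


def abs_key(h):
    # Folding with abs maps h and -h to the same key.
    return f"queryexpr_{abs(h):x}"


def shift_key(h):
    # Shifting by 2**63 maps the range onto [0, 2**64) one-to-one.
    return f"queryexpr_{h + OFFSET:x}"


print(abs_key(5) == abs_key(-5))      # the two hashes alias
print(shift_key(5) == shift_key(-5))  # the two hashes stay distinct
print(shift_key(-2**63), shift_key(2**63 - 1))
```

The shifted form needs no NumPy call, so it also sidesteps the out-of-range integer conversion that motivated the change.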

That said, strings are hashable, so this seems like a weird way of constructing a cache key (it's somehow deliberately making it more likely that you get hash collisions and produce the wrong value).

I would have thought that this would do the trick:

@functools.cache
def query_compile(expr):
    name = "queryexpr"  # these are only looked up locally so names can collide
    info = query_parser(expr)
    fn = query_builder(info, name)
    args = info["args"]
    devicefn = cuda.jit(device=True)(fn)
    kernel = _wrap_query_expr(f"kernel_{name}", devicefn, args)
    info["kernel"] = kernel
    return info
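
The caching idea can be illustrated without cudf. In this sketch `compile_expr` is a hypothetical stand-in for the parse-and-compile steps; `functools.cache` (Python 3.9+) keys on the expression string itself, so two different expressions cannot share an entry even if their hashes collide.

```python
import functools

calls = []


@functools.cache
def compile_expr(expr):
    # Hypothetical stand-in for parsing and compiling a query expression.
    calls.append(expr)
    return {"expr": expr, "args": sorted(set(filter(str.isalpha, expr)))}


first = compile_expr("a > b")
second = compile_expr("a > b")  # served from the cache, no recompilation
other = compile_expr("a < c")

print(first is second, len(calls))
```

Because the cached dictionary is shared between callers, code that receives it should treat it as read-only.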

# Load cache
compiled = _cache.get(funcid)
# Cache not found