# Changes to support Numpy >= 1.24 (#13325)
```diff
@@ -398,8 +398,8 @@ def test_column_view_string_slice(slc):
             cudf.core.column.as_column([], dtype="uint8"),
         ),
         (
-            cp.array([453], dtype="uint8"),
-            cudf.core.column.as_column([453], dtype="uint8"),
+            cp.array([255], dtype="uint8"),
+            cudf.core.column.as_column([255], dtype="uint8"),
         ),
     ],
 )
```

**Comment:** I guess it doesn't matter what value we choose here? Just wondering if it's important to use 453 - 256.

**Reply:** I don't think it matters. I use 255 because it's in range for `uint8`.
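The likely motivation for swapping 453 for 255 (my reading; the thread doesn't spell it out): NumPy 1.24 deprecates silently wrapping out-of-range Python integers at array-construction time, and its warning message even recommends the `np.array(...).astype(...)` pattern that the CSV test below switches to.

```python
import numpy as np

# NumPy >= 1.24: constructing an integer array from an out-of-range Python
# int is deprecated instead of silently wrapping.
np.array([453], dtype="uint8")
# DeprecationWarning: NumPy will stop allowing conversion of out-of-bound
# Python integers to integer arrays.

# The explicit two-step form keeps the old modular behavior:
print(np.array([453]).astype("uint8"))  # [197], i.e. 453 % 256
```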
```diff
@@ -150,8 +150,8 @@ def make_all_numeric_extremes_dataframe():
         np_type = pdf_dtypes[gdf_dtype]
         if np.issubdtype(np_type, np.integer):
             itype = np.iinfo(np_type)
-            extremes = [0, +1, -1, itype.min, itype.max]
-            df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
+            extremes = [itype.min, itype.max]
+            df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
         else:
             ftype = np.finfo(np_type)
             extremes = [
```

**Comment:** Need to change the comments at the beginning of these tests.

**Reply:** Could you elaborate? Is this a task you want to accomplish in this PR?

**Comment:** I know that 0, +1, and -1 aren't extrema for integer types, but is there a reason you remove them from these tests?

**Reply:** Consider what happens when -1 and the `uint64` maximum end up in the same array:

```
In [2]: np.array([-1]).astype("uint64")
Out[2]: array([18446744073709551615], dtype=uint64)

In [3]: np.array([18446744073709551615]).astype("uint64")
Out[3]: array([18446744073709551615], dtype=uint64)

In [4]: np.array([-1, 18446744073709551615]).astype("uint64")
<ipython-input-4-03014ed268fc>:1: RuntimeWarning: invalid value encountered in cast
  np.array([-1, 18446744073709551615]).astype("uint64")
Out[4]: array([18446744073709551615, 0], dtype=uint64)
```

I've gone ahead and filtered out that warning from this test.

**Comment:** The mixed array in `In [4]` is inferred as `float64`, which is lossy.

**Comment:** Any reason to change from 4 pairs of extrema to 10?
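A quick check of my own (not from the thread) showing why that intermediate is lossy: no single integer dtype holds both -1 and 2**64 - 1, so NumPy promotes the mixed array to `float64`, which cannot represent 2**64 - 1 exactly.

```python
import numpy as np

arr = np.array([-1, 18446744073709551615])
print(arr.dtype)  # float64: no single integer dtype fits both values

# float64 has a 53-bit mantissa, so 2**64 - 1 rounds up to 2**64 -- out of
# range for uint64, hence the "invalid value encountered in cast" warning
# and the wrapped 0 seen in Out[4] above.
print(float(2**64 - 1) == 2.0**64)  # True
```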
```diff
@@ -1433,7 +1433,7 @@ def test_csv_reader_hexadecimal_overflow(np_dtype, gdf_dtype):

     gdf = read_csv(StringIO(buffer), dtype=[gdf_dtype], names=["hex_int"])

-    expected = np.array(values, dtype=np_dtype)
+    expected = np.array(values).astype(np_dtype)
     actual = gdf["hex_int"].to_numpy()
     np.testing.assert_array_equal(expected, actual)
```
```diff
@@ -15,23 +15,16 @@
 @pytest.fixture(params=[0, 1, 10, 100])
 def pdf(request):
     types = NUMERIC_TYPES + ["bool"]
-    typer = {"col_" + val: val for val in types}
-    ncols = len(types)
     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
     )
     # Delete the name of the column index, and rename the row index
     test_pdf.columns.name = None
     test_pdf.index.name = "index"

-    # Cast all the column dtypes to objects, rename them, and then cast to
-    # appropriate types
-    test_pdf = test_pdf.astype("object").astype(typer)
-
     # Create non-numeric categorical data otherwise may get typecasted
     data = [ascii_letters[np.random.randint(0, 52)] for i in range(nrows)]
     test_pdf["col_category"] = pd.Series(data, dtype="category")
```

**Comment:** All of the columns in this dataframe now have the default integer type rather than the dtype their `col_*` names advertise.
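To make that comment concrete (my illustration, not part of the PR): `np.random.randint` returns NumPy's default platform integer, so once the `astype(typer)` step is gone every generated column shares that one dtype.

```python
import numpy as np

col = np.random.randint(0, 10, 10)
print(col.dtype)  # int64 on 64-bit Linux (the platform default integer)

# A column named "col_float32" built this way is therefore not float32
# unless it is explicitly cast afterwards.
```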
```diff
@@ -32,13 +32,11 @@ def make_numeric_dataframe(nrows, dtype):
 def pdf(request):
     types = NUMERIC_TYPES + DATETIME_TYPES + ["bool"]
     typer = {"col_" + val: val for val in types}
-    ncols = len(types)
     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
    )
     # Delete the name of the column index, and rename the row index
     test_pdf.columns.name = None
```

**Comment:** This one, in contrast, is cast to the appropriate type.
```diff
@@ -1,4 +1,4 @@
-# Copyright (c) 2021-2022, NVIDIA CORPORATION.
+# Copyright (c) 2021-2023, NVIDIA CORPORATION.

 import numpy as np
 import pandas as pd
```

```diff
@@ -194,6 +194,7 @@ def test_to_numeric_downcast_int(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
 @pytest.mark.parametrize(
     "data",
     [
```

**Comment:** Instead of applying this to the whole test, can we just wrap the statement that actually warns? Also, should we be handling the warning conditionally? I.e., I assume this happens when trying to downcast a signed to an unsigned type or something?
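A sketch of what that narrower scoping might look like (the test body here is illustrative, not the actual cudf test): filter the warning only around the call that emits it, leaving the rest of the test sensitive to unexpected warnings.

```python
import warnings

import pandas as pd

import cudf
from cudf.testing._utils import assert_eq  # import path as of this era of cudf


def test_to_numeric_downcast_int(data, downcast):
    expected = pd.to_numeric(data, downcast=downcast)
    with warnings.catch_warnings():
        # Scope the filter to the one call that performs the lossy cast.
        warnings.filterwarnings(
            "ignore", message="invalid value encountered in cast"
        )
        got = cudf.to_numeric(data, downcast=downcast)
    assert_eq(expected, got)
```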
```diff
@@ -223,6 +224,7 @@ def test_to_numeric_downcast_float(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
 @pytest.mark.parametrize(
     "data",
     [
```

```diff
@@ -245,6 +247,7 @@ def test_to_numeric_downcast_large_float(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:overflow encountered in cast")
 @pytest.mark.parametrize(
     "data",
     [
```

```diff
@@ -325,6 +328,7 @@ def test_to_numeric_downcast_string_float(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:overflow encountered in cast")
 @pytest.mark.parametrize(
     "data",
     [
```
```diff
@@ -1,4 +1,4 @@
-# Copyright (c) 2020-2022, NVIDIA CORPORATION.
+# Copyright (c) 2020-2023, NVIDIA CORPORATION.

 from itertools import chain, combinations_with_replacement, product
```

```diff
@@ -125,7 +125,7 @@ def test_rank_error_arguments(pdf):
     np.full((3,), np.inf),
     np.full((3,), -np.inf),
 ]
-sort_dtype_args = [np.int32, np.int64, np.float32, np.float64]
+sort_dtype_args = [np.float32, np.float64]


 @pytest.mark.parametrize(
```

**Comment:** This means we now don't run some tests with integer dtypes. Is it that they don't make sense any more?
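My guess at the answer (not stated in the thread): several of the parametrized inputs are ±inf, and under NumPy 1.24 pushing non-finite floats through an integer dtype is an invalid cast, so the integer variants of these combinations would warn or produce garbage rather than exercise anything meaningful.

```python
import numpy as np

a = np.full((3,), np.inf)
b = a.astype(np.int32)
# NumPy >= 1.24: RuntimeWarning: invalid value encountered in cast
print(b)  # implementation-defined integers, not useful rank inputs
```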
```diff
@@ -139,13 +139,12 @@ def test_rank_error_arguments(pdf):
 )
 def test_series_rank_combinations(elem, dtype):
     np.random.seed(0)
-    aa = np.fromiter(chain.from_iterable(elem), dtype=dtype)
     gdf = DataFrame()
+    gdf["a"] = aa = np.fromiter(chain.from_iterable(elem), np.float64).astype(
+        dtype
+    )
+    ranked_gs = gdf["a"].rank(method="first")
     df = pd.DataFrame()
-    gdf["a"] = aa
     df["a"] = aa
-    ranked_gs = gdf["a"].rank(method="first")
     ranked_ps = df["a"].rank(method="first")
     # Check
-    assert_eq(ranked_ps, ranked_gs.to_pandas())
+    assert_eq(ranked_ps, ranked_gs)
```
```diff
@@ -1,4 +1,4 @@
-# Copyright (c) 2019-2022, NVIDIA CORPORATION.
+# Copyright (c) 2019-2023, NVIDIA CORPORATION.

 import itertools
 import operator
```

```diff
@@ -79,9 +79,13 @@ def generate_valid_scalar_unaop_combos():

 @pytest.mark.parametrize("slr,dtype,op", generate_valid_scalar_unaop_combos())
 def test_scalar_unary_operations(slr, dtype, op):
-    slr_host = cudf.dtype(dtype).type(slr)
+    slr_host = np.array([slr])[0].astype(cudf.dtype(dtype))
     slr_device = cudf.Scalar(slr, dtype=dtype)

+    if op.__name__ == "neg" and np.dtype(dtype).kind == "u":
+        # TODO: what do we want to do here?
+        return
+
     expect = op(slr_host)
     got = op(slr_device)
```

**Comment:** Numpy is fine with this right? Right? Negation of unsigned integers is totally well-defined.

**Reply:** Well, it wraps around, but NumPy now emits a `RuntimeWarning` when negating an unsigned scalar. Should we just go ahead and ignore that warning in this test? (I've resorted to doing that in most other cases.)
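To make the exchange concrete (the reply's own snippet didn't survive extraction; this reconstruction reflects the behavior the thread describes): negating an unsigned NumPy scalar wraps modulo 2**bits, but recent NumPy reports it as a warning.

```python
import warnings

import numpy as np

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = -np.uint8(1)  # wraps: 2**8 - 1

print(result)  # 255
for w in caught:
    print(w.category.__name__, w.message)
# Per the thread, this surfaces as an overflow RuntimeWarning, which is why
# the test above currently bails out for neg on unsigned dtypes.
```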
```diff
@@ -137,7 +137,7 @@ def query_compile(expr):
     key "args" is a sequence of name of the arguments.
     """

-    funcid = f"queryexpr_{np.uintp(hash(expr)):x}"
+    funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"

     # Load cache
     compiled = _cache.get(funcid)
     # Cache not found
```
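Presumably the `abs` was added because Python hashes are signed, and NumPy 1.24 deprecates converting negative integers to unsigned types (my reading of the motivation; the diff doesn't say):

```python
import numpy as np

h = hash("a < @val")  # illustrative expression string; the hash may be negative

np.uintp(abs(h))  # always in range for an unsigned type
np.uintp(h)       # NumPy >= 1.24: DeprecationWarning when h < 0, since
                  # out-of-bound Python ints will stop converting silently
```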
**Comment:** This is incorrect: `abs` maps the hash values `h` and `-h` onto the same key, so two different expressions can silently share a cache entry. That said, strings are hashable, so this seems like a weird way of constructing a cache key (it's somehow deliberately making it more likely that you get hash collisions and produce the wrong value). I would have thought that using the expression string itself as the key would do the trick.
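A sketch of the fix the reviewer seems to have in mind (the actual suggested code wasn't preserved; `_compile_expr` is a hypothetical stand-in for the real compilation step): since the expression string is hashable, use it as the dict key directly and drop `hash()`/`np.uintp` from the key entirely.

```python
_cache: dict = {}


def query_compile(expr: str):
    # The string itself is a collision-free cache key.
    compiled = _cache.get(expr)
    if compiled is None:
        compiled = _compile_expr(expr)  # hypothetical compile step
        _cache[expr] = compiled
    return compiled
```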
**Comment:** This hack will go away.

**Reply:** I think it can be removed from this PR itself. Now that we've verified numpy 1.24 support, I would recommend removing the numba-related changes in this PR that you're using in order to allow running numba 0.57 (which is necessary to use numpy 1.24). We'll still get it tested, because our CUDA 12 wheel builds will patch us to use 0.57 anyway (and with CUDA 12 we don't use cubinlinker/ptxcompiler, so we don't need any edits for those). Then, when we bump our numba to 0.57, tests should pass thanks to this PR.

**Comment:** I have one other question on this PR; I would wait to make changes here until everything else is resolved, in case you need to run more tests locally.