Changes to support Numpy >= 1.24 #13325
Conversation
python/cudf/cudf/tests/test_csv.py
Outdated
@@ -150,8 +150,8 @@ def make_all_numeric_extremes_dataframe():
     np_type = pdf_dtypes[gdf_dtype]
     if np.issubdtype(np_type, np.integer):
         itype = np.iinfo(np_type)
-        extremes = [0, +1, -1, itype.min, itype.max]
-        df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
+        extremes = [itype.min, itype.max]
+        df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
Need to change the comments at the beginning of these tests
Could you elaborate? Is this a task you want to accomplish in this PR?
I know that 0, +1, and -1 aren't extrema for integer types, but is there a reason you removed them from these tests? I suppose perhaps `np.uint8(-1)` now raises OverflowError or something?
`np.uint8(-1)` in particular raises a deprecation warning. But the reason I gave up on this test was that I couldn't figure out why this was happening:
In [2]: np.array([-1]).astype("uint64")
Out[2]: array([18446744073709551615], dtype=uint64)
In [3]: np.array([18446744073709551615]).astype("uint64")
Out[3]: array([18446744073709551615], dtype=uint64)
In [4]: np.array([-1, 18446744073709551615]).astype("uint64")
<ipython-input-4-03014ed268fc>:1: RuntimeWarning: invalid value encountered in cast
np.array([-1, 18446744073709551615]).astype("uint64")
Out[4]: array([18446744073709551615, 0], dtype=uint64)
I've gone ahead and filtered out that warning from this test.
But the reason I gave up on this test was because I couldn't figure out why this was happening:
`np.array([-1, 2**64 - 1]).dtype == "float64"`, which is lossy.
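The promotion behind that answer is easy to verify; this is a quick illustration of my own, not code from the PR:

```python
import numpy as np

# No integer dtype can hold both -1 and 2**64 - 1, so NumPy promotes the
# mixed literal to float64.
mixed = np.array([-1, 2**64 - 1])
print(mixed.dtype)  # float64

# float64 has a 53-bit mantissa, so 2**64 - 1 rounds to 2**64 and the
# round-trip is lossy.
print(int(mixed[1]))  # 18446744073709551616
```

This is why the subsequent cast to uint64 raises the "invalid value encountered in cast" RuntimeWarning: it is casting from float64, not from Python ints.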
python/cudf/cudf/__init__.py
Outdated
@@ -96,6 +97,7 @@
 _setup_numba_linker(_PTX_FILE)

+patch_numba_codegen_if_needed()
This hack will go away
I think it can be removed from this PR itself. Now that we've verified numpy 1.24 support, I would recommend removing the numba-related changes in this PR that you're using in order to allow running numba 0.57 (which is necessary to use numpy 1.24). We'll still get it tested because our CUDA 12 wheel builds will patch us to use 0.57 anyway (but with CUDA 12 we don't use cubinlinker/ptxcompiler so we don't need any edits for those). Then when we bump our numba to 0.57 tests should pass thanks to this PR.
I have one other question on this PR, would wait to make changes here until everything else is resolved in case you need to run more tests locally.
Can you remove the numpy upper bound pinnings in dependencies.yaml and the cudf meta.yaml? Then in the CUDA 12.0 CI let's make sure that at least one run installs numpy 1.24 and still passes tests.
Success! https://github.com/rapidsai/cudf/actions/runs/4953167580/jobs/8860521944?pr=13325#step:10:828 shows that numpy 1.24 is now getting installed into the 12.0.1 wheel build, and tests are passing
@@ -398,8 +398,8 @@ def test_column_view_string_slice(slc):
         cudf.core.column.as_column([], dtype="uint8"),
     ),
     (
-        cp.array([453], dtype="uint8"),
-        cudf.core.column.as_column([453], dtype="uint8"),
+        cp.array([255], dtype="uint8"),
+        cudf.core.column.as_column([255], dtype="uint8"),
I guess it doesn't matter what value we choose here? Just wondering if it's important to use 453-256.
I don't think it matters. I use 255 just because it's `np.iinfo(uint8).max`.
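For illustration (my own sketch, not code from the PR): 255 needs no wrapping, while the old literal 453 only ever fit in uint8 by wrapping modulo 256.

```python
import numpy as np

# 255 is np.iinfo(np.uint8).max, so it is representable exactly.
print(np.iinfo(np.uint8).max)  # 255

# The old value 453 reached uint8 storage only by wrapping modulo 256;
# an explicit (unsafe) cast still shows the wrapped value.
wrapped = np.array([453]).astype("uint8")
print(wrapped[0])  # 197, i.e. 453 - 256
```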
@@ -194,6 +194,7 @@ def test_to_numeric_downcast_int(data, downcast):
     assert_eq(expected, got)


+@pytest.mark.filterwarnings("ignore:invalid value encountered in cast")
Instead of applying this to the whole test, can we just wrap the `pd.to_numeric` call? This doesn't affect the `cudf.to_numeric` call, does it?

Also, should we be handling the warning conditionally? i.e. I assume this happens when trying to downcast a signed to an unsigned type or something?
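One way to scope the filter to a single call, as suggested above (a sketch using a stand-in float-to-unsigned cast rather than the actual `pd.to_numeric` call):

```python
import warnings
import numpy as np

# Suppress the RuntimeWarning only around the one offending operation,
# instead of marking the whole test with @pytest.mark.filterwarnings.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    out = np.array([-1.0]).astype("uint8")

print(out.dtype)  # uint8
```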
It seems like there are some strange things going on in the changes in the tests?
python/cudf/cudf/tests/test_csv.py
Outdated
-        extremes = [0, +1, -1, itype.min, itype.max]
-        df[gdf_dtype] = np.array(extremes * 4, dtype=np_type)[:20]
+        extremes = [itype.min, itype.max]
+        df[gdf_dtype] = np.array(extremes * 10, dtype=np_type)[:20]
Any reason to change from 4 pairs of extrema to 10?
     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
All of the columns in this dataframe now have type int64, no? Since they are never downcast with `astype`.
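The reviewer's point can be checked directly; this sketch uses stand-in fixture values (`types`, `nrows`) rather than the test's actual ones:

```python
import numpy as np
import pandas as pd

types = ["int8", "int16", "float32"]  # stand-in for the fixture's type list
nrows = 4

# Building the frame from np.random.randint output alone leaves every
# column at NumPy's default integer width, whatever dtype its name
# advertises.
pdf = pd.DataFrame(
    {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
)
print(set(pdf.dtypes))

# A per-column astype restores the mixed-dtype fixture.
pdf = pdf.astype({f"col_{typ}": typ for typ in types})
print(list(pdf.dtypes))
```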
python/cudf/cudf/tests/test_json.py
Outdated
     nrows = request.param

     # Create a pandas dataframe with random data of mixed types
     test_pdf = pd.DataFrame(
-        [list(range(ncols * i, ncols * (i + 1))) for i in range(nrows)],
-        columns=pd.Index([f"col_{typ}" for typ in types], name="foo"),
+        {f"col_{typ}": np.random.randint(0, nrows, nrows) for typ in types}
This one in contrast is cast to the appropriate type.
python/cudf/cudf/tests/test_rank.py
Outdated
@@ -125,7 +125,7 @@ def test_rank_error_arguments(pdf):
     np.full((3,), np.inf),
     np.full((3,), -np.inf),
 ]
-sort_dtype_args = [np.int32, np.int64, np.float32, np.float64]
+sort_dtype_args = [np.float32, np.float64]
This means we now don't run some tests with integer dtypes. Is it that they don't make sense any more?
     slr_device = cudf.Scalar(slr, dtype=dtype)

+    if op.__name__ == "neg" and np.dtype(dtype).kind == "u":
+        # TODO: what do we want to do here?
+        return
NumPy is fine with this, right? Right? Negation of unsigned integers is totally well-defined.
Well:
In [2]: -np.uint16(1)
<ipython-input-2-5156426c8f88>:1: RuntimeWarning: overflow encountered in scalar negative
-np.uint16(1)
Out[2]: 65535
Should we just go ahead and ignore that warning in this test? (I've resorted to doing that in most other cases)
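A minimal sketch of what that suppression looks like (my illustration, not the test code):

```python
import warnings
import numpy as np

# Negating an unsigned scalar wraps modulo 2**16, which is well-defined,
# but NumPy >= 1.24 emits a RuntimeWarning ("overflow encountered in
# scalar negative"); ignoring it keeps the wrapped value.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    val = -np.uint16(1)

print(int(val))  # 65535
```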
python/cudf/cudf/utils/queryutils.py
Outdated
@@ -137,7 +137,7 @@ def query_compile(expr):
     key "args" is a sequence of name of the arguments.
     """
-    funcid = f"queryexpr_{np.uintp(hash(expr)):x}"
+    funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
This is incorrect: `hash` returns in the semi-open interval `[-2**63, 2**63)`, but `abs` folds this to the closed interval `[0, 2**63]` (so you alias just shy of 50% of the values). Instead, you want to shift, I suspect, and then you don't need numpy in the loop at all:
-    funcid = f"queryexpr_{np.uintp(abs(hash(expr))):x}"
+    funcid = f"queryexpr_{hash(expr) + 2**63:x}"
That said, strings are hashable, so this seems like a weird way of constructing a cache key (it's somehow deliberately making it more likely that you get hash collisions and produce the wrong value).
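The aliasing claim is easy to check (a quick illustration of my own, not from the PR):

```python
# abs() maps x and -x to the same value, so roughly half of hash()'s
# range [-2**63, 2**63) collides after folding.
a, b = -7, 7
assert abs(a) == abs(b)

# Shifting by 2**63 is a bijection from [-2**63, 2**63) onto [0, 2**64),
# so distinct hashes stay distinct and the result is always non-negative.
assert a + 2**63 != b + 2**63
print(f"{a + 2**63:x}")  # 7ffffffffffffff9
```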
I would have thought that this would do the trick:
@functools.cache
def query_compile(expr):
    name = "queryexpr"  # these are only looked up locally so names can collide
    info = query_parser(expr)
    fn = query_builder(info, name)
    args = info["args"]
    devicefn = cuda.jit(device=True)(fn)
    kernel = _wrap_query_expr(f"kernel_{name}", devicefn, args)
    info["kernel"] = kernel
    return info
Approving ops-codeowner file changes.

@vyasr just resolved the conflicts here. Yeah, I think it could use a quick skim.
LGTM, thanks.

/merge
Description
Closes #13301
Checklist