[BUG] memcheck errors in dask-cudf tests #15204
Comments
My vague suspicion is that this is related to the following change:

```diff
diff --git a/python/rmm/_cuda/gpu.py b/python/rmm/_cuda/gpu.py
index 2a23b41e..8279e52c 100644
--- a/python/rmm/_cuda/gpu.py
+++ b/python/rmm/_cuda/gpu.py
@@ -1,6 +1,7 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
 
 from cuda import cuda, cudart
+from numba import cuda as ncuda
 
 
 class CUDARuntimeError(RuntimeError):
@@ -53,6 +54,7 @@ def getDevice():
     """
     Get the current CUDA device
     """
+    return ncuda.get_current_device().id
     status, device = cudart.cudaGetDevice()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -67,6 +69,8 @@ def setDevice(device: int):
     device : int
         The ID of the device to set as current
     """
+    ncuda.select_device(device)
+    return
     (status,) = cudart.cudaSetDevice(device)
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -97,6 +101,7 @@ def getDeviceCount():
     This function automatically raises CUDARuntimeError with error message
     and status code.
     """
+    return len(ncuda.devices.gpus)
     status, count = cudart.cudaGetDeviceCount()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
```
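For context (this is not part of the patch above): the wrappers in `gpu.py` all follow the same cuda-python convention, where each `cudart` call returns a `(status, *values)` tuple that must be checked before the values are used. A minimal self-contained sketch of that pattern, using a stub enum rather than a real CUDA runtime so it runs anywhere:

```python
# Sketch of the (status, *values) check-and-raise pattern used by the
# cuda-python cudart bindings. cudaError_t here is a stand-in stub, not
# the real cudart enum.
from enum import IntEnum


class cudaError_t(IntEnum):
    cudaSuccess = 0
    cudaErrorInvalidValue = 1


class CUDARuntimeError(RuntimeError):
    def __init__(self, status):
        self.status = status
        super().__init__(f"CUDA runtime error: {status!r}")


def check(result):
    """Unpack a cuda-python style (status, *values) tuple, raising on error."""
    status, *values = result
    if status != cudaError_t.cudaSuccess:
        raise CUDARuntimeError(status)
    return values[0] if len(values) == 1 else tuple(values)


# Simulated results standing in for e.g. cudart.cudaGetDevice().
device = check((cudaError_t.cudaSuccess, 0))
assert device == 0

raised = False
try:
    check((cudaError_t.cudaErrorInvalidValue,))
except CUDARuntimeError as exc:
    raised = exc.status == cudaError_t.cudaErrorInvalidValue
assert raised
```

The diff above short-circuits this pattern by returning numba's answer before the `cudart` call and its status check ever run.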
Hmm, maybe my cuda-python is broken? I run the repro under compute-sanitizer and get:
No, this is because compute-sanitizer can't identify that this is a "safe" call.
This was a false positive, but there are problems in the avro reader; opening a new issue.
For posterity, the new issue is #15216.
Describe the bug
Since (recently?) some dask-cudf tests fail intermittently. See https://github.com/rapidsai/cudf/actions/runs/8109433681/job/22166693131?pr=15143
Invalid repro due to cuda-python runtime reimplementation confusing me
Here is a stripped down repro.
In the debugger I get:

```
Cuda API error detected: cuCtxGetDevice returned (0xc9)
```
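Not stated in the thread, but for reference: 0xc9 is decimal 201, which in the CUDA driver's `CUresult` enum is `CUDA_ERROR_INVALID_CONTEXT` (no valid context current on the calling thread), consistent with a thread/context mix-up. Worth double-checking against `cuda.h`:

```python
# The driver reports the error in hex; 0xc9 is decimal 201, which in
# the CUresult enum is CUDA_ERROR_INVALID_CONTEXT.
error_code = 0xC9
assert error_code == 201
```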
Run with
I suspect some thread-based race condition, since if I switch to the "processes" or "synchronous" scheduler this problem goes away. If I reduce the number of partitions (so there are fewer overall tasks), the problem also goes away.