
[BUG] memcheck errors in dask-cudf tests #15204

Closed
wence- opened this issue Mar 1, 2024 · 6 comments
Labels
bug Something isn't working

Comments


wence- commented Mar 1, 2024

Describe the bug

Recently (it seems), some dask-cudf tests have started failing intermittently. See https://github.com/rapidsai/cudf/actions/runs/8109433681/job/22166693131?pr=15143

(Note: this repro turned out to be invalid; the cuda-python runtime reimplementation confused me. See the follow-up comments.)

Here is a stripped-down repro.

In the debugger I get `Cuda API error detected: cuCtxGetDevice returned (0xc9)`.

```python
import numpy as np
import dask_cudf
import cudf

if __name__ == "__main__":
    npartitions = 8

    by = ["b"]
    datarange = 35
    size = 100
    gdf = cudf.DataFrame(
        {
            "a": np.arange(size, dtype="int64"),
            "b": np.random.randint(datarange, size=size),
        }
    )

    npartitions_initial = 17
    ddf = dask_cudf.from_cudf(gdf, npartitions=npartitions_initial)

    ddf_new = ddf.shuffle(
        on=by, ignore_index=True, npartitions=npartitions, max_branch=4
    )

    ddf_new.compute(scheduler="threads")
```

Run with:

```
cuda-gdb --args python bug.py
(cuda-gdb) set cuda api_failures stop
(cuda-gdb) run
...
[Switching to Thread 0x7fff25ffb640 (LWP 613272)]
Cuda API error detected: cuCtxGetDevice returned (0xc9)
(cuda-gdb) bt
#0  0x00007fffa35a28e0 in cudbgReportDriverApiError () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007fffa3869e0b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fffa1b516e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#3  0x00007fffa1b300ce in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#4  0x00007fffa1b40b66 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#5  0x00007fffa1c9fb26 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#6  0x00007fffa36caf4d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#7  0x00007fffa91d8b5c in __pyx_f_4cuda_5_cuda_5ccuda__cuCtxGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_cuda/ccuda.cpython-310-x86_64-linux-gnu.so
#8  0x00007fffa9138d1e in __pyx_f_4cuda_4_lib_7ccudart_7ccudart__cudaGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_lib/ccudart/ccudart.cpython-310-x86_64-linux-gnu.so
#9  0x00007fffaa6bf60a in __pyx_f_4cuda_7ccudart_cudaGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/ccudart.cpython-310-x86_64-linux-gnu.so
#10 0x00007fffa662a181 in __pyx_pw_4cuda_6cudart_67cudaGetDevice(_object*, _object*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/cudart.cpython-310-x86_64-linux-gnu.so
#11 0x000055555568d022 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff95f52f40, callable=0x7fffa632cba0, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#12 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff95f52f40, callable=0x7fffa632cba0) at /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#13 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fff25ff8550, tstate=<optimized out>) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#14 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff95f52dc0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:4181
#15 0x00005555556988cc in _PyEval_EvalFrame (throwflag=0, f=0x7fff95f52dc0, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Include/internal/pycore_ceval.h:46
#16 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7fffce1fb140, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:5067
#17 _PyFunction_Vectorcall (func=0x7fffce1fb130, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.10.13/Objects/call.c:342
```

I suspect a thread-related race condition: the problem goes away if I switch to the "processes" or "synchronous" scheduler, and also if I reduce the number of partitions (so there are fewer tasks overall).
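To illustrate why a context-related error like this would only show up on worker threads: CUDA driver contexts are bound per host thread, much like a `threading.local` slot. The sketch below is a pure-Python analogy (no CUDA required); all names in it are illustrative, not cuda-python's actual API.

```python
# Analogy only: model the driver's per-thread context binding with
# threading.local, to show why a call that works on the main thread can fail
# with "invalid context" (201) on a worker thread that never bound a context.
import threading

_current_ctx = threading.local()  # models the driver's per-thread context stack

CUDA_ERROR_INVALID_CONTEXT = 201

def cu_ctx_get_device():
    """Model of cuCtxGetDevice: fails if this thread has no bound context."""
    ctx = getattr(_current_ctx, "ctx", None)
    if ctx is None:
        return CUDA_ERROR_INVALID_CONTEXT, None
    return 0, ctx["device"]

def bind_context(device):
    _current_ctx.ctx = {"device": device}

# The main thread binds a context, so the call succeeds there...
bind_context(0)
assert cu_ctx_get_device() == (0, 0)

# ...but a fresh worker thread has no context bound and sees error 201.
results = []
t = threading.Thread(target=lambda: results.append(cu_ctx_get_device()))
t.start()
t.join()
assert results[0] == (CUDA_ERROR_INVALID_CONTEXT, None)
```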

wence- added the bug label Mar 1, 2024

wence- commented Mar 1, 2024

0xc9 is:

```c
    /**
     * This most frequently indicates that there is no context bound to the
     * current thread. This can also be returned if the context passed to an
     * API call is not a valid handle (such as a context that has had
     * ::cuCtxDestroy() invoked on it). This can also be returned if a user
     * mixes different API versions (i.e. 3010 context with 3020 API calls).
     * See ::cuCtxGetApiVersion() for more details.
     */
    CUDA_ERROR_INVALID_CONTEXT                = 201,
```
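A one-liner confirms the mapping from the hex code cuda-gdb printed to the enum value:

```python
# 0xc9 from the cuda-gdb output is decimal 201, i.e. CUDA_ERROR_INVALID_CONTEXT.
assert 0xc9 == 201
print(f"0xc9 == {0xc9}")  # → 0xc9 == 201
```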


wence- commented Mar 1, 2024

My vague suspicion is that cuda-python is not properly thread-safe. If I apply this patch in RMM, using numba to obtain device IDs instead of the cudart provided by cuda-python, then I get no error:

```diff
diff --git a/python/rmm/_cuda/gpu.py b/python/rmm/_cuda/gpu.py
index 2a23b41e..8279e52c 100644
--- a/python/rmm/_cuda/gpu.py
+++ b/python/rmm/_cuda/gpu.py
@@ -1,6 +1,7 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
 
 from cuda import cuda, cudart
+from numba import cuda as ncuda
 
 
 class CUDARuntimeError(RuntimeError):
@@ -53,6 +54,7 @@ def getDevice():
     """
     Get the current CUDA device
     """
+    return ncuda.get_current_device().id
     status, device = cudart.cudaGetDevice()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -67,6 +69,8 @@ def setDevice(device: int):
     device : int
         The ID of the device to set as current
     """
+    ncuda.select_device(device)
+    return
     (status,) = cudart.cudaSetDevice(device)
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -97,6 +101,7 @@ def getDeviceCount():
     This function automatically raises CUDARuntimeError with error message
     and status code.
     """
+    return len(ncuda.devices.gpus)
     status, count = cudart.cudaGetDeviceCount()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
```
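The calls this patch bypasses all follow the same `(status, *values)` tuple convention, with the status check repeated in each wrapper. A small hypothetical helper (not part of RMM or cuda-python) makes that pattern explicit:

```python
# Hedged sketch: cuda-python's cudart bindings return (status, *values) tuples
# rather than raising. This hypothetical wrapper centralises the check that the
# patched functions above each repeat; cudaSuccess is modelled here as 0.
CUDA_SUCCESS = 0

class CUDARuntimeError(RuntimeError):
    def __init__(self, status):
        super().__init__(f"CUDA runtime call failed with status {status}")
        self.status = status

def check_cudart(result):
    """Unpack a (status, *values) tuple, raising on a non-success status."""
    status, *values = result
    if status != CUDA_SUCCESS:
        raise CUDARuntimeError(status)
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)

# Usage with stand-in results from calls like cudart.cudaGetDevice():
assert check_cudart((CUDA_SUCCESS, 3)) == 3
try:
    check_cudart((201,))  # CUDA_ERROR_INVALID_CONTEXT
except CUDARuntimeError as e:
    assert e.status == 201
```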


wence- commented Mar 1, 2024

Hmm, maybe my cuda-python is broken?

I run:

```python
from cuda import cudart
print(cudart.cudaGetDevice())
```

under compute-sanitizer and get:

```
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2caf4d]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__pyx_f_4cuda_5_cuda_5ccuda__cuCtxGetDevice(int*) [0x4cb5c]
=========                in /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_cuda/ccuda.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_f_4cuda_4_lib_7ccudart_7ccudart__cudaGetDevice(int*) [0x9d1e]
=========                in /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_lib/ccudart/ccudart.cpython-310-x86_64-linux-gnu.so
```


wence- commented Mar 1, 2024

No: this is because compute-sanitizer can't tell that this is a "safe" call, since cuda-python implements its own runtime on top of the driver API.
So I think my initial repro is bad, but there is still a bug somewhere.


wence- commented Mar 4, 2024

This was a false positive, but there are problems in the Avro reader; I am opening a new issue.

wence- closed this as completed Mar 4, 2024

bdice commented Mar 4, 2024

For posterity, the new issue is #15216.
