
[BUG] memcheck errors in dask-cudf tests #15204

Closed
wence- opened this issue Mar 1, 2024 · 6 comments
Labels
bug Something isn't working

Comments


wence- commented Mar 1, 2024

Describe the bug

Recently (it seems), some dask-cudf tests have started failing intermittently. See https://github.com/rapidsai/cudf/actions/runs/8109433681/job/22166693131?pr=15143

(Note: this repro turned out to be invalid; the cuda-python runtime reimplementation confused me. See the follow-up comments.)

Here is a stripped-down repro.

In the debugger I get `Cuda API error detected: cuCtxGetDevice returned (0xc9)`.

```python
import numpy as np
import dask_cudf
import cudf

if __name__ == "__main__":
    npartitions = 8

    by = ["b"]
    datarange = 35
    size = 100
    gdf = cudf.DataFrame(
        {
            "a": np.arange(size, dtype="int64"),
            "b": np.random.randint(datarange, size=size),
        }
    )

    npartitions_initial = 17
    ddf = dask_cudf.from_cudf(gdf, npartitions=npartitions_initial)

    ddf_new = ddf.shuffle(
        on=by, ignore_index=True, npartitions=npartitions, max_branch=4
    )

    ddf_new.compute(scheduler="threads")
```

Run with:

```
cuda-gdb --args python bug.py
(cuda-gdb) set cuda api_failures stop
(cuda-gdb) run
...
[Switching to Thread 0x7fff25ffb640 (LWP 613272)]
Cuda API error detected: cuCtxGetDevice returned (0xc9)
(cuda-gdb) bt
#0  0x00007fffa35a28e0 in cudbgReportDriverApiError () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007fffa3869e0b in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007fffa1b516e7 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#3  0x00007fffa1b300ce in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#4  0x00007fffa1b40b66 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#5  0x00007fffa1c9fb26 in ?? () from /usr/lib/x86_64-linux-gnu/libcudadebugger.so.1
#6  0x00007fffa36caf4d in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#7  0x00007fffa91d8b5c in __pyx_f_4cuda_5_cuda_5ccuda__cuCtxGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_cuda/ccuda.cpython-310-x86_64-linux-gnu.so
#8  0x00007fffa9138d1e in __pyx_f_4cuda_4_lib_7ccudart_7ccudart__cudaGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_lib/ccudart/ccudart.cpython-310-x86_64-linux-gnu.so
#9  0x00007fffaa6bf60a in __pyx_f_4cuda_7ccudart_cudaGetDevice(int*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/ccudart.cpython-310-x86_64-linux-gnu.so
#10 0x00007fffa662a181 in __pyx_pw_4cuda_6cudart_67cudaGetDevice(_object*, _object*) () from /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/cudart.cpython-310-x86_64-linux-gnu.so
#11 0x000055555568d022 in _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7fff95f52f40, callable=0x7fffa632cba0, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:114
#12 PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7fff95f52f40, callable=0x7fffa632cba0) at /usr/local/src/conda/python-3.10.13/Include/cpython/abstract.h:123
#13 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, trace_info=0x7fff25ff8550, tstate=<optimized out>) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:5893
#14 _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0x7fff95f52dc0, throwflag=<optimized out>) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:4181
#15 0x00005555556988cc in _PyEval_EvalFrame (throwflag=0, f=0x7fff95f52dc0, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Include/internal/pycore_ceval.h:46
#16 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, con=0x7fffce1fb140, tstate=0x5555592ace90) at /usr/local/src/conda/python-3.10.13/Python/ceval.c:5067
#17 _PyFunction_Vectorcall (func=0x7fffce1fb130, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at /usr/local/src/conda/python-3.10.13/Objects/call.c:342
```

I suspect a thread-related race condition: the problem goes away if I switch to the "processes" or "synchronous" scheduler, and also if I reduce the number of partitions (so there are fewer tasks overall).
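To illustrate why a context-related error like this would only show up on worker threads: CUDA driver contexts are bound per host thread, much like a `threading.local` slot. The sketch below is a pure-Python analogy (no CUDA required); all names in it are illustrative, not cuda-python's actual API.

```python
# Analogy only: model the driver's per-thread context binding with
# threading.local, to show why a call that works on the main thread can fail
# with "invalid context" (201) on a worker thread that never bound a context.
import threading

_current_ctx = threading.local()  # models the driver's per-thread context stack

CUDA_ERROR_INVALID_CONTEXT = 201

def cu_ctx_get_device():
    """Model of cuCtxGetDevice: fails if this thread has no bound context."""
    ctx = getattr(_current_ctx, "ctx", None)
    if ctx is None:
        return CUDA_ERROR_INVALID_CONTEXT, None
    return 0, ctx["device"]

def bind_context(device):
    _current_ctx.ctx = {"device": device}

# The main thread binds a context, so the call succeeds there...
bind_context(0)
assert cu_ctx_get_device() == (0, 0)

# ...but a fresh worker thread has no context bound and sees error 201.
results = []
t = threading.Thread(target=lambda: results.append(cu_ctx_get_device()))
t.start()
t.join()
assert results[0] == (CUDA_ERROR_INVALID_CONTEXT, None)
```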

wence- added the bug label Mar 1, 2024

wence- commented Mar 1, 2024

0xc9 is:

```c
    /**
     * This most frequently indicates that there is no context bound to the
     * current thread. This can also be returned if the context passed to an
     * API call is not a valid handle (such as a context that has had
     * ::cuCtxDestroy() invoked on it). This can also be returned if a user
     * mixes different API versions (i.e. 3010 context with 3020 API calls).
     * See ::cuCtxGetApiVersion() for more details.
     */
    CUDA_ERROR_INVALID_CONTEXT                = 201,
```
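A one-liner confirms the mapping from the hex code cuda-gdb printed to the enum value:

```python
# 0xc9 from the cuda-gdb output is decimal 201, i.e. CUDA_ERROR_INVALID_CONTEXT.
assert 0xc9 == 201
print(f"0xc9 == {0xc9}")  # → 0xc9 == 201
```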


wence- commented Mar 1, 2024

My vague suspicion is that cuda-python is not properly thread-safe. If I apply this patch in RMM, using numba to obtain device IDs instead of the cudart provided by cuda-python, then I get no error:

```diff
diff --git a/python/rmm/_cuda/gpu.py b/python/rmm/_cuda/gpu.py
index 2a23b41e..8279e52c 100644
--- a/python/rmm/_cuda/gpu.py
+++ b/python/rmm/_cuda/gpu.py
@@ -1,6 +1,7 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
 
 from cuda import cuda, cudart
+from numba import cuda as ncuda
 
 
 class CUDARuntimeError(RuntimeError):
@@ -53,6 +54,7 @@ def getDevice():
     """
     Get the current CUDA device
     """
+    return ncuda.get_current_device().id
     status, device = cudart.cudaGetDevice()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -67,6 +69,8 @@ def setDevice(device: int):
     device : int
         The ID of the device to set as current
     """
+    ncuda.select_device(device)
+    return
     (status,) = cudart.cudaSetDevice(device)
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
@@ -97,6 +101,7 @@ def getDeviceCount():
     This function automatically raises CUDARuntimeError with error message
     and status code.
     """
+    return len(ncuda.devices.gpus)
     status, count = cudart.cudaGetDeviceCount()
     if status != cudart.cudaError_t.cudaSuccess:
         raise CUDARuntimeError(status)
```
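The calls this patch bypasses all follow the same `(status, *values)` tuple convention, with the status check repeated in each wrapper. A small hypothetical helper (not part of RMM or cuda-python) makes that pattern explicit:

```python
# Hedged sketch: cuda-python's cudart bindings return (status, *values) tuples
# rather than raising. This hypothetical wrapper centralises the check that the
# patched functions above each repeat; cudaSuccess is modelled here as 0.
CUDA_SUCCESS = 0

class CUDARuntimeError(RuntimeError):
    def __init__(self, status):
        super().__init__(f"CUDA runtime call failed with status {status}")
        self.status = status

def check_cudart(result):
    """Unpack a (status, *values) tuple, raising on a non-success status."""
    status, *values = result
    if status != CUDA_SUCCESS:
        raise CUDARuntimeError(status)
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)

# Usage with stand-in results from calls like cudart.cudaGetDevice():
assert check_cudart((CUDA_SUCCESS, 3)) == 3
try:
    check_cudart((201,))  # CUDA_ERROR_INVALID_CONTEXT
except CUDARuntimeError as e:
    assert e.status == 201
```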


wence- commented Mar 1, 2024

Hmm, maybe my cuda-python is broken?

I run:

```python
from cuda import cudart
print(cudart.cudaGetDevice())
```

under compute-sanitizer and get:

```
========= COMPUTE-SANITIZER
========= Program hit CUDA_ERROR_INVALID_CONTEXT (error 201) due to "invalid device context" on CUDA API call to cuCtxGetDevice.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x2caf4d]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__pyx_f_4cuda_5_cuda_5ccuda__cuCtxGetDevice(int*) [0x4cb5c]
=========                in /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_cuda/ccuda.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_f_4cuda_4_lib_7ccudart_7ccudart__cudaGetDevice(int*) [0x9d1e]
=========                in /home/coder/.conda/envs/rapids/lib/python3.10/site-packages/cuda/_lib/ccudart/ccudart.cpython-310-x86_64-linux-gnu.so
```


wence- commented Mar 1, 2024

No: this is because compute-sanitizer can't tell that this is a "safe" call, since cuda-python implements its own runtime on top of the driver API.
So I think my initial repro is bad, but there is still a bug somewhere.


wence- commented Mar 4, 2024

This was a false positive, but there are problems in the Avro reader; I am opening a new issue.

wence- closed this as completed Mar 4, 2024

bdice commented Mar 4, 2024

For posterity, the new issue is #15216.
