
[Bug] Using numba to detect gpu availability breaks Dask-CUDA worker pinning #144

Closed
rjzamora opened this issue Oct 4, 2022 · 0 comments · Fixed by #145
Labels: bug (Something isn't working), P0
rjzamora commented Oct 4, 2022

While attempting to benchmark NVIDIA-Merlin/NVTabular#1687, I discovered that the dask-criteo benchmark does not work with the latest version of NVTabular/Merlin-core.

As far as I can tell, the problem is that #98 added the following logic to detect GPU availability: HAS_GPU = len(cuda.gpus.lst) > 0. This logic works just fine within a local process, but breaks Dask-CUDA device pinning when it is included in a top-level import (or is performed in the global context of the program). In other words, code like this shouldn't be executed by an import statement, like from merlin.core.compat import HAS_GPU.

The problem becomes apparent in a simple (Merlin-free) reproducer:

# reproducer.py
from dask_cuda import LocalCUDACluster
from numba import cuda # This is fine

HAS_GPU = len(cuda.gpus.lst) > 0  # This is not fine

if __name__ == "__main__":
    cluster = LocalCUDACluster()

If you execute python ./reproducer.py, you will see warnings like:

/.../distributed/distributed/comm/ucx.py:67: UserWarning: Worker with process ID 49507 should have a CUDA context assigned to device 1, but instead the CUDA context is on device 0. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
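One way to avoid this class of problem is to defer GPU detection until it is actually needed, so that merely importing the module never touches the CUDA runtime. The sketch below is illustrative only (the actual fix landed in #145); the `has_gpu` function name and the use of `functools.lru_cache` are assumptions, not the library's API:

```python
# Minimal sketch: lazy GPU detection instead of an import-time check.
# `has_gpu` is a hypothetical helper, not merlin-core's actual API.
from functools import lru_cache


@lru_cache(maxsize=None)
def has_gpu() -> bool:
    """Return True if a CUDA GPU is visible, checking only on first call.

    Because the numba import and the `cuda.gpus.lst` query happen inside
    the function body, importing this module has no CUDA side effects,
    and Dask-CUDA remains free to pin each worker to its own device
    before any CUDA context is created.
    """
    try:
        from numba import cuda  # deferred import: no CUDA init at module import
        return len(cuda.gpus.lst) > 0
    except Exception:  # numba not installed, or no usable CUDA driver
        return False
```

Callers replace the module-level constant `HAS_GPU` with `has_gpu()`; the `lru_cache` decorator ensures the (potentially expensive) device query runs at most once per process.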