
Warnings about existing CUDA contexts on dask-cuda cluster startup #721

Closed
randerzander opened this issue Sep 10, 2021 · 3 comments · Fixed by #722

Comments

@randerzander
Contributor

randerzander commented Sep 10, 2021

On starting up a dask-cuda cluster on a DGX2, I get this warning for every worker:

/home/rgelhausen/conda/envs/test/lib/python3.8/site-packages/dask_cuda/initialize.py:22: UserWarning: A CUDA context for device 14 already exists on process ID 3287969. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
  warnings.warn(

Checking nvidia-smi, I see only one process per GPU, so nothing looks out of the ordinary, despite the warning's suggestion that I may have multiple workers on a single GPU.
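The warning asks that CUDA runtime calls not happen at import time or in a program's global scope. A GPU-free sketch of that deferred-initialization pattern is below; `create_context` is a hypothetical stand-in for any CUDA runtime call (e.g. allocating a GPU array), not a real Dask-CUDA or CUDA API:

```python
# GPU-free sketch of the pattern the warning asks for. "create_context"
# stands in for any CUDA runtime call; all names here are hypothetical.

_context = None  # simulated per-process CUDA context


def create_context():
    """Lazily create the (simulated) context on first use."""
    global _context
    if _context is None:
        _context = object()
    return _context


# Problematic: calling create_context() here, at import time / global
# scope, would create the context in the parent process before Dask-CUDA
# can spawn its worker processes.


def task():
    # Safe: the context is created inside the task, i.e. only after the
    # worker process running it has been spawned.
    return create_context() is not None
```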

@pentschev
Member

This is due to #719. The warning happens only with UCX, because in that case we do create the CUDA context in Distributed (Distributed's comms initialize before the preload plugins that Dask-CUDA uses). So I'll need to rewire things a bit in Distributed.

@pentschev
Member

Opened dask/distributed#5308 and #722; together they should be enough to address the issue here.

@rapids-bot rapids-bot bot closed this as completed in #722 Sep 14, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 14, 2021
Because communications in `Nanny` are initialized before Dask preload plugins, and UCX creates the CUDA context directly within its own initializer in Distributed, Dask-CUDA would always conclude that the CUDA context had been incorrectly initialized when using UCX, which isn't true. With the globals added here, Dask-CUDA can verify that the CUDA contexts are indeed valid.

Depends on dask/distributed#5308 .

Fixes #721 .

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - https://github.com/jakirkham

URL: #722
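The commit message above describes recording context creation in module globals so a later check can tell an expected context apart from a stray one. A minimal sketch of that idea, with hypothetical names that do not reflect Dask-CUDA's actual internals:

```python
# Hedged sketch of the mechanism the fix describes: the comms layer
# records which device it created a CUDA context on in a module global,
# and the initializer consults it later, warning only about contexts it
# did not create itself. All names here are hypothetical.

cuda_context_created_device = None  # set by the comms layer


def record_context_creation(device):
    """Called by the comms layer after it intentionally creates a context."""
    global cuda_context_created_device
    cuda_context_created_device = device


def should_warn(existing_context_device):
    """Warn only if the existing context is on a device we did not set up."""
    return existing_context_device != cuda_context_created_device


record_context_creation(14)
print(should_warn(14))  # context we created ourselves: no warning
print(should_warn(3))   # unexpected context: warn
```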
@pentschev
Member

This is in. Thanks for reporting, @randerzander, and please let me know if you still experience any problems.
