Fix NVML index usage in CUDAWorker/LocalCUDACluster #671
Conversation
This is similar to issues we had not long ago in Distributed, as @charlesbluca would remember. This came up when discussing MIG support with @akaanirban; thanks Anirban for bringing that to my attention.
@beckernick @VibhuJawa this may have affected past performance on DGX A100s, where we previously had to manually launch
Codecov Report

| Coverage Diff | branch-21.08 | #671 | +/- |
|---|---:|---:|---:|
| + Coverage | 60.19% | 90.31% | +30.12% |
| Files | 21 | 15 | -6 |
| Lines | 2605 | 1652 | -953 |
| - Hits | 1568 | 1492 | -76 |
| + Misses | 1037 | 160 | -877 |

Continue to review full report at Codecov.
rerun tests
… devices if available
@gpucibot merge
Adds support to start `LocalCUDACluster` and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.

More specifically, this PR does the following:
1. Allows starting `LocalCUDACluster` as `cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1","MIG-uuid2",...])`, or by passing the UUIDs as a `,`-separated string (see the sketch after this list).

Needs discussion:
0. Apart from manually testing on a MIG instance in the cloud, how would we test this?
1. What if the user does not pass any argument to `LocalCUDACluster` while using MIG instances? By default `LocalCUDACluster` will try to use all the parent GPUs and run into an error.
2. What if we have a deployment with both MIG-enabled and non-MIG-enabled GPUs?
3. `dask.distributed` diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment it uses `pynvml` APIs for non-MIG-enabled GPUs only.

Authors:
- Anirban Das (https://github.com/akaanirban)

Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #674
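A minimal sketch of the intended usage, assuming the MIG support from #674 is available; the `MIG-...` UUIDs below are hypothetical placeholders for the values a real system would report via `nvidia-smi -L`:

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Placeholder MIG instance UUIDs; on a real machine these come from `nvidia-smi -L`.
mig_uuids = [
    "MIG-41b3359c-e721-56e5-8009-12e5797ed514",
    "MIG-65b79fff-6d3c-5490-a288-b31ec705f310",
]

# Pass the MIG UUIDs as a list...
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)

# ...or equivalently as a comma-separated string:
# cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=",".join(mig_uuids))

client = Client(cluster)
```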
Fix an issue where device index `0` would be used in NVML functions. For CUDA runtime calls, we expect that the GPU being targeted is always on index `0`, as it is relative to the `CUDA_VISIBLE_DEVICES` ordering. However, NVML relies on absolute indices, thus we always have to use the actual GPU index being targeted, rather than the first one in `CUDA_VISIBLE_DEVICES`.

This is normally not an issue if `CUDA_VISIBLE_DEVICES` is unset, or is just set as `",".join(list(str(i) for i in range(get_n_gpus())))`, but it may be an issue when targeting a different list of GPUs. For example, on a DGX-1 the CPU affinity for GPUs `0-3` is `0-19,40-59`, and for GPUs `4-7` it is `20-39,60-79`; but when the user set `CUDA_VISIBLE_DEVICES=4`, the CPU affinity for the targeted GPU would be the same as for device index `0`, i.e., `0-19,40-59`. This could result in lower performance, as well as wrong computation of total GPU memory on non-homogeneous GPU systems.