
Fix NVML index usage in CUDAWorker/LocalCUDACluster #671

Merged · 6 commits merged into rapidsai:branch-21.08 on Jul 20, 2021

Conversation

pentschev (Member)

Fix an issue where device index 0 would always be used in NVML functions. For CUDA runtime calls, we expect the targeted GPU to always be at index 0, since that index is relative to the CUDA_VISIBLE_DEVICES ordering. NVML, however, relies on absolute indices, so we must always use the actual index of the GPU being targeted rather than the first entry in CUDA_VISIBLE_DEVICES.
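To make the distinction concrete, here is a minimal pynvml sketch (illustrative only, not the PR's code; it assumes CUDA_VISIBLE_DEVICES contains integer device indices rather than UUIDs):

```python
import os

import pynvml

pynvml.nvmlInit()

# With CUDA_VISIBLE_DEVICES="4,5,6,7" the CUDA runtime renumbers the visible
# GPUs as 0, 1, 2, 3, so runtime calls always address the targeted GPU as index 0.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "4,5,6,7").split(",")

# NVML, however, uses absolute (physical) indices, so queries must use the
# real index of the targeted GPU (4 here), not the relative index 0.
absolute_index = int(visible[0])
handle = pynvml.nvmlDeviceGetHandleByIndex(absolute_index)
print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
```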

This is normally not an issue if CUDA_VISIBLE_DEVICES is unset, or if it is simply set to `",".join(str(i) for i in range(get_n_gpus()))`, but it may be an issue when targeting a different list of GPUs. For example, on a DGX-1 the CPU affinity for GPUs 0-3 is 0-19,40-59 and for GPUs 4-7 it is 20-39,60-79; yet if the user set CUDA_VISIBLE_DEVICES=4, the CPU affinity reported for the targeted GPU would be that of device index 0, i.e., 0-19,40-59. This could result in lower performance, as well as incorrect computation of total GPU memory on non-homogeneous GPU systems.
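As an illustration of the CPU affinity effect, here is a rough sketch of reading the affinity via pynvml with the absolute device index (the `cpu_affinity` helper is hypothetical, not dask-cuda's implementation):

```python
import math
import os

import pynvml


def cpu_affinity(absolute_gpu_index):
    """List the CPU cores with affinity to the given physical GPU (sketch)."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(absolute_gpu_index)
    # NVML reports the affinity as a bitmask packed into 64-bit words.
    n_words = math.ceil(os.cpu_count() / 64)
    mask = pynvml.nvmlDeviceGetCpuAffinity(handle, n_words)
    return [
        word_idx * 64 + bit
        for word_idx, word in enumerate(mask)
        for bit in range(64)
        if word & (1 << bit)
    ]


pynvml.nvmlInit()
# On a DGX-1, querying the absolute index 4 should report cores 20-39,60-79;
# querying index 0 instead (the bug described above) would report 0-19,40-59.
print(cpu_affinity(4))
```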

@pentschev pentschev requested a review from a team as a code owner July 13, 2021 20:06
@github-actions github-actions bot added the python python code needed label Jul 13, 2021
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change and removed python python code needed labels Jul 13, 2021
pentschev (Member, Author)

This is similar to issues we had not long ago in Distributed, as @charlesbluca would remember. This came up when discussing MIG support with @akaanirban; thanks Anirban for bringing that to my attention.

pentschev (Member, Author)

@beckernick @VibhuJawa this may have affected past performance on DGX A100s, where we previously had to launch dask-cuda-worker manually for each GPU individually.

@github-actions github-actions bot added the python python code needed label Jul 13, 2021
codecov-commenter commented Jul 13, 2021

Codecov Report

Merging #671 (c41ad75) into branch-21.08 (79bc44e) will increase coverage by 30.12%.
The diff coverage is 88.88%.

❗ Current head c41ad75 differs from pull request most recent head f84fcf4. Consider uploading reports for the commit f84fcf4 to get more accurate results.
Impacted file tree graph

@@                Coverage Diff                @@
##           branch-21.08     #671       +/-   ##
=================================================
+ Coverage         60.19%   90.31%   +30.12%     
=================================================
  Files                21       15        -6     
  Lines              2605     1652      -953     
=================================================
- Hits               1568     1492       -76     
+ Misses             1037      160      -877     
Impacted Files                                    Coverage Δ
dask_cuda/local_cuda_cluster.py                   79.41% <ø> (ø)
dask_cuda/utils.py                                88.99% <83.33%> (+1.25%) ⬆️
dask_cuda/cli/dask_cuda_worker.py                 97.14% <100.00%> (+1.49%) ⬆️
dask_cuda/cuda_worker.py                          78.31% <100.00%> (ø)
dask_cuda/benchmarks/local_cudf_shuffle.py
dask_cuda/_version.py
dask_cuda/benchmarks/local_cudf_merge.py
dask_cuda/benchmarks/utils.py
dask_cuda/benchmarks/local_cupy_map_overlap.py
dask_cuda/benchmarks/local_cupy.py
... and 9 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79bc44e...f84fcf4.

pentschev (Member, Author)

rerun tests

akaanirban added a commit to akaanirban/dask-cuda that referenced this pull request Jul 16, 2021
quasiben (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c04856e into rapidsai:branch-21.08 Jul 20, 2021
@pentschev pentschev deleted the fix-nvml-device-index branch July 20, 2021 18:45
rapids-bot bot pushed a commit that referenced this pull request Aug 2, 2021
Adds support for starting `LocalCUDACluster` and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.
More specifically, this PR does the following:
1. Allows starting `LocalCUDACluster` as follows: `cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1","MIG-uuid2",...])`, or by passing the UUIDs as a comma-separated string (see the sketch below).
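A minimal usage sketch based on the description above (the MIG UUIDs are placeholders; substitute real MIG instance UUIDs):

```python
from dask.distributed import Client

from dask_cuda import LocalCUDACluster

# Placeholder UUIDs; real values can be listed with `nvidia-smi -L` on a
# MIG-enabled GPU.
mig_uuids = ["MIG-uuid1", "MIG-uuid2"]

# Either a list of UUID strings...
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)
# ...or a single comma-separated string:
# cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=",".join(mig_uuids))

client = Client(cluster)
```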

Needs Discussion:
0. Apart from manually testing on a MIG instance in the cloud, how would we test this?
1. What if the user does not pass in any argument to `LocalCUDACluster` while using MIG instances? By default `LocalCUDACluster` will try to use all the parent GPUs and run into an error.
2. What if we have a deployment with MIG-enabled and non-MIG-enabled GPUs?
3. `dask.distributed` diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment they use `pynvml` APIs that only support non-MIG-enabled GPUs.

Authors:
  - Anirban Das (https://github.com/akaanirban)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #674