
Fix NVML index usage in CUDAWorker/LocalCUDACluster #671

Merged · 6 commits merged into rapidsai:branch-21.08 on Jul 20, 2021

Conversation

pentschev (Member)

Fix an issue where device index 0 would always be used in NVML functions. For CUDA runtime calls, we expect the targeted GPU to always be at index 0, since that index is relative to the CUDA_VISIBLE_DEVICES ordering. NVML, however, relies on absolute indices, so we must always use the actual index of the GPU being targeted rather than the first entry in CUDA_VISIBLE_DEVICES.
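To make the distinction concrete, here is a minimal pynvml sketch (illustrative only, not the PR's code; it assumes CUDA_VISIBLE_DEVICES contains integer device indices rather than UUIDs):

```python
import os

import pynvml

pynvml.nvmlInit()

# With CUDA_VISIBLE_DEVICES="4,5,6,7" the CUDA runtime renumbers the visible
# GPUs as 0, 1, 2, 3, so runtime calls always address the targeted GPU as index 0.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "4,5,6,7").split(",")

# NVML, however, uses absolute (physical) indices, so queries must use the
# real index of the targeted GPU (4 here), not the relative index 0.
absolute_index = int(visible[0])
handle = pynvml.nvmlDeviceGetHandleByIndex(absolute_index)
print(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
```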

This is normally not an issue if CUDA_VISIBLE_DEVICES is unset, or if it is simply set to `",".join(str(i) for i in range(get_n_gpus()))`, but it may be an issue when targeting a different list of GPUs. For example, on a DGX-1 the CPU affinity for GPUs 0-3 is 0-19,40-59 and for GPUs 4-7 it is 20-39,60-79; yet if the user set CUDA_VISIBLE_DEVICES=4, the CPU affinity reported for the targeted GPU would be that of device index 0, i.e., 0-19,40-59. This could result in lower performance, as well as incorrect computation of total GPU memory on non-homogeneous GPU systems.
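As an illustration of the CPU affinity effect, here is a rough sketch of reading the affinity via pynvml with the absolute device index (the `cpu_affinity` helper is hypothetical, not dask-cuda's implementation):

```python
import math
import os

import pynvml


def cpu_affinity(absolute_gpu_index):
    """List the CPU cores with affinity to the given physical GPU (sketch)."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(absolute_gpu_index)
    # NVML reports the affinity as a bitmask packed into 64-bit words.
    n_words = math.ceil(os.cpu_count() / 64)
    mask = pynvml.nvmlDeviceGetCpuAffinity(handle, n_words)
    return [
        word_idx * 64 + bit
        for word_idx, word in enumerate(mask)
        for bit in range(64)
        if word & (1 << bit)
    ]


pynvml.nvmlInit()
# On a DGX-1, querying the absolute index 4 should report cores 20-39,60-79;
# querying index 0 instead (the bug described above) would report 0-19,40-59.
print(cpu_affinity(4))
```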

@pentschev pentschev requested a review from a team as a code owner July 13, 2021 20:06
@github-actions github-actions bot added the python python code needed label Jul 13, 2021
@pentschev pentschev added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change and removed python python code needed labels Jul 13, 2021
pentschev (Member, Author)

This is similar to issues we had not long ago in Distributed, as @charlesbluca would remember. This came up when discussing MIG support with @akaanirban; thanks Anirban for bringing that to my attention.

pentschev (Member, Author)

@beckernick @VibhuJawa this may have affected past performance on DGX A100s, where we previously had to launch dask-cuda-worker manually for each GPU individually.

@github-actions github-actions bot added the python python code needed label Jul 13, 2021
codecov-commenter commented Jul 13, 2021

Codecov Report

Merging #671 (c41ad75) into branch-21.08 (79bc44e) will increase coverage by 30.12%.
The diff coverage is 88.88%.

❗ Current head c41ad75 differs from pull request most recent head f84fcf4. Consider uploading reports for the commit f84fcf4 to get more accurate results.
Impacted file tree graph

@@                Coverage Diff                @@
##           branch-21.08     #671       +/-   ##
=================================================
+ Coverage         60.19%   90.31%   +30.12%     
=================================================
  Files                21       15        -6     
  Lines              2605     1652      -953     
=================================================
- Hits               1568     1492       -76     
+ Misses             1037      160      -877     
Impacted Files                                    Coverage Δ
dask_cuda/local_cuda_cluster.py                   79.41% <ø> (ø)
dask_cuda/utils.py                                88.99% <83.33%> (+1.25%) ⬆️
dask_cuda/cli/dask_cuda_worker.py                 97.14% <100.00%> (+1.49%) ⬆️
dask_cuda/cuda_worker.py                          78.31% <100.00%> (ø)
dask_cuda/benchmarks/local_cudf_shuffle.py
dask_cuda/_version.py
dask_cuda/benchmarks/local_cudf_merge.py
dask_cuda/benchmarks/utils.py
dask_cuda/benchmarks/local_cupy_map_overlap.py
dask_cuda/benchmarks/local_cupy.py
... and 9 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79bc44e...f84fcf4.

pentschev (Member, Author)

rerun tests

akaanirban added a commit to akaanirban/dask-cuda that referenced this pull request Jul 16, 2021
quasiben (Member)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit c04856e into rapidsai:branch-21.08 Jul 20, 2021
@pentschev pentschev deleted the fix-nvml-device-index branch July 20, 2021 18:45
rapids-bot bot pushed a commit that referenced this pull request Aug 2, 2021
Adds support for starting `LocalCUDACluster` and CUDA workers on MIG instances by passing in the UUIDs of the MIG instances. Builds off of existing PR #671.
More specifically, this PR does the following:
1. Allows starting `LocalCUDACluster` as follows: `cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=["MIG-uuid1","MIG-uuid2",...])`, or by passing the UUIDs as a comma-separated string (see the sketch below).
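A minimal usage sketch based on the description above (the MIG UUIDs are placeholders; substitute real MIG instance UUIDs):

```python
from dask.distributed import Client

from dask_cuda import LocalCUDACluster

# Placeholder UUIDs; real values can be listed with `nvidia-smi -L` on a
# MIG-enabled GPU.
mig_uuids = ["MIG-uuid1", "MIG-uuid2"]

# Either a list of UUID strings...
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=mig_uuids)
# ...or a single comma-separated string:
# cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES=",".join(mig_uuids))

client = Client(cluster)
```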

Needs Discussion:
0. Apart from manually testing on a MIG instance in the cloud, how would we test this?
1. What if the user does not pass in any argument to `LocalCUDACluster` while using MIG instances? By default `LocalCUDACluster` will try to use all the parent GPUs and run into an error.
2. What if we have a deployment with MIG-enabled and non-MIG-enabled GPUs?
3. `dask.distributed` diagnostics will also fail if we run on MIG-enabled GPUs, since at the moment they use `pynvml` APIs that only support non-MIG-enabled GPUs.

Authors:
  - Anirban Das (https://github.com/akaanirban)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #674