Assigning Deterministic rank to Dask Workers Based on CUDA_VISIBLE_DEVICES #1573
@VibhuJawa thanks a lot for this PR. It looks good. However, it seems to uncover an issue with batch edge betweenness centrality as it is the only one failing. @ChuckHastings we were thinking about merging this PR as it fixes the |
Definitely merge this to fix the |
@VibhuJawa can you open a quick cuml PR that pins your branch in |
Started PR: https://github.com/rapidsai/cuml/pull/5462/files
This PR is the same as #1573 but targeted at branch-23.06 as a hotfix. CC: @rlratzel

Previously, `dask-raft` mapped processes to GPUs non-deterministically. In this PR, we assign a deterministic order to each worker based on the CUDA_VISIBLE_DEVICES environment variable, because NCCL>1.11 expects a process with `rank r` to be mapped to `r % num_gpus_per_node`.

This fixes rapidsai/cugraph#3478 and this raft-test in the MNMG setting: https://github.com/rapidsai/raft/blob/c1a7b7c0e33b11d2e7ff3bc5014e3b410a2edd0d/python/raft-dask/raft_dask/test/test_comms.py#L82-L84

Authors:
- Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
- Rick Ratzel (https://github.com/rlratzel)
- Corey J. Nolet (https://github.com/cjnolet)
A RAFT PR (rapidsai/raft#1573) that assigns deterministic ranks to dask workers was merged. It broke batch algorithms such as batch_edge_betweenness_centrality, because they picked the wrong worker as the root for the broadcast operation. This PR ensures that the worker with rank = 0 is the root of the broadcast operation.

Authors:
- jnke2016 ([email protected])

Approvers:
- Vibhu Jawa (https://github.com/VibhuJawa)
- Rick Ratzel (https://github.com/rlratzel)
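As a rough illustration of the fix described above, the root of a broadcast can be looked up by rank instead of by position in a worker list. This is a minimal sketch, not the cugraph code: the function name `broadcast_root` and the `worker_ranks` mapping (worker address to rank) are hypothetical.

```python
def broadcast_root(worker_ranks):
    """Return the address of the worker that holds rank 0.

    worker_ranks: dict of worker address -> rank (hypothetical structure).
    """
    roots = [address for address, rank in worker_ranks.items() if rank == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one rank-0 worker, found {len(roots)}")
    return roots[0]


# The rank-0 worker is not necessarily the first one listed.
worker_ranks = {"tcp://node0:40001": 1, "tcp://node0:40000": 0}
root = broadcast_root(worker_ranks)
```

Selecting by rank keeps the root stable even when the order in which workers are listed changes between runs.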
Closing as covered by #1587
Previously, `dask-raft` mapped processes to GPUs non-deterministically.

In this PR, we assign a deterministic order to each worker based on the CUDA_VISIBLE_DEVICES environment variable, because NCCL>1.11 expects a process with `rank r` to be mapped to `r % num_gpus_per_node`.

This fixes rapidsai/cugraph#3478 and this raft-test in the MNMG setting: raft/python/raft-dask/raft_dask/test/test_comms.py, lines 82 to 84 in c1a7b7c.
CC: @jnke2016
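
The rank assignment described in this PR can be sketched roughly as follows. This is an illustration under stated assumptions, not the raft-dask implementation: the function name `assign_ranks` and the shape of the `workers` mapping are hypothetical, every node is assumed to expose the same number of GPUs with one worker per GPU, and the first entry of each worker's CUDA_VISIBLE_DEVICES is assumed to be a numeric index of the GPU that worker uses.

```python
def assign_ranks(workers):
    """Map each worker address to a deterministic rank.

    workers: dict of address -> (hostname, CUDA_VISIBLE_DEVICES string).
    This structure is hypothetical and only serves the illustration.
    """
    def sort_key(item):
        address, (hostname, visible_devices) = item
        # Assumption: the first entry is the GPU this worker uses,
        # and it is a numeric index (not a UUID).
        first_device = int(visible_devices.split(",")[0])
        return (hostname, first_device, address)

    # Sorting by (host, device) makes the order independent of the
    # order in which workers happened to register.
    ordered = sorted(workers.items(), key=sort_key)
    return {address: rank for rank, (address, _) in enumerate(ordered)}


workers = {
    "tcp://node1:40002": ("node1", "1,0"),
    "tcp://node0:40001": ("node0", "1,0"),
    "tcp://node1:40003": ("node1", "0,1"),
    "tcp://node0:40000": ("node0", "0,1"),
}
ranks = assign_ranks(workers)
```

Under these assumptions, each worker's rank `r` satisfies `r % num_gpus_per_node == first_device`, which is the mapping the PR description says NCCL expects.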