
Assigning Deterministic rank to Dask Workers Based on CUDA_VISIBLE_DEVICES #1573

Closed

Conversation

VibhuJawa
Member

@VibhuJawa VibhuJawa commented Jun 2, 2023

Previously, `dask-raft` mapped processes to GPUs non-deterministically.

In this PR, we assign each worker a deterministic rank based on the CUDA_VISIBLE_DEVICES environment variable, since NCCL > 1.11 expects the process with rank `r` to be mapped to GPU `r % num_gpus_per_node`.

This fixes rapidsai/cugraph#3478 and this raft test in the MNMG setting:

def func_test_device_multicast_sendrecv(sessionId, n_trials):
    handle = local_handle(sessionId, dask_worker=get_worker())
    return perform_test_comms_device_multicast_sendrecv(handle, n_trials)

CC: @jnke2016
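
For context, here is a minimal sketch of how a deterministic per-worker rank can be derived from CUDA_VISIBLE_DEVICES. It is an illustration under assumptions (the helper names and the use of `Client.run` are hypothetical), not the PR's actual diff:

```python
# Illustrative sketch only -- not the actual raft-dask implementation.
import os


def first_visible_device():
    """Index of the first GPU listed in CUDA_VISIBLE_DEVICES on this worker."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    return int(visible.split(",")[0])


def assign_deterministic_ranks(client):
    """Return a {worker address: rank} mapping that is stable across runs."""
    # Ask every worker which GPU it will actually use.
    device_per_worker = client.run(first_visible_device)

    def sort_key(item):
        address, device = item
        host = address.rsplit(":", 1)[0]  # strip the port so workers group by node
        return (host, device)

    # Sorting by (host, first visible device) makes the ordering deterministic,
    # so rank r always lands on GPU r % num_gpus_per_node, as NCCL expects.
    ordered = sorted(device_per_worker.items(), key=sort_key)
    return {address: rank for rank, (address, _) in enumerate(ordered)}
```

With a mapping like this, the comms layer can hand NCCL an explicit rank for each worker instead of relying on whatever order workers happened to register in.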

@github-actions github-actions bot added the python label Jun 2, 2023
@rlratzel rlratzel added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Jun 3, 2023
@jnke2016

jnke2016 commented Jun 7, 2023

@VibhuJawa thanks a lot for this PR. It looks good. However, it seems to uncover an issue with batch edge betweenness centrality, which is the only test still failing. @ChuckHastings we were thinking about merging this PR since it fixes the NCCL failure we have been debugging, and finding a fix for batch edge betweenness centrality later, unless you have something else in mind. In the meantime, I have been working on bindings for edge betweenness centrality (with the hope that this failure doesn't occur in the new implementation you are writing), but we have to make a call on this PR since raft is entering code freeze soon.

@VibhuJawa VibhuJawa marked this pull request as ready for review June 7, 2023 18:46
@VibhuJawa VibhuJawa requested a review from a team as a code owner June 7, 2023 18:46
@ChuckHastings
Contributor

> @VibhuJawa thanks a lot for this PR. It looks good. However, it seems to uncover an issue with batch edge betweenness centrality, which is the only test still failing. @ChuckHastings we were thinking about merging this PR since it fixes the NCCL failure we have been debugging, and finding a fix for batch edge betweenness centrality later, unless you have something else in mind. In the meantime, I have been working on bindings for edge betweenness centrality (with the hope that this failure doesn't occur in the new implementation you are writing), but we have to make a call on this PR since raft is entering code freeze soon.

Definitely merge this to fix the NCCL failure. The batch edge betweenness centrality failure is clearly a separate issue; we would need to resolve that separately anyway.

@cjnolet
Member

cjnolet commented Jun 7, 2023

@VibhuJawa can you open a quick cuml PR that pins your branch in get_raft.cmake just so we can make sure this doesn't somehow break cuml?

@VibhuJawa
Member Author

> @VibhuJawa can you open a quick cuml PR that pins your branch in get_raft.cmake just so we can make sure this doesn't somehow break cuml?

Started PR: https://github.com/rapidsai/cuml/pull/5462/files

@VibhuJawa VibhuJawa changed the base branch from branch-23.08 to branch-23.06 June 9, 2023 17:33
@VibhuJawa VibhuJawa requested review from a team as code owners June 9, 2023 17:33
@VibhuJawa VibhuJawa changed the base branch from branch-23.06 to branch-23.08 June 9, 2023 17:34
raydouglass pushed a commit that referenced this pull request Jun 12, 2023
This PR is the same as #1573 but targeted at branch-23.06 as a hotfix. CC: @rlratzel

Previously, `dask-raft` mapped processes to GPUs non-deterministically.

In this PR, we assign each worker a deterministic rank based on the CUDA_VISIBLE_DEVICES environment variable, since NCCL > 1.11 expects the process with rank `r` to be mapped to GPU `r % num_gpus_per_node`.

This fixes rapidsai/cugraph#3478 and this raft test in the MNMG setting: https://github.com/rapidsai/raft/blob/c1a7b7c0e33b11d2e7ff3bc5014e3b410a2edd0d/python/raft-dask/raft_dask/test/test_comms.py#L82-L84

Authors:
   - Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
   - Rick Ratzel (https://github.com/rlratzel)
   - Corey J. Nolet (https://github.com/cjnolet)
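
Since both the description above and this hotfix commit point at the MNMG multicast send/recv test, here is a hedged sketch of how a per-worker test body like `func_test_device_multicast_sendrecv` is typically driven from the client side; the helper and the `worker_addresses` argument are assumptions, not the exact raft-dask test code:

```python
# Illustrative sketch only -- not the exact raft-dask test harness.
from dask.distributed import wait


def run_on_all_workers(client, sessionId, worker_addresses, n_trials=5):
    """Submit one copy of the per-worker test body to every worker in the session."""
    futures = [
        client.submit(func_test_device_multicast_sendrecv, sessionId, n_trials,
                      workers=[address], pure=False)
        for address in worker_addresses
    ]
    wait(futures)
    # Every worker must report success for the collective to be considered healthy.
    assert all(f.result() for f in futures)
```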
raydouglass pushed a commit to rapidsai/cugraph that referenced this pull request Jun 13, 2023
A raft PR (rapidsai/raft#1573) that assigns deterministic ranks to Dask workers was merged, which broke batch algorithms such as batch_edge_betweenness_centrality by picking the wrong worker as the root of the broadcast operation.
This PR ensures that the worker with rank 0 is the root of the broadcast operation.

Authors:
   - jnke2016 ([email protected])

Approvers:
   - Vibhu Jawa (https://github.com/VibhuJawa)
   - Rick Ratzel (https://github.com/rlratzel)
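
To illustrate the fix described in that cugraph commit, here is a hedged sketch of pinning the broadcast root to the rank-0 worker; the helper name is an assumption, not cugraph's actual code:

```python
# Illustrative sketch only -- not cugraph's actual implementation.
def submit_to_rank0(client, rank_by_worker, func, *args):
    """Run `func` on the worker holding rank 0 so it acts as the broadcast root.

    `rank_by_worker` is a {worker address: rank} mapping, e.g. the one produced
    by assign_deterministic_ranks() in the sketch above.
    """
    root = next(address for address, rank in rank_by_worker.items() if rank == 0)
    # Pin the task to the rank-0 worker instead of letting Dask pick an arbitrary
    # one, which is what broke batch_edge_betweenness_centrality.
    return client.submit(func, *args, workers=[root], allow_other_workers=False,
                         pure=False)
```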
@VibhuJawa
Member Author

Closing as covered by #1587

@VibhuJawa VibhuJawa closed this Jun 13, 2023
Labels
bug (Something isn't working), non-breaking (Non-breaking change), python
Development

Successfully merging this pull request may close these issues.

[BUG]: MNMG tests failing on later version of NCCL (>2.11.4.1)
5 participants