
Assigning Deterministic rank to Dask Workers Based on CUDA_VISIBLE_DEVICES (branch-23.06) #1587

Merged
merged 1 commit into rapidsai:branch-23.06
Jun 12, 2023

Conversation

VibhuJawa
Member

This PR is the same as #1573 but targeted at branch-23.06 as a hotfix. CC: @rlratzel

Previously, dask-raft mapped processes to GPUs non-deterministically.

In this PR, we assign a deterministic order to each worker based on the CUDA_VISIBLE_DEVICES environment variable, since NCCL>1.11 expects a process with rank r to be mapped to GPU r % num_gpus_per_node.

This fixes rapidsai/cugraph#3478 and this raft test in an MNMG setting:

def func_test_device_multicast_sendrecv(sessionId, n_trials):
    # get_worker comes from dask.distributed; local_handle and
    # perform_test_comms_device_multicast_sendrecv are raft-dask test helpers.
    from dask.distributed import get_worker

    handle = local_handle(sessionId, dask_worker=get_worker())
    return perform_test_comms_device_multicast_sendrecv(handle, n_trials)
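The deterministic assignment described above can be sketched as follows. This is a hypothetical illustration, not raft-dask's actual code: the function names and the input mapping of worker address to CUDA_VISIBLE_DEVICES string are assumptions for the sketch.

```python
import os


def deterministic_local_rank(cuda_visible_devices=None):
    """Take the first GPU in CUDA_VISIBLE_DEVICES as the worker's local rank.

    Assumes the dask-cuda convention of one worker per GPU, with
    CUDA_VISIBLE_DEVICES rotated so each worker lists its own GPU first.
    """
    if cuda_visible_devices is None:
        cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    return int(cuda_visible_devices.split(",")[0])


def assign_ranks(workers):
    """Assign ranks deterministically so rank r maps to GPU r % num_gpus_per_node.

    `workers` is a hypothetical mapping of worker address ("host:port") to
    its CUDA_VISIBLE_DEVICES string.
    """
    # Sort by (host, first visible device) to get a deterministic order
    # in which each host's workers appear in device order.
    ordered = sorted(
        workers.items(),
        key=lambda kv: (kv[0].split(":")[0], deterministic_local_rank(kv[1])),
    )
    return {addr: rank for rank, (addr, _) in enumerate(ordered)}
```

With two workers on one host, the worker whose first visible device is GPU 0 always receives rank 0, regardless of the order in which the workers happened to start.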

@VibhuJawa VibhuJawa requested a review from a team as a code owner June 9, 2023 17:41
@github-actions github-actions bot added the python label Jun 9, 2023
@raydouglass raydouglass added bug Something isn't working non-breaking Non-breaking change labels Jun 9, 2023
Contributor

@rlratzel rlratzel left a comment


I believe this PR was originally here but created again to target 23.06 instead.

@raydouglass raydouglass merged commit 6c81a41 into rapidsai:branch-23.06 Jun 12, 2023
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
This is a follow-up on rapidsai gh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group workers based on it.  But I don't really think
that host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least more clearly) deterministic about the order of
worker IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsai gh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure.
If such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL ends up raising
  the error it will be very confusing.
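The grouping-by-IP idea in the commit message can be sketched like this. It is a minimal illustration under assumed inputs; the tuple format and function name are hypothetical and do not reflect raft's actual implementation.

```python
from collections import defaultdict


def sort_workers_by_host(workers):
    """Group workers by host IP, then rank hosts and workers deterministically.

    `workers` is a hypothetical list of (ip, port, nvml_device_index)
    tuples. Sorting host IPs groups workers on the same node together;
    the NVML device index then orders workers within a node, so rank r
    lands on device r modulo the per-node GPU count when every node
    contributes all of its GPUs.
    """
    by_host = defaultdict(list)
    for ip, port, dev in workers:
        by_host[ip].append((dev, port))

    ranks = {}
    rank = 0
    for ip in sorted(by_host):          # deterministic host order
        for dev, port in sorted(by_host[ip]):  # device order within a host
            ranks[(ip, port)] = rank
            rank += 1
    return ranks
```

Note the caveat from the commit message still applies: if only a subset of a node's workers participate, this ordering alone does not guarantee the r % num_gpus_per_node mapping.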
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
seberg added a commit to seberg/raft that referenced this pull request Oct 25, 2023
seberg added a commit to seberg/raft that referenced this pull request Nov 3, 2023
seberg added a commit to seberg/raft that referenced this pull request Nov 14, 2023
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
seberg added a commit to seberg/raft that referenced this pull request Feb 13, 2024
VibhuJawa pushed a commit to VibhuJawa/raft that referenced this pull request Mar 14, 2024
rapids-bot bot pushed a commit that referenced this pull request Mar 15, 2024
This PR is based on @seberg's work in #1928; its description is the same as the commit message quoted earlier in this thread.

Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2228