Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: Simplify NCCL worker rank identification #2228

Merged
merged 4 commits into from
Mar 15, 2024

Conversation

VibhuJawa
Copy link
Member

@VibhuJawa VibhuJawa commented Mar 14, 2024

This PR is based on @seberg work in #1928 .

From the PR:

This is a follow up on #1926, since the rank sorting seemed a bit hard to understand.

It does modify the logic in the sense that the host is now sorted by IP as a way to group based on it. But I don't really think that host sorting was ever a goal?

If the goal is really about being deterministic, then this should be more (or at least clearer) deterministic about order of worker IPs.

OTOH, if the NVML device order doesn't matter, we could just sort the workers directly.

The original #1587 mentions:

NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node
which is something that neither approach seems to quite assure, if such a requirement exists, I would want to do one of:

Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
At least raise an error, because if NCCL will end up raising the error it will be very confusing.

@VibhuJawa VibhuJawa requested a review from a team as a code owner March 14, 2024 22:14
Copy link

copy-pr-bot bot commented Mar 14, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@VibhuJawa VibhuJawa added the non-breaking Non-breaking change label Mar 14, 2024
@VibhuJawa VibhuJawa added the bug Something isn't working label Mar 14, 2024
@VibhuJawa
Copy link
Member Author

@cjnolet , Would it be possible to get an okay to test comment here please.

Want to understand why it will break on CI .

@VibhuJawa
Copy link
Member Author

/ok to test

2 similar comments
@jakirkham
Copy link
Member

/ok to test

@VibhuJawa
Copy link
Member Author

/ok to test

@jakirkham
Copy link
Member

@VibhuJawa could you please squash the commits so there is one signed commit? That should fix the CI issues here

seberg and others added 3 commits March 14, 2024 16:07
This is a follow up on rapidsaigh-1926, since the rank sorting seemed
a bit hard to understand.
It does modify the logic in the sense that the host is now sorted
by IP as a way to group based on it.  But I don't really think that
host sorting was ever a goal?

If the goal is really about being deterministic, then this should
be more (or at least clearer) deterministic about order of worker
IPs.
OTOH, if the NVML device order doesn't matter, we could just sort
the workers directly.

The original rapidsaigh-1587 mentions:

    NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to quite assure,
if such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing
  workers that are not involved in the operation.
* At least raise an error, because if NCCL will end up raising
  the error it will be very confusing.
@VibhuJawa VibhuJawa force-pushed the simplify-nccl-index branch from bfb1116 to 296c922 Compare March 14, 2024 23:09
@VibhuJawa VibhuJawa changed the title [TEST] Test https://github.com/rapidsai/raft/pull/1928 MAINT: Simplify NCCL worker rank identification Mar 15, 2024
@VibhuJawa VibhuJawa self-assigned this Mar 15, 2024
@VibhuJawa
Copy link
Member Author

@cjnolet , Would it be possible to get a review and merge on this .

Since we all-ready approved, #1928 and the only delta b/w them is a test we should be good to go.

@cjnolet
Copy link
Member

cjnolet commented Mar 15, 2024

/merge

@rapids-bot rapids-bot bot merged commit 36484f4 into rapidsai:branch-24.04 Mar 15, 2024
71 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants