
MAINT: Simplify NCCL worker rank identification #1928

Closed
wants to merge 3 commits into branch-24.04 from simplify-nccl-index

Conversation

@seberg (Contributor) commented Oct 25, 2023

This is a follow-up on gh-1926, since the rank sorting seemed a bit hard to understand.
It does modify the logic in the sense that workers are now sorted by host IP as a way to group them by host. But I don't really think that host sorting was ever a goal?

If the goal is really determinism, then this should be more (or at least more clearly) deterministic about the order of worker IPs.
On the other hand, if the NVML device order doesn't matter, we could just sort the workers directly.

The original gh-1587 mentions:

NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node

which is something that neither approach seems to fully guarantee. If such a requirement exists, I would want to do one of the following (a rough sketch of the sorting idea is included after this list):

  • Ensure we can actually guarantee this, but that requires initializing workers that are not involved in the operation.
  • At least raise an error, because if NCCL ends up raising the error itself it will be very confusing.
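
To make the question concrete, here is a hypothetical sketch of grouping Dask workers by host IP and handing out ranks host-by-host. This is not the raft-dask implementation; the worker-address format and the assign_ranks helper are assumptions for illustration only.

# Hypothetical sketch: group workers by host IP, assign ranks host-by-host.
# `workers` is assumed to be a list of "tcp://ip:port" addresses as returned
# by client.scheduler_info()["workers"].keys(); this is NOT the raft-dask code.
from collections import defaultdict
from urllib.parse import urlparse


def assign_ranks(workers):
    by_host = defaultdict(list)
    for address in workers:
        by_host[urlparse(address).hostname].append(address)

    ranks = {}
    rank = 0
    # Deterministic: hosts sorted by IP string, workers sorted within a host.
    for host in sorted(by_host):
        for address in sorted(by_host[host]):
            ranks[address] = rank
            rank += 1
    return ranks


# Example: two hosts with two workers each -> ranks 0..3 grouped by host.
# Note that if only a subset of a node's workers participates, rank r is not
# guaranteed to map to GPU r % num_gpus_per_node, which is the concern above.
print(assign_ranks([
    "tcp://10.0.0.2:40001", "tcp://10.0.0.1:40002",
    "tcp://10.0.0.1:40001", "tcp://10.0.0.2:40002",
]))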

@seberg seberg requested a review from a team as a code owner October 25, 2023 11:48
@copy-pr-bot (bot) commented Oct 25, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seberg seberg force-pushed the simplify-nccl-index branch from 7592639 to b76963e Compare October 25, 2023 11:49
@cjnolet (Member) commented Oct 27, 2023

/ok to test

@cjnolet cjnolet added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Oct 27, 2023
@seberg seberg force-pushed the simplify-nccl-index branch from 399a37d to cef2263 Compare November 3, 2023 10:10
@cjnolet (Member) commented Nov 7, 2023

/ok to test

@cjnolet (Member) left a review comment

LGTM pending CI!

@cjnolet (Member) commented Nov 8, 2023

@seberg it looks like a couple of the test cases failed. Do you want to try and get these changes into 23.12?

@seberg seberg force-pushed the simplify-nccl-index branch from 5fb2c83 to fca015e Compare November 14, 2023 20:36
@cjnolet (Member) commented Nov 15, 2023

/ok to test

@cjnolet (Member) commented Jan 17, 2024

@seberg there are still a few failing tests in this PR. Do you still intend to contribute these improvements? (Just doing some PR upkeep while we're gearing up for release.)

@cjnolet (Member) commented Jan 17, 2024

/ok to test

@cjnolet cjnolet changed the base branch from branch-23.12 to branch-24.02 January 17, 2024 15:56
@seberg seberg force-pushed the simplify-nccl-index branch 2 times, most recently from 9cccd58 to 8d2a9b0 Compare February 13, 2024 22:13
@seberg (Contributor, Author) commented Feb 13, 2024

I have rebased the test away, partly to understand whether it was the problem. I don't think it really got to the bottom of it anyway, although it would be nice to have:

import pytest

from raft_dask.common import Comms  # import added for completeness


@pytest.mark.nccl
@pytest.mark.parametrize(
    "subset", [slice(-1, None), slice(1), slice(None, None, -2)]
)
def test_comm_init_worker_subset(client, subset):
    # Basic test that initializing a subset of workers is fine.
    # `client` is assumed to be the usual pytest fixture providing a Dask client.
    cb = Comms(comms_p2p=True, verbose=True)

    # Only initialize comms on the selected subset of the cluster's workers.
    workers = list(client.scheduler_info()["workers"].keys())
    workers = workers[subset]
    cb.init(workers=workers)
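
For manual experimentation outside of pytest, the same subset initialization can be exercised by hand. A minimal sketch, assuming dask_cuda's LocalCUDACluster is available and Comms is importable from raft_dask.common; the cluster setup here is illustrative and not part of this PR.

# Hedged sketch: initialize comms on only a subset of workers by hand.
# Assumes dask-cuda and raft-dask are installed; the cluster setup is
# illustrative and not part of this PR.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

from raft_dask.common import Comms

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # one Dask worker per visible GPU
    client = Client(cluster)

    workers = list(client.scheduler_info()["workers"].keys())
    subset = workers[slice(None, None, -2)]  # same style of subset as the test above

    cb = Comms(comms_p2p=True, verbose=True)
    cb.init(workers=subset)
    cb.destroy()

    client.close()
    cluster.close()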

@seberg seberg changed the base branch from branch-24.02 to branch-24.04 February 13, 2024 22:22
@seberg seberg force-pushed the simplify-nccl-index branch from 8d2a9b0 to 063bd70 Compare February 13, 2024 22:22
@VibhuJawa (Member) commented:

@seberg, it would be amazing if we can get this PR in soon. In our internal cugraph experiments we ran into an error on 64+ nodes which seems to get fixed when we use this PR. I don't really understand why, but it will be nice to get our customers unblocked.

@alexbarghi-nv (Member) commented:

Just wanted to follow up on this; this PR is critical for us. We are currently unable to run at scale without it and have resorted to patching the containers we use for testing.

@seberg (Contributor, Author) commented Mar 11, 2024

@VibhuJawa would you be able to take a look at it? I can see that my test was bad (although I'm not sure in what way). Rebasing it away isn't great, but since it is gone, I really don't know what is blocking it (were there further test issues, beyond my failed attempt at adding a new test?).

@VibhuJawa (Member) commented Mar 12, 2024

@VibhuJawa would you be able to take a look at it?

Sure, will try to build and repro this locally and see where we land.

@VibhuJawa (Member) commented:

Started a PR here with your test @seberg. Can't seem to recreate the error with this test (#1928 (comment)) locally, so I'm trying to recreate it on CI.

#2228

@seberg (Contributor, Author) commented Mar 15, 2024

I am not sure if the tests will hit anything. I thought the test should run into it (before your old fixes anyway), but only if it is run with a certain set/number of workers.

@VibhuJawa (Member) commented Mar 15, 2024

I am not sure if the tests will hit anything. I thought the test should run into it (before your old fixes anyway), but only if it is run with a certain set/number of workers.

Yup, I agree. It's good to have the test, even though it is only valid at a certain scale of use.

I think it will be useful for me and for users in general, when running into problems at scale (often >128 workers), to verify that all of the raft-dask tests pass.

Let's see if CI passes on #2228 (the last run ran into unrelated C++ test issues, which I think should be fixed after merging main into it).

If the tests pass, we can either merge that PR or this PR with the tests added (no preference); if they do not, that gives us info for triaging actual failures.

@VibhuJawa (Member) commented Mar 15, 2024

@seberg, looks like CI passed. Do you think we should just merge #2228 once we have your approval and go-ahead? (I will obviously edit the PR title, description, etc. to match this PR.)

@seberg (Contributor, Author) commented Mar 15, 2024

Sure, go ahead.

@seberg seberg closed this Mar 15, 2024
rapids-bot bot pushed a commit that referenced this pull request Mar 15, 2024
This PR is based on @seberg's work in #1928.


Authors:
  - Vibhu Jawa (https://github.com/VibhuJawa)
  - Sebastian Berg (https://github.com/seberg)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2228