MAINT: Simplify NCCL worker rank identification #1928
Conversation
Force-pushed from 7592639 to b76963e
/ok to test
Force-pushed from 399a37d to cef2263
/ok to test
LGTM pending CI!
@seberg it looks like a couple of the test cases failed. Do you want to try and get these changes into 23.12?
Force-pushed from 5fb2c83 to fca015e
/ok to test
@seberg there are still a few failing tests in this PR. Do you still intend to contribute these improvements? (Just doing some PR upkeep while we're gearing up for release.)
/ok to test
Force-pushed from 9cccd58 to 8d2a9b0
I have rebased the test away, partly to understand whether it was the problem. I don't think that really got to the bottom of it anyway, although it would be nice to have:
This is a follow up on gh-1926, since the rank sorting seemed a bit hard to understand.
It does modify the logic in the sense that hosts are now sorted by IP as a way to group workers based on them. But I don't really think that host sorting was ever a goal? If the goal is really about being deterministic, then this should be more (or at least more clearly) deterministic about the order of worker IPs. OTOH, if the NVML device order doesn't matter, we could just sort the workers directly.
The original gh-1587 mentions that NCCL>1.11 expects a process with rank r to be mapped to r % num_gpus_per_node, which is something that neither approach seems to quite assure. If such a requirement exists, I would want to do one of:
* Ensure we can guarantee this, but this requires initializing workers that are not involved in the operation.
* At least raise an error, because if NCCL ends up raising the error it will be very confusing.
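For illustration, here is a minimal Python sketch of the group-by-host-IP rank assignment described above, assuming Dask-style worker addresses like `tcp://10.0.0.1:46001`. The `assign_ranks` helper and the address list are invented for the example; this is not the actual raft-dask code.

```python
# Hypothetical sketch of deterministic rank assignment, grouping workers by
# host IP; not the raft-dask implementation.
from collections import defaultdict
from urllib.parse import urlparse


def assign_ranks(worker_addresses):
    """Group workers by host, then hand out consecutive ranks per host."""
    by_host = defaultdict(list)
    for addr in worker_addresses:
        by_host[urlparse(addr).hostname].append(addr)

    ranks, rank = {}, 0
    # Sorting host strings is deterministic (even though it is lexicographic,
    # not numeric, IP order) and keeps each host's workers in one rank block.
    for host in sorted(by_host):
        for addr in sorted(by_host[host]):
            ranks[addr] = rank
            rank += 1
    return ranks


workers = [
    "tcp://10.0.0.2:46001",
    "tcp://10.0.0.1:46002",
    "tcp://10.0.0.1:46001",
]
print(assign_ranks(workers))
# {'tcp://10.0.0.1:46001': 0, 'tcp://10.0.0.1:46002': 1, 'tcp://10.0.0.2:46001': 2}
```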
Force-pushed from 8d2a9b0 to 063bd70
@seberg, it would be amazing if we can get this PR in soon. In our internal cugraph experiments we ran into an error on 64+ nodes which seems to be fixed when we use this PR. I don't really understand why, but it would be nice to get our customers unblocked.
Just wanted to follow up on this; this PR is critical for us. We are currently unable to run at scale without it and have resorted to patching the containers we use for testing.
@VibhuJawa would you be able to take a look at it? I can see that my test was bad (although I'm not sure in what way). Rebasing it away isn't great, but since it is gone, I really don't know what is blocking it (were there further test issues, beyond my failed attempt at adding a new test?).
Sure, will try to build and repro this locally and see where we land.
Started a PR here with your test, @seberg. Can't seem to recreate the error with this test (#1928 (comment)) locally, so trying to recreate it on CI.
I am not sure if the tests will hit anything. I thought the test should run into it (before your old fixes anyway), but only if it is run with a certain set/number of workers.
Yup, I agree. It's good to have the test, even if it is only valid at a certain scale of use. I think it will be useful for me and for users in general when running into problems at scale (often >128 workers) to verify that all of the raft-dask tests pass. Let's see if CI passes in #2228 (the last run ran into unrelated C++ test issues, which I think should be fixed after merging main into it). If the tests pass, we can either merge that PR or this PR with the tests added (no preference); if not, it gives us info for triaging the actual failures.
Sure, go ahead.
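As an aside on tests that are only meaningful at scale, here is a hedged sketch of how such a test might guard itself so it skips on small clusters; the `client` fixture name and the 128-worker threshold are assumptions, not the actual test from #2228.

```python
# Hypothetical guard for a test that only exercises the bug at scale.
import pytest

MIN_WORKERS = 128  # assumed threshold; the reported failures needed 64+ nodes


def test_rank_identification_at_scale(client):
    # `client` is assumed to be a dask.distributed Client fixture.
    n_workers = len(client.scheduler_info()["workers"])
    if n_workers < MIN_WORKERS:
        pytest.skip(f"needs at least {MIN_WORKERS} workers, got {n_workers}")
    # ... exercise the NCCL rank assignment here ...
```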
This PR is based on @seberg's work in #1928. From the PR: see the description quoted above.
Authors:
- Vibhu Jawa (https://github.com/VibhuJawa)
- Sebastian Berg (https://github.com/seberg)
Approvers:
- Corey J. Nolet (https://github.com/cjnolet)
URL: #2228
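To make the `r % num_gpus_per_node` expectation concrete, here is a hypothetical check in the spirit of the "at least raise an error" option from the PR description; `check_nccl_rank_layout` and its inputs are invented for illustration and are not part of raft-dask.

```python
# Hypothetical validation of the NCCL layout expectation: the i-th rank on a
# host should satisfy rank % gpus_per_node == i (its local GPU index).
def check_nccl_rank_layout(rank_to_host, gpus_per_node):
    local = {}  # host -> next local GPU index
    for rank in sorted(rank_to_host):
        host = rank_to_host[rank]
        expected = rank % gpus_per_node
        actual = local.setdefault(host, 0)
        if actual != expected:
            # Failing here early is clearer than the confusing error NCCL
            # would otherwise raise later.
            raise RuntimeError(
                f"rank {rank} on {host} maps to local GPU {actual}, "
                f"but NCCL expects {expected}"
            )
        local[host] += 1


check_nccl_rank_layout({0: "n0", 1: "n0", 2: "n1", 3: "n1"}, gpus_per_node=2)  # ok
check_nccl_rank_layout({0: "n0", 1: "n1", 2: "n0", 3: "n1"}, gpus_per_node=2)  # raises
```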