NCCL all_reduce_perf errors with 5090s #287
Comments
It looks to me like there is no P2P connectivity between the 2 GPUs. There should be... So, in a way, |
GeForce cards do not support P2P. The CUDA failure seems to happen when we launch the kernel. I see this line in the log:
Perhaps that's the reason for the launch error? @RCS1 could you try to remove this line (line 488 of src/include/device.h): |
Hi sjeaugey, Thanks for your reply. I'm unsure where that file would be located? That doesn't seem to be a directory I have. |
Results of p2pBandwidthLatencyTest
|
Also of note: the NVIDIA 570 (closed) drivers do not recognize the 5090s (I'm unsure why), so I am currently using 570-open. Could this be causing issues? |
The ncclMaxSharedMem message is likely the issue here. That should have been a WARN/exit and I've fixed it in the next release. On these systems you'll just have to reduce the amount of Shared Memory NCCL requests with this change:
|
Is there a way I can adjust the test command to get around this and see performance? Thank you! |
No, you need to modify the NCCL library in order for the CUDA kernels to work on these GPU SKUs. It will be fixed in NCCL 2.26.x |
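To check whether an installed NCCL already falls in the 2.26 series mentioned above, here is a minimal Python sketch that decodes the integer version code NCCL reports (for example via `ncclGetVersion`). The helper names `decode_nccl_version` and `has_fix` are hypothetical, and treating every 2.26 release as fixed is an assumption, since only "2.26.x" is stated here.

```python
def decode_nccl_version(code):
    """Split an NCCL integer version code into (major, minor, patch)."""
    if code < 10000:
        # Older encoding (up to 2.8.x): major*1000 + minor*100 + patch
        return code // 1000, (code % 1000) // 100, code % 100
    # Newer encoding: major*10000 + minor*100 + patch
    return code // 10000, (code % 10000) // 100, code % 100


def has_fix(code, fixed_in=(2, 26, 0)):
    """Return True if the version is at or above the assumed fix release.

    Assumption: any 2.26 release counts as fixed; the exact patch
    level is not given in this thread.
    """
    return decode_nccl_version(code) >= fixed_in
```

With this sketch, a version code of 22501 (2.25.1) would be reported as lacking the fix, while 22603 (2.26.3) would be reported as having it.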
@RCS1 I was suggesting that you check out the NCCL source code and delete line 488 of src/include/device.h. Now, maybe my change wouldn't work, and the patch above would work better. |
all_reduce_perf test errors when using dual 5090 GPUs. Works fine with one 5090.
Using nvidia driver 570.86.16
1x 5090;
2x 5090s;
2x 5090s with NCCL_DEBUG=INFO
I've also tried adding "NCCL_P2P_DISABLE=1" with the same results.
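For anyone reproducing this, the following is a rough Python sketch of how the run with these environment variables could be assembled. The binary path `./build/all_reduce_perf` and the size flags (`-b 8 -e 128M -f 2`) are assumptions based on the usual nccl-tests layout, not the exact command used in this report, and `build_run` is a hypothetical helper name.

```python
import os


def build_run(num_gpus, disable_p2p=False, base_env=None,
              binary="./build/all_reduce_perf"):
    """Return (command, environment) for an all_reduce_perf run.

    The binary path and size flags are assumed defaults, not taken
    from the original report.
    """
    # Start from the caller's environment unless one is supplied.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"          # verbose NCCL logging
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"   # turn off peer-to-peer transport
    cmd = [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(num_gpus)]
    return cmd, env
```

The returned pair can be passed to `subprocess.run(cmd, env=env)`; calling `build_run(1)` and `build_run(2, disable_p2p=True)` gives the single-GPU and dual-GPU variants for comparison.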