How to set NCCL_SOCKET_IFNAME #286
Comments
Did you try setting Otherwise you can also write Finally, if you only have those two interfaces, you should not need to set anything since the default is
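To illustrate the per-node setting discussed in this thread, here is a minimal sketch that picks an interface from what the local machine reports and exports it as NCCL_SOCKET_IFNAME before NCCL is initialized. The helper names `pick_nccl_interface` and `local_interface_names` and the preference order are hypothetical; only the interface names eno1 and eth0 come from the question. It is not the exact command from the comment above.

```python
import os
import socket


def pick_nccl_interface(available, preferred=("eno1", "eth0")):
    """Return the first preferred interface present on this machine, or None.

    `preferred` is an assumption taken from the interface names in this thread.
    """
    for name in preferred:
        if name in available:
            return name
    return None


def local_interface_names():
    # socket.if_nameindex() returns (index, name) pairs on Unix-like systems.
    return [name for _, name in socket.if_nameindex()]


if __name__ == "__main__":
    iface = pick_nccl_interface(local_interface_names())
    if iface is not None:
        # Must be set before the process group / NCCL communicator is created.
        os.environ["NCCL_SOCKET_IFNAME"] = iface
    print("NCCL_SOCKET_IFNAME =", os.environ.get("NCCL_SOCKET_IFNAME"))
```

Each node can run the same script and end up with a different value, which matches the point that the variable does not have to be identical on both machines. Note that lo is deliberately absent from the preference list: the loopback interface only carries traffic within one machine, so it cannot connect two nodes.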
It still gives the same error. Do I need to match the NCCL version on both computers?
The NCCL version needs to match, but the environment variables might be different (for NCCL_SOCKET_IFNAME at least). I would start by not setting You should see lines which look like :
That will tell you which IP interface NCCL uses on each node. Then you can verify whether those two interfaces can actually communicate with each other (check your firewall, network setup, ...).
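One way to check whether the two interfaces can reach each other is a plain TCP connection test between the nodes. The sketch below is an assumption-laden illustration, not part of NCCL: the function name `can_connect` is hypothetical, and the IP address and port in the usage comment are placeholders to replace with the peer's address on the interface NCCL reported and a port on which something is listening.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (hypothetical address and port, run from the other node):
#   print(can_connect("192.168.1.20", 29500))
```

A False result does not say why the connection failed: it can mean a firewall rule, a routing problem, or simply that nothing is listening on that port. A True result only covers the one port tested, so it is a first sanity check rather than proof that NCCL traffic will get through.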
Closing old issue. Please re-open if needed.
I have two computers. When I run ifconfig, the first one gives
‘eno1
lo’
and the second one gives
‘eth0
lo’
What should I specify in NCCL_SOCKET_IFNAME on each node? Do I specify lo on both?
I receive errors like
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
and
connection reset by peer