How to set NCCL_SOCKET_IFNAME #286

Closed
vainaixr opened this issue Jan 21, 2020 · 4 comments

Comments

@vainaixr

vainaixr commented Jan 21, 2020

I have two computers. When I run ifconfig, the first one lists
‘eno1
lo’
and the second one lists
‘eth0
lo’
What should I specify in NCCL_SOCKET_IFNAME on both nodes? Should I specify lo on both?

I receive errors like
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
and
connection reset by peer

@sjeaugey
Member

lo is the loopback interface, so you won't be able to use it to communicate between two nodes.

Did you try setting NCCL_SOCKET_IFNAME=eno1,eth0?

Otherwise, you can write NCCL_SOCKET_IFNAME=eno1 in /etc/nccl.conf on one node and NCCL_SOCKET_IFNAME=eth0 in /etc/nccl.conf on the other node.
If you can't write to /etc/nccl.conf, you can use $HOME/.nccl.conf instead, provided that file is not shared between the two nodes.
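
For a PyTorch job, the same per-node choice can also be made in the launch script rather than in a config file. The following is only a minimal sketch: the hostnames `node1` and `node2`, the table `IFNAME_BY_HOST`, and both helper functions are hypothetical names introduced for illustration, and it assumes the variable is set before NCCL is initialized.

```python
import os
import socket

# Hypothetical hostname-to-interface table; replace with your own machines.
IFNAME_BY_HOST = {"node1": "eno1", "node2": "eth0"}


def ifname_for_host(hostname, table=IFNAME_BY_HOST):
    """Return the interface configured for this host, or None if unknown."""
    return table.get(hostname)


def configure_nccl_ifname(hostname=None):
    """Set NCCL_SOCKET_IFNAME for this process if the host is in the table."""
    hostname = hostname or socket.gethostname()
    ifname = ifname_for_host(hostname)
    if ifname is not None:
        # The environment variable must be in place before NCCL starts up.
        os.environ["NCCL_SOCKET_IFNAME"] = ifname
    return ifname
```

Calling `configure_nccl_ifname()` at the top of the training script, before the process group is created, would give each node its own value without touching /etc/nccl.conf.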

Finally, if you only have those two interfaces, you should not need to set anything, since the default is NCCL_SOCKET_IFNAME=^lo,docker, which picks the first interface that is neither lo nor docker*.
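
The default exclusion rule described above can be approximated in a few lines of Python to preview which interface a node would likely end up with. This is a rough sketch of the described behaviour, not NCCL's actual selection code: `first_usable_interface` and `local_interface_names` are hypothetical helpers, and the order in which the OS lists interfaces may differ from the order NCCL sees.

```python
import socket


def first_usable_interface(names, excluded_prefixes=("lo", "docker")):
    """Return the first name that does not start with an excluded prefix."""
    for name in names:
        if not name.startswith(tuple(excluded_prefixes)):
            return name
    return None


def local_interface_names():
    """List interface names on this machine as reported by the OS."""
    return [name for _index, name in socket.if_nameindex()]
```

Running `first_usable_interface(local_interface_names())` on each node shows the candidate interface; on the two machines in this thread that would be expected to be eno1 and eth0 respectively.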

@vainaixr
Author

It still gives the same error. Do I need to match the NCCL version on both computers?

@sjeaugey
Member

The NCCL version needs to match, but the environment variables can differ between nodes (for NCCL_SOCKET_IFNAME at least).

I would start by leaving NCCL_SOCKET_IFNAME unset and setting NCCL_DEBUG=INFO instead.

You should see lines that look like:

node1:12345:12345 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.0.1<0>

That will tell you which IP interface NCCL uses on each node. Then you can verify whether those two interfaces can actually communicate with each other (check your firewall, network setup, ...).
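
To compare the two nodes quickly, the interface and address can be pulled out of each node's debug output with a small script. This is a sketch that assumes the log line has exactly the shape quoted above; other NCCL versions may format it differently, and `BOOTSTRAP_RE` and `bootstrap_interfaces` are hypothetical names.

```python
import re

# Matches the "Bootstrap : Using [n]ifname:address<port>" shape shown above.
BOOTSTRAP_RE = re.compile(
    r"NCCL INFO Bootstrap : Using \[\d+\](?P<ifname>[^:\s]+):(?P<ip>[^<\s]+)<"
)


def bootstrap_interfaces(log_text):
    """Return (interface, address) pairs found in NCCL_DEBUG=INFO output."""
    return [
        (match.group("ifname"), match.group("ip"))
        for match in BOOTSTRAP_RE.finditer(log_text)
    ]
```

Feeding each node's captured output through `bootstrap_interfaces` gives the pair of addresses to test against each other, for example with ping from one node to the other.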

@sjeaugey
Member

Closing old issue. Please re-open if needed.
