This repository was archived by the owner on Sep 19, 2022. It is now read-only.
During distributed training with a PyTorch job, the 'Worker' sometimes cannot find the 'Master' and fails with the message below.
Traceback (most recent call last):
File "/workspace/src/bert/benchmark.py", line 2248, in <module>
main()
File "/workspace/src/bert/benchmark.py", line 2212, in main
torch.distributed.init_process_group(backend='nccl')
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
In a PyTorch job, the 'worker' checks its connection to the 'master' with the 'nslookup' command shown below, but the connection between 'master' and 'worker' might not be fully ready even when the nslookup command succeeds.
command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']
So I'm using the 'netcat' command instead of 'nslookup'.
The following example shows the netcat test failing even though the nslookup test succeeds.
In my environment, netcat succeeds 4 to 10 seconds after nslookup succeeds.
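To illustrate the difference between the two checks, here is a minimal Python sketch of a port-level readiness probe, roughly what the netcat test does. It is not the operator's code: the function name `wait_for_port` and its `interval` and `timeout` parameters are assumptions made for this example. It keeps retrying until a TCP connection to the master actually succeeds, so a name that already resolves but does not yet accept connections still counts as "not ready".

```python
import socket
import time


def wait_for_port(host, port, interval=2.0, timeout=120.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is established, or False if the
    deadline passes first. The name and defaults are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection resolves the name and then performs the TCP
            # handshake, so it covers both the DNS check and the port check.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers resolution failures, refused connections, and timeouts.
            if time.monotonic() >= deadline:
                return False
            print("waiting for master")
            time.sleep(interval)
```

A worker could call `wait_for_port(master_addr, master_port)` before `torch.distributed.init_process_group(...)`, where `master_addr` and `master_port` stand for whatever values the job is configured with.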
I guess that in k8s there is a slight delay between the service being created with its endpoint assigned and the virtual IP actually accepting connections on the port.
So, could you please check this issue?
Also, are there any plans to modify the code below so that the master port, as well as the master address, is passed as a parameter when creating the init container?
Right now I'm using the 'netcat' command with a hard-coded port, because only 'MasterAddr' is passed as a parameter when the init container is created.
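As a rough sketch of what avoiding the hard-coded port could look like on the Python side: the env-based rendezvous in the traceback reads `MASTER_ADDR` and `MASTER_PORT`, so a check could take the port from the same place. Whether those variables are visible inside the init container is an assumption here, and both the helper name `master_endpoint` and the fallback value `23456` are placeholders for this example only.

```python
import os


def master_endpoint(env=None, default_port=23456):
    """Return (host, port) for the master from MASTER_ADDR / MASTER_PORT.

    `env` defaults to os.environ; `default_port` is a placeholder fallback
    used only when MASTER_PORT is not set.
    """
    env = os.environ if env is None else env
    host = env["MASTER_ADDR"]  # raises KeyError if the address is missing
    port = int(env.get("MASTER_PORT", default_port))
    return host, port
```

With this, the port check follows whatever port the job was configured with instead of repeating a literal value in the command.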
Best regards!