[minor] support multi-node training #389
Please check the difference. @cdliang11
OK, I hadn't noticed #131 before.
examples/cnceleb/v2/run.sh
stage=-1
stop_stage=-1

HOST_NODE_ADDR="localhost:0"
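- Is port 0 required here? Port 0 seems more likely to be taken by a system process (torchrun's default port should be 29400).
- If two multi-GPU jobs run on the same machine, is it enough to change port 0 to some other value? (This is worth verifying.)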
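Port 0 is not required; it can be changed freely. We can standardize on 29400.
To run two multi-GPU jobs on a single machine, just give them two different port numbers.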
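The point about two jobs on one machine needing different ports can be sketched in shell. This is only an illustration: the helper names `addr_host`, `addr_port`, and `endpoints_distinct`, and the second port 29401, are hypothetical and not part of `run.sh`. The sketch splits a `host:port` string and checks that two endpoints do not collide.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; they are not part of run.sh.

# Print the host part of a "host:port" string.
addr_host() { echo "${1%:*}"; }

# Print the port part of a "host:port" string.
addr_port() { echo "${1##*:}"; }

# Succeed (exit status 0) when the two endpoints differ in host or in port.
endpoints_distinct() {
  [ "$(addr_host "$1")" != "$(addr_host "$2")" ] ||
    [ "$(addr_port "$1")" != "$(addr_port "$2")" ]
}

JOB_A_ADDR="localhost:29400"   # port suggested in the review for the first job
JOB_B_ADDR="localhost:29401"   # assumed, arbitrary second port for the other job

if endpoints_distinct "$JOB_A_ADDR" "$JOB_B_ADDR"; then
  echo "ok: jobs use distinct endpoints"
else
  echo "conflict: both jobs would share ${JOB_A_ADDR}"
fi
```

Each job would then be started with its own value, for example `--HOST_NODE_ADDR "$JOB_A_ADDR"` for one and `--HOST_NODE_ADDR "$JOB_B_ADDR"` for the other.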
Good job!
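chengdong @cdliang11, a question: as far as I can see, nothing here explicitly specifies which machine is the host_node. Does torchrun now find the associated processes through a shared job_id and then pick a host_node at random? I'm not sure I understand this correctly.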
The master is specified explicitly: both nodes are given the same HOST_NODE_ADDR, which names the master.
# Run on 172.16.0.101
bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "172.16.0.101:23333" --num_nodes 2
# Run on 172.16.0.102
bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "172.16.0.101:23333" --num_nodes 2
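As a rough sketch of what the two invocations above might translate to, the dry-run helper below prints a torchrun command without running it. It assumes that `run.sh` passes `HOST_NODE_ADDR` to torchrun as `--rdzv_endpoint` with the `c10d` backend; that mapping, the helper name `build_torchrun_cmd`, the script name `train.py`, the GPU count, and the job id are all assumptions for illustration, not taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints a torchrun command instead of running it.
# Assumes run.sh passes HOST_NODE_ADDR to torchrun as --rdzv_endpoint.
build_torchrun_cmd() {
  local host_node_addr="$1"  # e.g. "172.16.0.101:23333"
  local num_nodes="$2"       # total number of machines in the job
  local num_gpus="$3"        # worker processes (GPUs) per machine
  local job_id="$4"          # must be identical on every machine
  echo "torchrun --nnodes=${num_nodes} --nproc_per_node=${num_gpus}" \
    "--rdzv_id=${job_id} --rdzv_backend=c10d" \
    "--rdzv_endpoint=${host_node_addr} train.py"
}

# The same call would be made on both 172.16.0.101 and 172.16.0.102.
build_torchrun_cmd "172.16.0.101:23333" 2 8 2023
```

Because every machine receives the same endpoint, the machine at that address hosts the rendezvous, which matches the answer that the master is specified explicitly rather than chosen at random.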