
[minor] support multi-node training #389

Merged: 2 commits merged into master on Dec 2, 2024

Conversation

cdliang11 (Collaborator)

No description provided.

cdliang11 (Collaborator Author) commented Nov 29, 2024

voxceleb/v2/run.sh and cnceleb/v2/run.sh have been verified.
The default setting is HOST_NODE_ADDR="localhost:0", which means single machine.


JiJiJiang (Collaborator)
Please check the difference. @cdliang11

#131 by @czy97

cdliang11 (Collaborator Author)
Please check the difference. @cdliang11

#131 by @czy97

OK, I hadn't noticed #131 before.
After comparing the two:

  • The usage of torchrun is the same.
  • A small bug in train.py was fixed: the GPU is now obtained from local_rank; otherwise a list index out-of-range error occurs (see the sketch below).
  • run.sh was also updated to make it easier to use.
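
A minimal sketch of the mechanism behind that fix (paths and arguments are illustrative, not the exact ones in this repo): torchrun exports LOCAL_RANK, RANK and WORLD_SIZE to every worker process it spawns, so the training script can pick its GPU from LOCAL_RANK instead of indexing a user-supplied GPU list, which is what previously went out of range.

# Hypothetical single-node launch: torchrun spawns 4 workers and sets LOCAL_RANK=0..3.
# Inside train.py the device can then be chosen with
#   gpu = int(os.environ['LOCAL_RANK']); torch.cuda.set_device(gpu)
# so no --gpus list needs to be indexed (and nothing can overflow it).
torchrun --nnodes=1 --nproc_per_node=4 wespeaker/bin/train.py --config conf/config.yaml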

stage=-1
stop_stage=-1

HOST_NODE_ADDR="localhost:0"

JiJiJiang (Collaborator) commented Dec 2, 2024
  1. Is port 0 required here? Port 0 seems more likely to collide with system processes (torchrun's default port should be 29400).
  2. If we run two multi-GPU jobs on a single machine, is it enough to change port 0 to some other value? (Worth verifying.)

cdliang11 (Collaborator Author)
Port 0 is not required; it can be changed freely, e.g. standardized to 29400.
To run two multi-GPU jobs on a single machine, just use two different port numbers (see the sketch below).
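
A minimal sketch of that setup (GPU ids and ports are assumed; restricting each job to its own cards via CUDA_VISIBLE_DEVICES is just one option, run.sh may also expose its own gpu list):

# Two independent multi-GPU jobs on the same machine; each gets its own
# rendezvous port so the two c10d endpoints do not collide.
CUDA_VISIBLE_DEVICES=0,1 bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "localhost:29400" &
CUDA_VISIBLE_DEVICES=2,3 bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "localhost:29401" &
wait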

cdliang11 changed the title from "[train] support multi-node training" to "[minor] support multi-node training" on Dec 2, 2024
JiJiJiang (Collaborator) left a review comment
Good job!

JiJiJiang merged commit d2e1bf2 into master on Dec 2, 2024
4 checks passed
JiJiJiang deleted the chengdong-multi-node branch on December 2, 2024, 06:26
czy97 (Collaborator) commented Dec 2, 2024

voxceleb/v2/run.sh and cnceleb/v2/run.sh have been verified. The default setting is HOST_NODE_ADDR="localhost:0", which means single machine.


chengdong @cdliang11, a question: as far as I can see, this does not explicitly specify which machine is the host_node. Does torchrun now support finding the associated processes via the same job_id and then randomly assigning one of them as the host_node? I'm not sure whether my understanding is correct.

cdliang11 (Collaborator Author)
voxceleb/v2/run.sh and cnceleb/v2/run.sh have been verified. The default setting is HOST_NODE_ADDR="localhost:0", which means single machine.

chengdong @cdliang11, a question: as far as I can see, this does not explicitly specify which machine is the host_node. Does torchrun now support finding the associated processes via the same job_id and then randomly assigning one of them as the host_node? I'm not sure whether my understanding is correct.

The master is specified explicitly.
Taking voxceleb/v2 as an example, suppose there are two machines and 172.16.0.101 is chosen as the master node:

# Run on 172.16.0.101
bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "172.16.0.101:23333" --num_nodes 2
# Run on 172.16.0.102
bash run.sh --stage 3 --stop-stage 3 --HOST_NODE_ADDR "172.16.0.101:23333" --num_nodes 2
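
For context, HOST_NODE_ADDR and num_nodes presumably map onto torchrun's c10d rendezvous flags roughly as below (a sketch with placeholder values; the exact command assembled inside run.sh may differ):

# Both nodes run the same torchrun command; the rendezvous endpoint points at the master.
# --nproc_per_node, --rdzv_id and the train.py arguments here are placeholders.
torchrun --nnodes=2 --nproc_per_node=4 \
  --rdzv_id=wespeaker_job --rdzv_backend=c10d \
  --rdzv_endpoint=172.16.0.101:23333 \
  wespeaker/bin/train.py --config conf/config.yaml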
