
add multi-machine dist_train (#1303)
ZCMax authored Mar 15, 2022
1 parent fd3112b commit 9c7270d
Showing 5 changed files with 49 additions and 36 deletions.
21 changes: 7 additions & 14 deletions docs/en/1_exist_data_model.md
@@ -201,30 +201,23 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16

You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.
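For instance, a sketch of overriding the scheduler-related environment variables (the variable names `GPUS_PER_NODE` and `CPUS_PER_TASK` are assumptions based on the linked script, so verify them there before relying on them):

```shell
# Illustrative only: 16 GPUs spread over 2 nodes on partition "dev".
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 \
    ./tools/slurm_train.sh dev pp_kitti_3class ${CONFIG_FILE} ${WORK_DIR}
```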

-You can also use PyTorch's original DDP with the script `multinode_train.sh`. (This script also supports single-machine training.)
+If you launch with multiple machines simply connected with ethernet, you can run the following commands:

-For each machine, run
-```shell
-./tools/sh_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}
-```

-Here is an example of using 16 GPUs (2 nodes), with IP=10.10.10.10:
+On the first machine:

-run in node0:
```shell
-./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```

-run in node1:
+On the second machine:

```shell
-./tools/sh_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```


-If you have just multiple machines connected with ethernet, you can refer to the
-PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html).
Usually it is slow if you do not have high-speed networking like InfiniBand.
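As a concrete sketch of the two-node case (the IP, port, config path, and GPU count below are illustrative, not values fixed by this commit):

```shell
# On the first machine (node rank 0), assuming its IP is 10.10.10.10:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.10.10.10 \
    ./tools/dist_train.sh configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 8

# On the second machine (node rank 1), same master address and port:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.10.10.10 \
    ./tools/dist_train.sh configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 8
```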


### Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify a different port (29500 by default) for each job to avoid communication conflicts, as in the sketch below.
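A common pattern (a sketch; the `PORT` override matches the `PORT=${PORT:-29500}` default visible in the scripts below, and `CUDA_VISIBLE_DEVICES` is ordinary CUDA usage rather than something this commit adds):

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```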
16 changes: 15 additions & 1 deletion docs/zh_cn/1_exist_data_model.md
@@ -198,7 +198,21 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16

You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for all the arguments and environment variables.

-If you have multiple machines connected with ethernet, you can refer to the PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html); it is usually slow if you do not have a high-speed network like InfiniBand.
+If you want to use multiple machines connected with ethernet, you can run the following commands:

+On the first machine:

```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```

+On the second machine:

```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR ./tools/dist_train.sh $CONFIG $GPUS
```

+However, if you do not connect these machines with a high-speed network, training will be very slow.
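Note that `$MASTER_PORT` and `$MASTER_ADDR` in the commands above are placeholders you must set yourself, for example by exporting them on both machines before launching (the values below are illustrative):

```shell
export MASTER_ADDR=10.10.10.10   # IP of the first machine
export MASTER_PORT=29500         # any free port on the first machine
```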

### Launch multiple jobs on a single machine

16 changes: 14 additions & 2 deletions tools/dist_test.sh
@@ -3,8 +3,20 @@
CONFIG=$1
CHECKPOINT=$2
GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/test.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
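A usage sketch for the updated test script (the checkpoint path and the trailing `--eval mAP`, which `${@:4}` simply forwards to `test.py`, are illustrative assumptions):

```shell
# Single machine, 8 GPUs; NNODES, NODE_RANK and MASTER_ADDR fall back to their defaults.
./tools/dist_test.sh configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py \
    checkpoints/pointpillars_kitti-3d-3class.pth 8 --eval mAP
```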
15 changes: 13 additions & 2 deletions tools/dist_train.sh
@@ -2,8 +2,19 @@

CONFIG=$1
GPUS=$2
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/train.py \
+    $CONFIG \
+    --seed 0 \
+    --launcher pytorch ${@:3}
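Because each new variable uses the `${VAR:-default}` idiom, the old single-machine invocation keeps working unchanged; the script also now passes `--seed 0` to `train.py`, fixing the random seed on every run. A minimal sketch under those defaults (the config path is illustrative):

```shell
# With no environment variables set, this behaves as before:
# one node, rank 0, master at 127.0.0.1:29500.
./tools/dist_train.sh configs/pointpillars/hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 8
```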
17 changes: 0 additions & 17 deletions tools/multinode_train.sh

This file was deleted.
