[Doc] Add documentation for multi-node train with pytorch original ddp (#1296)

* update mn_train

* update

* Fix typos

Co-authored-by: Tai-Wang <[email protected]>
xieenze and Tai-Wang authored Mar 9, 2022
1 parent 33de208 commit fd3112b
Showing 2 changed files with 38 additions and 1 deletion.
22 changes: 21 additions & 1 deletion docs/en/1_exist_data_model.md
@@ -201,7 +201,27 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16

You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.

You can also use the native PyTorch DDP launcher via the script `multinode_train.sh` (this script also supports single-machine training).

On each machine, run:
```shell
./tools/multinode_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}
```
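
The positional arguments map one-to-one onto `torch.distributed.launch` flags; the mapping below is read directly from `multinode_train.sh` (shown in the second file of this diff):

```shell
# Argument mapping (taken from multinode_train.sh):
#   CONFIG_FILE     -> config file forwarded to tools/train.py
#   NODE_NUM        -> --nnodes      (total number of machines)
#   NODE_RANK       -> --node_rank   (0-based rank of this machine)
#   MASTER_NODE_IP  -> --master_addr (IP address of the rank-0 machine)
# Any arguments after the fourth are forwarded verbatim to tools/train.py.
```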

Here is an example using 16 GPUs on 2 nodes (8 GPUs per node), where the master node's IP is 10.10.10.10.

Run on node 0:
```shell
./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```

Run on node 1:
```shell
./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
```
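
The script reads the rendezvous port from the `PORT` environment variable (it defaults to 29500, as set in `multinode_train.sh` below), so if that port is already in use you can override it, using the same value on every node:

```shell
# Override the rendezvous port on all nodes; 29501 is an arbitrary free port.
PORT=29501 ./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```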


If you just have multiple machines connected with Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html).
Training is usually slow if you do not have high-speed networking like InfiniBand.
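
Regardless of the interconnect, if a machine has more than one network interface, NCCL may pick the wrong one. NCCL's standard `NCCL_SOCKET_IFNAME` environment variable (part of NCCL itself, not of this repository) pins communication to a specific device, for example:

```shell
# eth0 is a placeholder interface name; replace it with your Ethernet device.
NCCL_SOCKET_IFNAME=eth0 ./tools/multinode_train.sh ${CONFIG_FILE} 2 0 10.10.10.10
```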

17 changes: 17 additions & 0 deletions tools/multinode_train.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# Launch multi-node (or single-node) training with PyTorch's native DDP.

set -e
set -x

CONFIG=$1       # path to the training config file
NODE_NUM=$2     # total number of nodes
NODE_RANK=$3    # rank of this node, from 0 to NODE_NUM - 1
MASTER_ADDR=$4  # IP address of the rank-0 (master) node

PORT=${PORT:-29500}  # rendezvous port on the master node, overridable via env

PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    "$(dirname "$0")"/train.py "$CONFIG" --launcher pytorch "${@:5}"
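
Note that the script hard-codes `--nproc_per_node=8`. A minimal sketch of a variant for nodes with a different GPU count could read it from an environment variable; `GPUS_PER_NODE` is an assumed name here, not something the repository defines:

```shell
# Sketch: make the per-node GPU count configurable (GPUS_PER_NODE is an
# assumed variable name, defaulting to the script's original value of 8).
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS_PER_NODE --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    "$(dirname "$0")"/train.py "$CONFIG" --launcher pytorch "${@:5}"
```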
