[Doc] Add documentation for multi-node train with pytorch original ddp (#1296)

* update mn_train

* update

* Fix typos

Co-authored-by: Tai-Wang <[email protected]>
xieenze and Tai-Wang authored Mar 9, 2022
1 parent 33de208 commit fd3112b
Showing 2 changed files with 38 additions and 1 deletion.
22 changes: 21 additions & 1 deletion docs/en/1_exist_data_model.md
@@ -201,7 +201,27 @@ GPUS=16 ./tools/slurm_train.sh dev pp_kitti_3class hv_pointpillars_secfpn_6x8_16

You can check [slurm_train.sh](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) for full arguments and environment variables.

You can also use the native PyTorch DDP launcher via the script `multinode_train.sh` (this script also supports single-machine training).

On each machine, run:
```shell
./tools/multinode_train.sh ${CONFIG_FILE} ${NODE_NUM} ${NODE_RANK} ${MASTER_NODE_IP}
```
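
The positional arguments map one-to-one onto `torch.distributed.launch` flags; the mapping below is read directly from `multinode_train.sh` (shown in the second file of this diff):

```shell
# Argument mapping (taken from multinode_train.sh):
#   CONFIG_FILE     -> config file forwarded to tools/train.py
#   NODE_NUM        -> --nnodes      (total number of machines)
#   NODE_RANK       -> --node_rank   (0-based rank of this machine)
#   MASTER_NODE_IP  -> --master_addr (IP address of the rank-0 machine)
# Any arguments after the fourth are forwarded verbatim to tools/train.py.
```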

Here is an example using 16 GPUs on 2 nodes (8 GPUs per node), where the master node's IP is 10.10.10.10.

Run on node 0:
```shell
./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```

Run on node 1:
```shell
./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 1 10.10.10.10
```
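
The script reads the rendezvous port from the `PORT` environment variable (it defaults to 29500, as set in `multinode_train.sh` below), so if that port is already in use you can override it, using the same value on every node:

```shell
# Override the rendezvous port on all nodes; 29501 is an arbitrary free port.
PORT=29501 ./tools/multinode_train.sh hv_pointpillars_secfpn_6x8_160e_kitti-3d-3class.py 2 0 10.10.10.10
```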


If you just have multiple machines connected with Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html).
Training is usually slow if you do not have high-speed networking like InfiniBand.
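
Regardless of the interconnect, if a machine has more than one network interface, NCCL may pick the wrong one. NCCL's standard `NCCL_SOCKET_IFNAME` environment variable (part of NCCL itself, not of this repository) pins communication to a specific device, for example:

```shell
# eth0 is a placeholder interface name; replace it with your Ethernet device.
NCCL_SOCKET_IFNAME=eth0 ./tools/multinode_train.sh ${CONFIG_FILE} 2 0 10.10.10.10
```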

17 changes: 17 additions & 0 deletions tools/multinode_train.sh
@@ -0,0 +1,17 @@
#!/usr/bin/env bash
# Launch multi-node (or single-node) training with PyTorch's native DDP.

set -e
set -x

CONFIG=$1       # path to the training config file
NODE_NUM=$2     # total number of nodes
NODE_RANK=$3    # rank of this node, from 0 to NODE_NUM - 1
MASTER_ADDR=$4  # IP address of the rank-0 (master) node

PORT=${PORT:-29500}  # rendezvous port on the master node, overridable via env

PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=8 --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    "$(dirname "$0")"/train.py "$CONFIG" --launcher pytorch "${@:5}"
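
Note that the script hard-codes `--nproc_per_node=8`. A minimal sketch of a variant for nodes with a different GPU count could read it from an environment variable; `GPUS_PER_NODE` is an assumed name here, not something the repository defines:

```shell
# Sketch: make the per-node GPU count configurable (GPUS_PER_NODE is an
# assumed variable name, defaulting to the script's original value of 8).
GPUS_PER_NODE=${GPUS_PER_NODE:-8}
PYTHONPATH="$(dirname "$0")/..":$PYTHONPATH \
python -m torch.distributed.launch --nproc_per_node=$GPUS_PER_NODE --master_port=$PORT \
    --nnodes=$NODE_NUM --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR \
    "$(dirname "$0")"/train.py "$CONFIG" --launcher pytorch "${@:5}"
```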
