Multi-GPU / distributed training on ImageNet, with TensorFlow + Tensorpack + Horovod.
It reproduces the settings in the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The code is annotated with sentences from the paper.
Based on this baseline implementation, we implemented adversarial training and obtained ImageNet classifiers with state-of-the-art adversarial robustness. See our code release at facebookresearch/ImageNet-Adversarial-Training.
- TensorFlow>=1.5, tensorpack>=0.8.5.
- Horovod with NCCL support. See the Horovod documentation for installation instructions. A quick sanity check is sketched after this list.
- zmq_ops: optional but recommended.
- Prepare ImageNet data in the directory structure expected by tensorpack's ILSVRC12 dataflow.
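If you are not sure whether your Horovod build works with NCCL and MPI, a minimal sanity check such as the following can be launched with `mpirun` before starting real training (`check_horovod.py` is a hypothetical helper, not part of this repo):

```python
# check_horovod.py -- hypothetical helper, not part of this repo.
# Verifies that each MPI process can pin one GPU and that allreduce works.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU, indexed by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    # Summing 1.0 over all workers should print the world size on every rank.
    total = sess.run(hvd.allreduce(tf.constant(1.0), average=False))
    print("rank %d/%d: allreduce sum = %.1f" % (hvd.rank(), hvd.size(), total))
```

Run it with, e.g., `mpirun -np 8 python3 check_horovod.py`; every rank should print the same sum, equal to the number of processes.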
# Single Machine, Multiple GPUs:
# Run the following two commands together:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 8 --output-filename test.log python3 ./imagenet-resnet-horovod.py -d 50 --data ~/data/imagenet/ --batch 64
# Multiple Machines with RoCE/IB:
host1$ ./serve-data.py --data ~/data/imagenet/ --batch 64
host2$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 16 -H host1:8,host2:8 --output-filename test.log \
-bind-to none -map-by slot -mca pml ob1 \
-x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \
-x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
python3 ./imagenet-resnet-horovod.py -d 50 \
--data ~/data/imagenet/ --batch 64 --validation distributed
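The commands above implement a producer/consumer split: `serve-data.py` builds the augmented ImageNet dataflow in its own process and streams batches over a ZMQ socket, and every training process launched by `mpirun` pulls from that socket. A simplified sketch of this pattern using tensorpack's dataflow utilities (FakeData and the socket address here are placeholders; the real scripts use the actual ImageNet pipeline and, optionally, zmq_ops):

```python
# Simplified sketch of the serve/receive pattern; not the actual scripts.
from tensorpack.dataflow import FakeData, RemoteDataZMQ, send_dataflow_zmq

ADDR = 'ipc:///tmp/imagenet-demo'   # placeholder socket address

def serve():
    # Producer process: build a (here, fake) batched dataflow and push it over ZMQ.
    df = FakeData([[64, 224, 224, 3], [64]], size=1000, random=False, dtype='uint8')
    send_dataflow_zmq(df, ADDR, hwm=150)

def receive():
    # Consumer (training) process: pull already-batched data from the socket.
    df = RemoteDataZMQ(ADDR, hwm=150)
    df.reset_state()
    for images, labels in df.get_data():
        pass  # feed the batch into the training step
```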
Notes:
- MPI does not like fork(), so running `serve-data.py` inside MPI is not a good idea.
- You may want to tune the best `-mca` and NCCL options for your own systems; see the Horovod docs for details. Note that falling back to a plain TCP connection will have much worse scaling efficiency.
- To train on small datasets, you don't need a separate data serving process or zmq_ops: simply load data inside each training process with its own data loader (see the sketch after these notes). The main motivation for the separate serving process is to avoid fork() inside MPI and to make benchmarking easier.
- You can pass `--no-zmq-ops` to both scripts to use Python for communication instead of the faster zmq_ops.
- If you're using slurm on a cluster, check out an example sbatch script.
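For the small-dataset case mentioned in the notes, each worker can build its own loader and keep only the batches belonging to its rank. A minimal sketch, using CIFAR-10 as a placeholder dataset (nothing here is specific to this repo's scripts):

```python
# Minimal sketch: per-process data loading sharded by Horovod rank,
# with no separate serving process. CIFAR-10 is only a placeholder dataset.
import horovod.tensorflow as hvd
from tensorpack.dataflow import dataset, BatchData

hvd.init()

df = dataset.Cifar10('train', shuffle=True)   # small dataset, loaded by every worker
df = BatchData(df, 64)
df.reset_state()

for i, dp in enumerate(df.get_data()):
    if i % hvd.size() != hvd.rank():
        continue                  # this batch belongs to another worker
    images, labels = dp           # feed this batch to the training step
```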
# To benchmark data speed:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64 --benchmark
# To benchmark training with fake data:
# Run the training command with `--fake`
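To benchmark a dataflow of your own in the same spirit as `--benchmark`, tensorpack's `TestDataSpeed` measures how many batches per second a pipeline can produce, independent of training. A small sketch with FakeData standing in for the real augmentation pipeline:

```python
# Sketch: measure dataflow throughput only, with no GPU or training involved.
from tensorpack.dataflow import FakeData, BatchData, TestDataSpeed

df = FakeData([[224, 224, 3], [1]], size=6400, random=False, dtype='uint8')
df = BatchData(df, 64)                 # 100 batches of 64 fake images
TestDataSpeed(df, size=100).start()    # iterates the dataflow and prints the speed
```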
devices | batch per GPU | time [1] | top1 err [3] |
---|---|---|---|
32 P100s | 64 | 5h9min | 23.73% |
128 P100s | 32 | 1h40min | 23.62% |
128 P100s | 64 | 1h23min | 23.97% |
256 P100s | 32 | 1h9min [2] | 23.90% |
[1]: Validation time is excluded from the total time. Actual time depends on your hardware.
[2]: This corresponds exactly to the "1 hour" setting in the original paper (see the learning-rate sketch at the end of this section).
[3]: According to the paper, the final error typically fluctuates by ±0.1 or more.
Although scaling is not ideal at 32 machines, the code scales with 90+% efficiency on 2 or 4 machines.
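For reference, the "1 hour" setting above follows the paper's linear scaling rule: the base learning rate is 0.1 for every 256 samples of total batch size, reached through a gradual warmup over the first 5 epochs. A small illustration of the rule (function names are only for this sketch):

```python
# Illustration of the linear scaling rule + gradual warmup from the paper.
# 256 GPUs x 32 samples/GPU = total batch 8192, the "1 hour" setting.
def base_lr(batch_per_gpu, num_gpus):
    total_batch = batch_per_gpu * num_gpus
    return 0.1 * total_batch / 256.0

def lr_at_epoch(epoch, batch_per_gpu, num_gpus, warmup_epochs=5):
    target = base_lr(batch_per_gpu, num_gpus)
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from 0.1 to the target learning rate.
        # (The paper increases it every iteration; per-epoch is shown for brevity.)
        return 0.1 + (target - 0.1) * epoch / float(warmup_epochs)
    return target   # afterwards the usual /10 steps at epochs 30, 60, 80 apply

print(base_lr(32, 256))   # -> 3.2
```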