ResNet-Horovod

Tensorpack + Horovod

Multi-GPU / distributed training on ImageNet, with TensorFlow + Tensorpack + Horovod.

It reproduces the settings in the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour".

The code is annotated with sentences from the paper.

Based on this baseline implementation, we implemented adversarial training and obtained ImageNet classifiers with state-of-the-art adversarial robustness. See our code release at facebookresearch/ImageNet-Adversarial-Training.

Dependencies:

  • TensorFlow >= 1.5, tensorpack >= 0.8.5.
  • Horovod with NCCL support. See the Horovod documentation for installation instructions.
  • zmq_ops: optional but recommended.
  • Prepare ImageNet data into this structure (a quick sanity check for the layout is sketched below).
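
As a quick, hedged sanity check, assuming the commonly used ILSVRC12 layout of train/<wnid>/*.JPEG plus a flat val/ directory (treat the linked structure as authoritative if it differs):

# sanity-check the ImageNet directory layout (assumed structure, see above)
import glob
import os

data = os.path.expanduser('~/data/imagenet')
train_classes = [d for d in os.listdir(os.path.join(data, 'train')) if d.startswith('n')]
print('train classes:', len(train_classes))                                # expect 1000
print('val images:', len(glob.glob(os.path.join(data, 'val', '*.JPEG'))))  # expect 50000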

Run:

# Single Machine, Multiple GPUs:
# Run the following two commands together:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 8 --output-filename test.log python3 ./imagenet-resnet-horovod.py -d 50 --data ~/data/imagenet/ --batch 64
# Multiple Machines with RoCE/IB:
host1$ ./serve-data.py --data ~/data/imagenet/ --batch 64
host2$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 16 -H host1:8,host2:8 --output-filename test.log \
    -bind-to none -map-by slot -mca pml ob1 \
    -x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \
    -x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
    python3 ./imagenet-resnet-horovod.py -d 50 \
    --data ~/data/imagenet/ --batch 64 --validation distributed
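
For reference, a generic sketch of what each process launched by mpirun does, using Horovod's standard TF1 pattern. This is not the repo's actual code, which delegates these steps to tensorpack's HorovodTrainer and trains a real ResNet-50:

# one MPI process per GPU; Horovod handles the collective communication
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())   # pin each rank to one local GPU

images = tf.random_normal([64, 224, 224, 3])    # stand-in for the real input pipeline
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
loss = tf.reduce_mean(logits)                   # stand-in for the real loss

lr = 0.1 * (64 * hvd.size()) / 256              # linear LR scaling rule from the paper
opt = tf.train.MomentumOptimizer(lr, momentum=0.9)
opt = hvd.DistributedOptimizer(opt)             # all-reduce gradients (NCCL) every step
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]   # sync initial weights from rank 0
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(10):                         # the real script runs for ~100 epochs
        sess.run(train_op)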

Notes:

  1. MPI does not like fork(), so running serve-data.py inside MPI is not a good idea.
  2. You may tune the mca and NCCL options for best performance on your own system; see the Horovod docs for details. Note that falling back to plain TCP gives much worse scaling efficiency.
  3. To train on small datasets, you don't need a separate data-serving process or zmq ops; simply load data inside each training process with its own data loader. The main motivations for a separate data-serving process are to avoid fork() inside MPI and to make benchmarking easier (a sketch of this pattern follows the list).
  4. You can pass --no-zmq-ops to both scripts to use Python for communication instead of the faster zmq_ops.
  5. If you're using slurm on a cluster, check out the example sbatch script.
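
A minimal sketch of the separate data-serving pattern from note 3, using tensorpack's ZMQ dataflow utilities (send_dataflow_zmq / RemoteDataZMQ). The pipe name here is made up, and the real scripts decode, augment, and use the faster zmq_ops transport:

import os
import sys
from tensorpack.dataflow import BatchData, dataset
from tensorpack.dataflow.remote import RemoteDataZMQ, send_dataflow_zmq

PIPE = 'ipc://@imagenet-train-pipe'    # hypothetical socket name
DATA = os.path.expanduser('~/data/imagenet')

if sys.argv[1:] == ['serve']:
    # data-serving process: launched once per machine, outside of mpirun
    df = BatchData(dataset.ILSVRC12Files(DATA, 'train'), 64)
    send_dataflow_zmq(df, PIPE, bind=True)     # blocks forever, pushing batches into the pipe
else:
    # training process: each MPI rank connects, pulls ready-made batches,
    # and hands this dataflow to the trainer as its input source
    df = RemoteDataZMQ(PIPE, bind=False)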

Performance Benchmark:

# To benchmark data speed:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64 --benchmark
# To benchmark training with fake data:
# Run the training command with `--fake`
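
The raw dataflow throughput can also be measured directly in Python with tensorpack's TestDataSpeed; a sketch under the same layout assumption as above, not part of the repo's scripts:

# measure raw image-loading throughput; the real serve-data.py additionally
# applies the training-time augmentations before batching
import os
from tensorpack.dataflow import TestDataSpeed, dataset

df = dataset.ILSVRC12(os.path.expanduser('~/data/imagenet'), 'train', shuffle=True)
TestDataSpeed(df, 1000).start()    # prints throughput over 1000 datapoints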

Distributed ResNet50 Results:

devices      batch per GPU    time [1]      top1 err [3]
32 P100s     64               5h9min        23.73%
128 P100s    32               1h40min       23.62%
128 P100s    64               1h23min       23.97%
256 P100s    32               1h9min [2]    23.90%

[1]: Validation time is excluded from the total time. Time depends on your hardware.

[2]: This corresponds exactly to the "1 hour" setting in the original paper.

[3]: The final error typically fluctuates by ±0.1 or more, according to the paper.

Although scaling is not ideal at 32 machines, the code scales with 90%+ efficiency on 2 or 4 machines.