Multi-GPU / distributed training on ImageNet, with TensorFlow + Tensorpack + Horovod.
It reproduces the settings in the paper Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour.
The code is annotated with sentences from the paper.
Based on this baseline implementation, we implemented adversarial training and obtained ImageNet classifiers with state-of-the-art adversarial robustness. See our code release at facebookresearch/ImageNet-Adversarial-Training.
- TensorFlow>=1.5, tensorpack>=0.8.5.
- Horovod with NCCL support. See the Horovod documentation for installation instructions. A quick sanity check is sketched after this list.
- zmq_ops: optional but recommended.
- Prepare ImageNet data in the directory structure expected by tensorpack's ILSVRC12 dataflow.
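If you are not sure whether your Horovod build works with NCCL and MPI, a minimal sanity check such as the following can be launched with `mpirun` before starting real training (`check_horovod.py` is a hypothetical helper, not part of this repo):

```python
# check_horovod.py -- hypothetical helper, not part of this repo.
# Verifies that each MPI process can pin one GPU and that allreduce works.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU, indexed by its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    # Summing 1.0 over all workers should print the world size on every rank.
    total = sess.run(hvd.allreduce(tf.constant(1.0), average=False))
    print("rank %d/%d: allreduce sum = %.1f" % (hvd.rank(), hvd.size(), total))
```

Run it with, e.g., `mpirun -np 8 python3 check_horovod.py`; every rank should print the same sum, equal to the number of processes.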
# Single Machine, Multiple GPUs:
# Run the following two commands together:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 8 --output-filename test.log python3 ./imagenet-resnet-horovod.py -d 50 --data ~/data/imagenet/ --batch 64
# Multiple Machines with RoCE/IB:
host1$ ./serve-data.py --data ~/data/imagenet/ --batch 64
host2$ ./serve-data.py --data ~/data/imagenet/ --batch 64
$ mpirun -np 16 -H host1:8,host2:8 --output-filename test.log \
-bind-to none -map-by slot -mca pml ob1 \
-x NCCL_IB_CUDA_SUPPORT=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO \
-x PATH -x PYTHONPATH -x LD_LIBRARY_PATH \
python3 ./imagenet-resnet-horovod.py -d 50 \
--data ~/data/imagenet/ --batch 64 --validation distributed
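The commands above implement a producer/consumer split: `serve-data.py` builds the augmented ImageNet dataflow in its own process and streams batches over a ZMQ socket, and every training process launched by `mpirun` pulls from that socket. A simplified sketch of this pattern using tensorpack's dataflow utilities (FakeData and the socket address here are placeholders; the real scripts use the actual ImageNet pipeline and, optionally, zmq_ops):

```python
# Simplified sketch of the serve/receive pattern; not the actual scripts.
from tensorpack.dataflow import FakeData, RemoteDataZMQ, send_dataflow_zmq

ADDR = 'ipc:///tmp/imagenet-demo'   # placeholder socket address

def serve():
    # Producer process: build a (here, fake) batched dataflow and push it over ZMQ.
    df = FakeData([[64, 224, 224, 3], [64]], size=1000, random=False, dtype='uint8')
    send_dataflow_zmq(df, ADDR, hwm=150)

def receive():
    # Consumer (training) process: pull already-batched data from the socket.
    df = RemoteDataZMQ(ADDR, hwm=150)
    df.reset_state()
    for images, labels in df.get_data():
        pass  # feed the batch into the training step
```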
Notes:
- MPI does not like fork(), so running `serve-data.py` inside MPI is not a good idea.
- You may want to tune the best `-mca` and NCCL options for your own systems; see the Horovod docs for details. Note that falling back to a plain TCP connection will have much worse scaling efficiency.
- To train on small datasets, you don't need a separate data serving process or zmq_ops: simply load data inside each training process with its own data loader (see the sketch after these notes). The main motivation for the separate serving process is to avoid fork() inside MPI and to make benchmarking easier.
- You can pass `--no-zmq-ops` to both scripts to use Python for communication instead of the faster zmq_ops.
- If you're using slurm on a cluster, check out an example sbatch script.
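For the small-dataset case mentioned in the notes, each worker can build its own loader and keep only the batches belonging to its rank. A minimal sketch, using CIFAR-10 as a placeholder dataset (nothing here is specific to this repo's scripts):

```python
# Minimal sketch: per-process data loading sharded by Horovod rank,
# with no separate serving process. CIFAR-10 is only a placeholder dataset.
import horovod.tensorflow as hvd
from tensorpack.dataflow import dataset, BatchData

hvd.init()

df = dataset.Cifar10('train', shuffle=True)   # small dataset, loaded by every worker
df = BatchData(df, 64)
df.reset_state()

for i, dp in enumerate(df.get_data()):
    if i % hvd.size() != hvd.rank():
        continue                  # this batch belongs to another worker
    images, labels = dp           # feed this batch to the training step
```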
# To benchmark data speed:
$ ./serve-data.py --data ~/data/imagenet/ --batch 64 --benchmark
# To benchmark training with fake data:
# Run the training command with `--fake`
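To benchmark a dataflow of your own in the same spirit as `--benchmark`, tensorpack's `TestDataSpeed` measures how many batches per second a pipeline can produce, independent of training. A small sketch with FakeData standing in for the real augmentation pipeline:

```python
# Sketch: measure dataflow throughput only, with no GPU or training involved.
from tensorpack.dataflow import FakeData, BatchData, TestDataSpeed

df = FakeData([[224, 224, 3], [1]], size=6400, random=False, dtype='uint8')
df = BatchData(df, 64)                 # 100 batches of 64 fake images
TestDataSpeed(df, size=100).start()    # iterates the dataflow and prints the speed
```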
devices | batch per GPU | time [1] | top1 err [3] |
---|---|---|---|
32 P100s | 64 | 5h9min | 23.73% |
128 P100s | 32 | 1h40min | 23.62% |
128 P100s | 64 | 1h23min | 23.97% |
256 P100s | 32 | 1h9min [2] | 23.90% |
[1]: Validation time is excluded from the total time. Actual time depends on your hardware.
[2]: This corresponds exactly to the "1 hour" setting in the original paper (see the learning-rate sketch at the end of this section).
[3]: According to the paper, the final error typically fluctuates by ±0.1 or more.
Although scaling is not ideal at 32 machines, the code scales with 90+% efficiency on 2 or 4 machines.
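For reference, the "1 hour" setting above follows the paper's linear scaling rule: the base learning rate is 0.1 for every 256 samples of total batch size, reached through a gradual warmup over the first 5 epochs. A small illustration of the rule (function names are only for this sketch):

```python
# Illustration of the linear scaling rule + gradual warmup from the paper.
# 256 GPUs x 32 samples/GPU = total batch 8192, the "1 hour" setting.
def base_lr(batch_per_gpu, num_gpus):
    total_batch = batch_per_gpu * num_gpus
    return 0.1 * total_batch / 256.0

def lr_at_epoch(epoch, batch_per_gpu, num_gpus, warmup_epochs=5):
    target = base_lr(batch_per_gpu, num_gpus)
    if epoch < warmup_epochs:
        # Gradual warmup: ramp linearly from 0.1 to the target learning rate.
        # (The paper increases it every iteration; per-epoch is shown for brevity.)
        return 0.1 + (target - 0.1) * epoch / float(warmup_epochs)
    return target   # afterwards the usual /10 steps at epochs 30, 60, 80 apply

print(base_lr(32, 256))   # -> 3.2
```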