Multi node caffe #3441

Closed · wants to merge 76 commits into from
Conversation

@cque7 commented Dec 11, 2015

Moving the multi-node code to github/01org.
This version of the code contains:

  • Tests on Cifar-full using 5 machines show that this multi-node code converges correctly.
    Tests on AlexNet with 54 machines are ongoing.
  • A graph-based model server that can handle more general solver configurations.
  • A parameter server that supports BSP, A-BSP, and SSP [1], configurable by users. With a centralized model server handling scheduling, we were able to build a lightweight PS without extra components such as name nodes.
6 Nov Notes:

One of the main objectives of this patch is to reduce communication cost when scaling Caffe to multiple nodes. Some deep convolutional networks have very large numbers of parameters (e.g. AlexNet has ~220 MB of parameters), and exchanging these parameters with a parameter server is costly in a distributed environment.
The patch is mainly motivated by the fact that, in CNNs, FC layers contain most of the parameters but consume only a small fraction of the time. For example, the author of [2] reports that the FC layers in AlexNet hold ~95% of the parameters but account for only about 5% of the runtime. The optimization opportunity: suppose we have 20 nodes; since the FC layers consume only 5% of the time, we can have a single node run all the FC layers while the remaining 19 nodes run the conv layers. Doing this immediately removes about 95% of the communication caused by parameter exchanges. With more than 20 nodes, we can further split the FC layers into smaller parts, as illustrated below (see also the worked estimate after the figure):
[Figure: model_parallel — splitting the FC layers across dedicated nodes]
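
As a quick sanity check of the 95% figure, here is a back-of-envelope sketch (my own illustration, not code from this patch) using the AlexNet-style numbers quoted above: ~220 MB of parameters, ~95% of them in FC layers, 20 nodes with 1 dedicated FC node.

```python
# Hypothetical helper for estimating per-iteration parameter traffic; the numbers
# below come from the description above, the function itself is an assumption.

def param_traffic_mb(total_params_mb, fc_fraction, num_nodes, num_fc_nodes):
    """Return (plain_data_parallel, conv_fc_split) parameter traffic in MB per iteration."""
    conv_params = total_params_mb * (1.0 - fc_fraction)
    plain = total_params_mb * num_nodes               # every node exchanges the full model
    split = conv_params * (num_nodes - num_fc_nodes)  # conv nodes exchange conv weights only
    return plain, split

plain, split = param_traffic_mb(220, 0.95, 20, 1)
print(f"plain data parallel: {plain:.0f} MB/iter, conv/FC split: {split:.0f} MB/iter")
# -> 4400 MB vs. ~209 MB, i.e. roughly the 95% reduction estimated above.
```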

Another trick we use is called "dynamic pipelines". We maintain a pool of free solvers in each node (the solvers share parameters), and each message (pipeline) in the system is assigned a unique msg_id. When a "forward" packet arrives at a node, the node selects a free solver and associates it with the msg_id in the message. By doing this along the forward flow, we create a virtual pipeline that connects the sub-solvers in different nodes together. In the backward pass, we simply look up the corresponding solver by its msg_id (the msg_id can also be thought of as a pipeline_id). A small sketch of this bookkeeping follows.
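
Below is a minimal sketch of that bookkeeping as I understand it; it is an assumption about the mechanism, not the actual PR code.

```python
# Each node keeps a pool of free solvers that share parameters; an in-flight
# forward pass is bound to its msg_id (pipeline_id) until the matching
# backward message arrives.

class SolverPool:
    def __init__(self, solvers):
        self.free = list(solvers)   # solvers sharing the same parameters
        self.in_flight = {}         # msg_id -> solver currently serving that pipeline

    def on_forward(self, msg_id):
        solver = self.free.pop()             # grab any free solver
        self.in_flight[msg_id] = solver      # remember which solver owns this pipeline
        return solver                        # caller runs the forward pass with it

    def on_backward(self, msg_id):
        solver = self.in_flight.pop(msg_id)  # same solver that ran the forward pass
        self.free.append(solver)             # caller runs backward, then it is free again
        return solver
```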

28 Oct Notes:
- Data parallelism in convolution layers: conv weights are exchanged via the parameter server.
- Model parallelism for fully connected layers: a model server acts as a centralized scheduler that splits big models (e.g. AlexNet / VGG) into smaller ones and generates routing tables to connect the fully connected nodes together (a sketch of such a split follows).
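
For illustration only, here is one way such a split could look; the layout and the node addresses are my assumptions, not the PR's actual protocol.

```python
# Hypothetical sketch of the model server's job for FC layers: split one big FC
# layer's outputs across the FC nodes and record a routing table so conv clients
# know where to send their activations.

def split_fc_layer(layer_name, num_outputs, fc_nodes):
    """Assign contiguous output ranges of an FC layer to FC nodes."""
    per_node = (num_outputs + len(fc_nodes) - 1) // len(fc_nodes)
    routing_table = []
    for i, node in enumerate(fc_nodes):
        start = i * per_node
        end = min(start + per_node, num_outputs)
        routing_table.append((node, layer_name, start, end))  # node owns outputs [start, end)
    return routing_table

# e.g. an AlexNet-sized fc6 (4096 outputs) split over two hypothetical FC nodes
for entry in split_fc_layer("fc6", 4096, ["tcp://fc0:9555", "tcp://fc1:9555"]):
    print(entry)
```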

[1] H. Cui et al., "Exploiting Bounded Staleness to Speed Up Big Data Analytics"
[2] A. Krizhevsky, "One Weird Trick for Parallelizing Convolutional Neural Networks"

@cque7 mentioned this pull request Dec 11, 2015
@cque7 (Author) commented Dec 11, 2015

Adding test results with Cifar-full on 5 machines (CPU only):

  • About 4x speed-up with ATLAS and 5 machines connected by a 1 Gbps network.
  • Both the multi-node and single-node runs train for 60,000 iterations.
  • Convergence: with a fixed random seed of 2, the 5-node test accuracy converges at 0.794 while the single node converges at 0.782, so multi-node seems to get around a 1% accuracy gain.
  • The randomness introduced by multi-node training seems to help. E.g. with 5 nodes we reached 0.783 test accuracy at iteration 30,000, while it takes the single node 60,000 iterations to reach 0.782.

[Figure: converge — convergence comparison plot]

@AIROBOTAI commented

@cque7 Does this version support multiple nodes equipped with GPUs? Thanks!

@cque7 (Author) commented Dec 22, 2015

@AIROBOTAI technically it should support GPUs, but we have only tested it on CPUs. The patch is built on top of Caffe solvers and we tried to avoid directly modifying Caffe code, so in principle it should work on GPUs.

@AIROBOTAI commented

@cque7 Wonderful! I may try it on my computers. Also, it would help us a lot if you could list some things to pay attention to during usage. Thanks a lot!

@cque7 (Author) commented Dec 22, 2015

Thanks @AIROBOTAI, here are the steps to run the Cifar test:

  1. Install libzmq and build Caffe. I am using ZeroMQ 4.0.5; CMake isn't supported right now.
  2. Start model server: ./build/tools/model_server
  3. Start parameter server: ./build/tools/param_server
  4. Start FC layers: ./build/tools/fc_server
  5. Start conv client: ./build/tools/conv_client

The training process is started with these five steps (a minimal launch sketch follows).
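
For a quick single-machine smoke test, the four processes from steps 2-5 could be started with a small helper like the sketch below; this is my own convenience script, not part of the patch, and it omits any flags or per-node configuration the binaries may expect.

```python
# Launch the multi-node processes in the order listed above and wait for them.

import subprocess
import time

BINARIES = [
    "./build/tools/model_server",
    "./build/tools/param_server",
    "./build/tools/fc_server",
    "./build/tools/conv_client",
]

procs = []
for binary in BINARIES:
    procs.append(subprocess.Popen([binary]))
    time.sleep(1)  # give each process a moment to bind its ZeroMQ sockets

for proc in procs:
    proc.wait()
```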

If you want to test the trained model, run: ./build/tools/model_test
The model test client pulls parameters from the parameter server and the FC nodes to assemble a single model for tests and snapshots.

Let me know if you run into problems.

@AIROBOTAI commented

@cque7 thanks a lot!

@AIROBOTAI commented Sep 9, 2016

Hi @cypof, I compiled the multi_node branch in github/01org, but I came across a compilation error:

NVCC src/caffe/util/math_functions.cu
nvcc fatal   : Unknown option 'fopenmp'

This error seems to be related to OpenMPI. I downloaded the source code of openmpi-2.0.1 and installed it on my computer, which runs Ubuntu 14.04. Could you help me figure out where the problem is?

@AIROBOTAI commented

Just learned about Caffe-MPI, which was released by Inspur several months ago. It supports multi-GPU/multi-node training and is reported to achieve a 10x speed-up on a cluster with 16 GPUs.

Hope it helps this branch.

@cque7 closed this Nov 18, 2016