Multi node caffe #3441

Closed · wants to merge 76 commits into from
Conversation

@cque7 commented Dec 11, 2015

Moving the multi-node code to github/01org.
This version of the code contains:

  • Tests on Cifar-full using 5 machines show that this multi-node code converges correctly.
    Tests on AlexNet with 54 machines are ongoing.
  • A graph-based model server that can handle more general solver configurations.
  • A parameter server that supports BSP, A-BSP, and SSP [1], configurable by users. With a centralized model server handling scheduling, we were able to build a lightweight PS without extra components such as name nodes.
6 Nov Notes:

One of the main objectives of this patch is to reduce communication cost when scaling Caffe to multiple nodes. Some deep convolutional networks have very large numbers of parameters (e.g. AlexNet has ~220 MB of parameters), and exchanging these parameters with a parameter server is costly in a distributed environment.
The patch is mainly motivated by the fact that, in CNNs, FC layers contain most of the parameters but consume only a small fraction of the time. For example, the author of [2] reports that the FC layers in AlexNet hold ~95% of the parameters but account for only about 5% of the runtime. The optimization opportunity: suppose we have 20 nodes; since the FC layers consume only 5% of the time, we can have a single node run all the FC layers while the remaining 19 nodes run the conv layers. Doing this immediately removes about 95% of the communication caused by parameter exchanges. With more than 20 nodes, we can further split the FC layers into smaller parts, as illustrated below (see also the worked estimate after the figure):
[Figure: model_parallel — splitting the FC layers across dedicated nodes]
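
As a quick sanity check of the 95% figure, here is a back-of-envelope sketch (my own illustration, not code from this patch) using the AlexNet-style numbers quoted above: ~220 MB of parameters, ~95% of them in FC layers, 20 nodes with 1 dedicated FC node.

```python
# Hypothetical helper for estimating per-iteration parameter traffic; the numbers
# below come from the description above, the function itself is an assumption.

def param_traffic_mb(total_params_mb, fc_fraction, num_nodes, num_fc_nodes):
    """Return (plain_data_parallel, conv_fc_split) parameter traffic in MB per iteration."""
    conv_params = total_params_mb * (1.0 - fc_fraction)
    plain = total_params_mb * num_nodes               # every node exchanges the full model
    split = conv_params * (num_nodes - num_fc_nodes)  # conv nodes exchange conv weights only
    return plain, split

plain, split = param_traffic_mb(220, 0.95, 20, 1)
print(f"plain data parallel: {plain:.0f} MB/iter, conv/FC split: {split:.0f} MB/iter")
# -> 4400 MB vs. ~209 MB, i.e. roughly the 95% reduction estimated above.
```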

Another trick we use is called "dynamic pipelines". We maintain a pool of free solvers in each node (the solvers share parameters), and each message (pipeline) in the system is assigned a unique msg_id. When a "forward" packet arrives at a node, the node selects a free solver and associates it with the msg_id in the message. By doing this along the forward flow, we create a virtual pipeline that connects the sub-solvers in different nodes together. In the backward pass, we simply look up the corresponding solver by its msg_id (the msg_id can also be thought of as a pipeline_id). A small sketch of this bookkeeping follows.
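
Below is a minimal sketch of that bookkeeping as I understand it; it is an assumption about the mechanism, not the actual PR code.

```python
# Each node keeps a pool of free solvers that share parameters; an in-flight
# forward pass is bound to its msg_id (pipeline_id) until the matching
# backward message arrives.

class SolverPool:
    def __init__(self, solvers):
        self.free = list(solvers)   # solvers sharing the same parameters
        self.in_flight = {}         # msg_id -> solver currently serving that pipeline

    def on_forward(self, msg_id):
        solver = self.free.pop()             # grab any free solver
        self.in_flight[msg_id] = solver      # remember which solver owns this pipeline
        return solver                        # caller runs the forward pass with it

    def on_backward(self, msg_id):
        solver = self.in_flight.pop(msg_id)  # same solver that ran the forward pass
        self.free.append(solver)             # caller runs backward, then it is free again
        return solver
```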

28 Oct Notes:
- Data parallelism in convolution layers: conv weights are exchanged via the parameter server.
- Model parallelism for fully connected layers: a model server acts as a centralized scheduler that splits big models (e.g. AlexNet / VGG) into smaller ones and generates routing tables to connect the fully connected nodes together (a sketch of such a split follows).
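
For illustration only, here is one way such a split could look; the layout and the node addresses are my assumptions, not the PR's actual protocol.

```python
# Hypothetical sketch of the model server's job for FC layers: split one big FC
# layer's outputs across the FC nodes and record a routing table so conv clients
# know where to send their activations.

def split_fc_layer(layer_name, num_outputs, fc_nodes):
    """Assign contiguous output ranges of an FC layer to FC nodes."""
    per_node = (num_outputs + len(fc_nodes) - 1) // len(fc_nodes)
    routing_table = []
    for i, node in enumerate(fc_nodes):
        start = i * per_node
        end = min(start + per_node, num_outputs)
        routing_table.append((node, layer_name, start, end))  # node owns outputs [start, end)
    return routing_table

# e.g. an AlexNet-sized fc6 (4096 outputs) split over two hypothetical FC nodes
for entry in split_fc_layer("fc6", 4096, ["tcp://fc0:9555", "tcp://fc1:9555"]):
    print(entry)
```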

[1] H. Cui et al., "Exploiting Bounded Staleness to Speed Up Big Data Analytics"
[2] A. Krizhevsky, "One Weird Trick for Parallelizing Convolutional Neural Networks"

@cque7 mentioned this pull request Dec 11, 2015
@cque7 (Author) commented Dec 11, 2015

Adding test results with Cifar-full on 5 machines (CPU only):

  • About 4x speed-up with ATLAS and 5 machines connected by a 1 Gbps network.
  • Both the multi-node and single-node runs train for 60,000 iterations.
  • Convergence: with a fixed random seed of 2, the 5-node test accuracy converges at 0.794 while the single node converges at 0.782, so multi-node seems to get around a 1% accuracy gain.
  • The randomness introduced by multi-node training seems to help. E.g. with 5 nodes we reached 0.783 test accuracy at iteration 30,000, while it takes the single node 60,000 iterations to reach 0.782.

[Figure: converge — convergence comparison plot]

@AIROBOTAI commented

@cque7 Does this version support multiple nodes equipped with GPUs? Thanks!

@cque7 (Author) commented Dec 22, 2015

@AIROBOTAI technically it should support GPUs, but we have only tested it on CPUs. The patch is built on top of Caffe solvers and we tried to avoid directly modifying Caffe code, so in principle it should work on GPUs.

@AIROBOTAI commented

@cque7 Wonderful! I may try it on my computers. Also, it would help us a lot if you could list some things to pay attention to during usage. Thanks a lot!

@cque7 (Author) commented Dec 22, 2015

Thanks @AIROBOTAI, here are the steps to run the Cifar test:

  1. Install libzmq and build Caffe. I am using ZeroMQ 4.0.5; CMake isn't supported right now.
  2. Start model server: ./build/tools/model_server
  3. Start parameter server: ./build/tools/param_server
  4. Start FC layers: ./build/tools/fc_server
  5. Start conv client: ./build/tools/conv_client

The training process is started with these five steps (a minimal launch sketch follows).
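
For a quick single-machine smoke test, the four processes from steps 2-5 could be started with a small helper like the sketch below; this is my own convenience script, not part of the patch, and it omits any flags or per-node configuration the binaries may expect.

```python
# Launch the multi-node processes in the order listed above and wait for them.

import subprocess
import time

BINARIES = [
    "./build/tools/model_server",
    "./build/tools/param_server",
    "./build/tools/fc_server",
    "./build/tools/conv_client",
]

procs = []
for binary in BINARIES:
    procs.append(subprocess.Popen([binary]))
    time.sleep(1)  # give each process a moment to bind its ZeroMQ sockets

for proc in procs:
    proc.wait()
```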

If you want to test the trained model, run: ./build/tools/model_test
The model test client pulls parameters from the parameter server and the FC nodes to assemble a single model for tests and snapshots.

Let me know if you run into problems.

@AIROBOTAI commented

@cque7 thanks a lot!

@AIROBOTAI commented Sep 9, 2016

Hi @cypof, I compiled the multi_node branch in github/01org, but I came across a compilation error:

NVCC src/caffe/util/math_functions.cu
nvcc fatal   : Unknown option 'fopenmp'

This error seems to be related to OpenMPI. I downloaded the source code of openmpi-2.0.1 and installed it on my computer, which runs Ubuntu 14.04. Could you help me figure out where the problem is?

@AIROBOTAI commented

Just learned about Caffe-MPI, which was released by Inspur several months ago. It supports multi-GPU/multi-node training and is reported to achieve a 10x speed-up on a cluster with 16 GPUs.

Hope it helps this branch.

@cque7 closed this Nov 18, 2016