Multi node caffe #3441
Conversation
Adding test results with Cifar-full on 5 machines (CPU only):
[results table not preserved in this extract]
@cque7 Does this version support multiple nodes equipped with GPUs? Thanks!
@AIROBOTAI Technically it should support GPUs, but we have only tested it on CPUs. The patch is built on top of "caffe solvers" and we tried to avoid directly modifying the Caffe code, so it should also work on GPUs.
@cque7 Wonderful! I may get around to trying it on my machines. Also, it would help us a lot if you could list any points that need attention during usage. Thanks a lot!
Thanks @AIROBOTAI, here are the steps to run the Cifar test:
[the five launch steps are not preserved in this extract]
The training process should be started with those 5 steps. If you want to test the trained model, run `./build/tools/model_test`. Let me know if you run into any problems.
@cque7 Thanks a lot!
- Integrate MKL-DNN
- Enable the CMake build system for the multi_node project
- Enable the Make build system
- Fix build warnings
- Add OpenMP to the dropout layer (see the sketch below)
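For context, here is a minimal sketch of what parallelizing the dropout forward pass with OpenMP typically looks like. The function and variable names are illustrative and simplified; this is not the exact code in Caffe's DropoutLayer.

```cpp
// Minimal illustrative sketch (not the exact Caffe code): apply a precomputed
// dropout mask to an activation buffer, parallelizing the element-wise loop
// with OpenMP. Build with -fopenmp.
#include <vector>

void dropout_forward(const std::vector<float>& bottom,        // input activations
                     const std::vector<unsigned int>& mask,   // 0/1 per element
                     float scale,                             // 1 / (1 - dropout_ratio)
                     std::vector<float>& top) {               // output activations
  const int count = static_cast<int>(bottom.size());
  // Every element is independent, so the loop parallelizes trivially.
  #pragma omp parallel for
  for (int i = 0; i < count; ++i) {
    top[i] = bottom[i] * mask[i] * scale;
  }
}
```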
Hi @cypof, I compiled the branch but got an error (the error output is not preserved in this extract). The error seems to be related to OpenMPI. I downloaded the source code of openmpi-2.0.1 and installed it on my machine, which runs Ubuntu 14.04. Could you help me figure out where the problem is?
I just learned about Caffe-MPI, which was released by Inspur several months ago. It supports multi-GPU/multi-node training and is reported to achieve a 10x speed-up on a cluster with 16 GPUs. Hope it helps this branch.
Moving the multi-node code to github/01org.
This version of the code contains:
[the feature list is not preserved in this extract]
Tests on AlexNet with 54 machines are ongoing.
6 Nov Notes:
One of the main objectives of this patch is to reduce communication cost when scaling Caffe to multiple nodes. Some deep convolutional networks have large numbers of parameters (e.g. AlexNet has ~220 MB of parameters), and exchanging these parameters with a parameter server is costly in a distributed environment.
The patch is mainly motivated by the fact that, in CNNs, the fully connected (FC) layers contain most of the parameters but consume only a small fraction of the compute time. For example, the authors of [2] report that the FC layers in AlexNet hold ~95% of the parameters but account for only ~5% of the run time. This creates an optimization opportunity: suppose we have 20 nodes; since the FC layers consume only ~5% of the time, a single node can run all the FC layers while the remaining 19 nodes run the convolutional layers. Doing this immediately removes roughly 95% of the communication caused by parameter exchange. With more than 20 nodes, the FC layers can be further split into smaller parts, as illustrated below:

![model_parallel](https://cloud.githubusercontent.com/assets/11864335/10989459/6def2bca-8482-11e5-9082-be790ccdf70a.png)
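As a rough back-of-the-envelope illustration of that claim (the constants below are the approximate figures quoted above, not measurements from this patch):

```cpp
// Back-of-the-envelope sketch of the communication saving described above.
// The constants are the approximate figures quoted in the text (AlexNet has
// ~220 MB of parameters, ~95% of them in the FC layers); they are not
// measurements from this patch.
#include <cstdio>

int main() {
  const double total_params_mb = 220.0;   // approximate AlexNet parameter size
  const double fc_fraction     = 0.95;    // share of parameters in FC layers
  const int    num_nodes       = 20;

  // Plain data parallelism: every node exchanges all parameters with the
  // parameter server.
  const double per_node_data_parallel = total_params_mb;

  // Hybrid scheme: 1 node owns the FC layers; the other 19 run conv layers
  // and only exchange the conv parameters (~5% of the total).
  const double per_conv_node_hybrid = total_params_mb * (1.0 - fc_fraction);

  std::printf("data parallel: %.1f MB per node per exchange\n",
              per_node_data_parallel);
  std::printf("hybrid scheme: %.1f MB per conv node per exchange (%d conv nodes)\n",
              per_conv_node_hybrid, num_nodes - 1);
  return 0;
}
```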
Another trick we use is what we call "dynamic pipelines". Each node maintains a pool of free solvers (the solvers share parameters), and each message (pipeline) in the system is assigned a unique msg_id. When a "forward" packet arrives at a node, the node picks a free solver and associates it with the msg_id carried by the message. Doing this along the forward flow creates a virtual pipeline by connecting the sub-solvers on different nodes together. In the backward pass, we simply look up the corresponding solver by its msg_id (the msg_id can also be read as a pipeline_id); a minimal sketch of this bookkeeping follows.
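Here is a minimal sketch of the bookkeeping that description implies. The names (SolverPool, Solver, Acquire, Release) are illustrative and not the identifiers used in the patch; real code would also need synchronization and back-pressure handling.

```cpp
// Illustrative sketch of the "dynamic pipeline" bookkeeping described above:
// each node keeps a pool of interchangeable solvers, and in-flight forward
// passes are keyed by msg_id so the matching backward pass finds the same
// solver. Names are illustrative, not taken from the patch.
#include <cstdint>
#include <deque>
#include <unordered_map>

struct Solver { /* shares its weights with the other solvers on this node */ };

class SolverPool {
 public:
  explicit SolverPool(int n) : free_(n, Solver{}) {}

  // Forward flow: bind a free solver to this pipeline (msg_id).
  Solver* Acquire(int64_t msg_id) {
    if (free_.empty()) return nullptr;   // caller queues the message instead
    busy_[msg_id] = free_.front();
    free_.pop_front();
    return &busy_[msg_id];
  }

  // Backward flow: find the solver that handled the forward pass for msg_id.
  Solver* Find(int64_t msg_id) {
    auto it = busy_.find(msg_id);
    return it == busy_.end() ? nullptr : &it->second;
  }

  // After the backward pass completes, return the solver to the free pool.
  void Release(int64_t msg_id) {
    auto it = busy_.find(msg_id);
    if (it != busy_.end()) {
      free_.push_back(it->second);
      busy_.erase(it);
    }
  }

 private:
  std::deque<Solver> free_;                  // solvers not bound to a pipeline
  std::unordered_map<int64_t, Solver> busy_; // msg_id -> solver handling it
};
```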
28 Oct Notes:
- Data parallelism in the convolution layers: conv weights are exchanged via a parameter server.
- Model parallelism for the fully connected layers: a model server acts as a centralized scheduler that splits big models (e.g. AlexNet / VGG) into smaller pieces and generates routing tables to connect the fully connected nodes together (a toy example of such a split is sketched below).
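For intuition, here is a toy sketch of the kind of split such a model server might produce, mapping each FC node to a slice of an FC layer's output neurons. This is purely illustrative; the splitting and routing-table logic in the patch is more involved.

```cpp
// Toy sketch: split one FC layer's output neurons across K FC nodes and
// record a routing table (one [begin, end) output slice per node). Purely
// illustrative; not the actual model-server code from the patch.
#include <cstdio>
#include <vector>

struct Slice { int begin; int end; };  // half-open range of output neurons

std::vector<Slice> SplitFcOutputs(int num_outputs, int num_fc_nodes) {
  std::vector<Slice> routing;
  const int base = num_outputs / num_fc_nodes;
  const int rem  = num_outputs % num_fc_nodes;
  int begin = 0;
  for (int n = 0; n < num_fc_nodes; ++n) {
    const int len = base + (n < rem ? 1 : 0);  // spread the remainder evenly
    routing.push_back({begin, begin + len});
    begin += len;
  }
  return routing;
}

int main() {
  // Example: split a 4096-output FC layer (like AlexNet's fc6) across 4 nodes.
  for (const Slice& s : SplitFcOutputs(4096, 4)) {
    std::printf("[%d, %d)\n", s.begin, s.end);
  }
  return 0;
}
```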
References:
[1] H. Cui et al., "Exploiting Bounded Staleness to Speed Up Big Data Analytics."
[2] A. Krizhevsky, "One Weird Trick for Parallelizing Convolutional Neural Networks."