Multi-node caffe #3252

Closed
wants to merge 2 commits into from

Conversation

cque7

@cque7 cque7 commented Oct 28, 2015

We are working to scale caffe to hundreds of nodes in a cluster. This version of the change contains:
Data parallelism in the convolution layers. Conv. weights are exchanged via a parameter server.
Model parallelism for the fully connected layers. A model server acts as a centralized scheduler that splits big models (e.g. AlexNet / VGG net) into smaller ones and generates routing tables to connect the fully connected nodes together.
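A hypothetical illustration (not the actual code in this PR; the real split policy and routing tables come from the model server) of partitioning an FC layer by output neurons so each part can be placed on a different node:

```python
import numpy as np

def split_fc_layer(weight, bias, num_parts):
    """Partition an FC layer by output neurons; concatenating the parts'
    outputs reproduces the full layer's output."""
    return list(zip(np.array_split(weight, num_parts, axis=0),
                    np.array_split(bias, num_parts)))

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)   # toy FC weights
b = np.zeros(1024, dtype=np.float32)
x = rng.standard_normal(1024).astype(np.float32)           # input activations

parts  = split_fc_layer(W, b, num_parts=4)                  # one part per node
full   = W @ x + b                                          # single-node output
pieces = [Wp @ x + bp for Wp, bp in parts]                  # per-node outputs
assert np.allclose(full, np.concatenate(pieces), rtol=1e-4, atol=1e-4)
```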

@ronghanghu
Member

@cque7 Thanks for the effort! One of our major goals is to coordinate and reconcile different types of parallelism (data / model / hybrid, CPU / GPU), such as #2903 and #2219.

One issue is ensuring data coordination in parallel. Currently Caffe has multiple types of data layers: some load data via random access (the best case for parallelization), others can only do sequential access, and some generate data on the fly. Also, a net can have more than one data layer, and the multiple data layers need to coordinate with each other. This is one of the major difficulties we have with parallelism (achieving the maximum possible scale in an elegant way without losing generality).

I shall try to review this PR after CVPR deadline (11/06/15).

@cque7
Author

cque7 commented Nov 6, 2015

Thanks @ronghanghu for the reply. We look forward to getting comments from the community to make it better.
One of the main objectives of this patch is to reduce communication cost when scaling caffe to multiple nodes. Some deep convolutional networks have large numbers of parameters (e.g. AlexNet has ~220 MB of parameters), and exchanging these parameters with a parameter server is costly in a distributed environment.
The patch is mainly motivated by the fact that, in CNNs, FC layers contain most of the parameters but consume only a small percentage of the time. E.g., the author of [1] reports that the FC layers in AlexNet hold ~95% of the parameters but account for only ~5% of the computation time. The optimization opportunity: suppose we have 20 nodes. Since the FC layers consume only 5% of the time, we can have a single node run all the FC layers and the remaining 19 nodes run the conv. layers. By doing this, we immediately eliminate about 95% of the communication caused by parameter exchanges. When we have more than 20 nodes, we can further split the FC layers into smaller parts; an illustrative picture follows:
[figure: model_parallel]
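As a rough sanity check of the claimed savings (a back-of-the-envelope sketch; the 220 MB and 95%/5% figures come from the comment above, everything else is illustrative):

```python
# Back-of-the-envelope estimate of parameter traffic saved by keeping
# FC layers on a dedicated node (numbers from the discussion above).
total_params_mb = 220.0      # approximate AlexNet parameter size
fc_fraction     = 0.95       # share of parameters living in FC layers
num_nodes       = 20

conv_params_mb = total_params_mb * (1.0 - fc_fraction)   # ~11 MB
fc_params_mb   = total_params_mb * fc_fraction            # ~209 MB

# Pure data parallelism: every node exchanges the full model with the
# parameter server each iteration (counted once per node here).
all_data_parallel_mb = num_nodes * total_params_mb

# Hybrid scheme: only the 19 conv nodes exchange conv weights; the single
# FC node keeps the FC weights locally.
hybrid_mb = (num_nodes - 1) * conv_params_mb

print(f"data-parallel traffic : {all_data_parallel_mb:.0f} MB / iteration")
print(f"hybrid traffic        : {hybrid_mb:.0f} MB / iteration")
print(f"reduction             : {1 - hybrid_mb / all_data_parallel_mb:.0%}")
```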

Another trick we use is called "dynamic pipelines". We maintain a pool of free solvers in each node (the solvers share parameters), and each message (pipeline) in the system is assigned a unique msg_id. When a "forward" packet arrives at a node, the node selects a free solver and associates it with the msg_id in the message. By doing this in the "forward" flow, we create a virtual pipeline that connects the sub-solvers on different nodes together. In the backward pass, we just look up the corresponding solver by its msg_id; the msg_id can also be thought of as a pipeline_id here.
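A minimal sketch of this free-solver bookkeeping, assuming a hypothetical SolverPool wrapper (the class names here are illustrative, not the ones used in the patch):

```python
import queue

class DummySolver:
    """Stand-in for a Caffe sub-solver that shares parameters with its peers."""
    def forward(self, data):  return data          # placeholder computation
    def backward(self, grad): return grad          # placeholder computation

class SolverPool:
    """Pool of free solvers on one node; msg_id doubles as a pipeline id."""
    def __init__(self, solvers):
        self.free = queue.Queue()
        for s in solvers:
            self.free.put(s)
        self.in_flight = {}                        # msg_id -> solver

    def on_forward(self, msg_id, data):
        solver = self.free.get()                   # block until a solver is free
        self.in_flight[msg_id] = solver            # bind solver to this pipeline
        return solver.forward(data)

    def on_backward(self, msg_id, grad):
        solver = self.in_flight.pop(msg_id)        # find solver by msg_id
        result = solver.backward(grad)
        self.free.put(solver)                      # solver is free again
        return result

pool = SolverPool([DummySolver() for _ in range(4)])
pool.on_forward(msg_id=7, data="activations")      # 'forward' packet arrives
pool.on_backward(msg_id=7, grad="gradients")       # matching 'backward' packet
```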

[1] Alex Krizhevsky, "One weird trick for parallelizing convolutional neural networks".

@bhack
Contributor

bhack commented Nov 8, 2015

@cque7 Are you doing this for Intel?

@cque7
Author

cque7 commented Nov 8, 2015

@bhack yes, it's part of our open source effort to make deep learning frameworks such as caffe work well on Intel platforms.

@bhack
Contributor

bhack commented Nov 8, 2015

@cque7 Nvidia, AMD, and Intel are all around here. What do you think of DSLs like the Halide language?

@bhack
Contributor

bhack commented Nov 8, 2015

/cc @longjon Any plans for Halide for single-node vectorization, multi-backend optimization, and scheduling?

@cque7
Author

cque7 commented Nov 9, 2015

@bhack scheduling is an interesting problem; in this patch I try to deal with it at the system level. Each node in the system consists of a polling thread and several worker threads. Worker threads can run in a CPU or GPU context. The polling thread is responsible for scheduling: it dispatches incoming messages to the different worker threads (each thread runs a different solver, and parameters are shared among these threads).

[figure: node_arch]

In nodes that run conv. layers or partial FC layers, the worker threads are only responsible for the forward and backward passes, and there is a parameter thread that communicates with the parameter server and updates the parameters.
When a worker thread finishes a backward pass, it sends a short notification message to the parameter thread. The parameter thread then communicates with the parameter server and updates both the GPU and CPU parameters.
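A rough threading sketch of this node layout (hypothetical names and queues, assuming three worker threads; the real patch uses its own message and solver classes):

```python
import threading, queue, time

incoming    = queue.Queue()                       # messages arriving at this node
work_queues = [queue.Queue() for _ in range(3)]   # one queue per worker thread
notify_q    = queue.Queue()                       # worker -> parameter thread

def polling_loop():
    while True:
        msg = incoming.get()
        # scheduling: hand the message to the least-loaded worker
        min(work_queues, key=lambda q: q.qsize()).put(msg)

def worker_loop(wq, worker_id):
    while True:
        msg = wq.get()
        # ... forward/backward with a solver that shares parameters ...
        notify_q.put((worker_id, msg["msg_id"]))  # short notification only

def parameter_loop():
    while True:
        worker_id, msg_id = notify_q.get()
        # ... push gradients to / pull weights from the parameter server,
        #     then refresh both CPU and GPU parameter copies ...

for target, args in [(polling_loop, ()), (parameter_loop, ())] + \
                    [(worker_loop, (q, i)) for i, q in enumerate(work_queues)]:
    threading.Thread(target=target, args=args, daemon=True).start()

incoming.put({"msg_id": 1, "payload": "forward activations"})
time.sleep(0.1)   # let the toy pipeline drain before the script exits
```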

@bhack
Contributor

bhack commented Nov 9, 2015

@cque7 Interesting. Have you seen the Caffe experiments in http://arxiv.org/abs/1506.08272? At the device level, what are your plans? Rely on OpenCL as in #2610? /cc @cypof @naibaf7

@cque7
Author

cque7 commented Nov 9, 2015

@bhack we are planning to test the proposal on devices like Intel's Xeon E3 servers, which have integrated GPUs. For OpenCL implementations, we are looking at pull requests like #2610; inside Intel we also have some OpenCL kernels that are optimized for Intel's platforms. I think we will test the multi-node implementation on both sets of kernels.

@bhack
Contributor

bhack commented Nov 9, 2015

@cque7 That's why I wondered whether it would be possible to define kernels in a DSL and let vendors optimize for their targets. E.g. Halide can already generate code for CUDA, OpenCL, SSE, NEON, and Android RenderScript, and is starting on an Apple Metal target. OpenCV is also going in a similar direction; see opencv/opencv#5510.

@cque7
Author

cque7 commented Dec 11, 2015

Closing this PR since we are moving it to #3441.

@cque7 cque7 closed this Dec 11, 2015