Multi-node caffe #3252

Closed
wants to merge 2 commits into from

Conversation

cque7

@cque7 cque7 commented Oct 28, 2015

We are working to scale caffe to hundreds of nodes in a cluster. This version of the change contains:
Data parallelism in the convolution layers. Conv. weights are exchanged via a parameter server.
Model parallelism for the fully connected layers. A model server acts as a centralized scheduler that splits big models (e.g. AlexNet / VGG net) into smaller ones and generates routing tables to connect the fully connected nodes together.
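A hypothetical illustration (not the actual code in this PR; the real split policy and routing tables come from the model server) of partitioning an FC layer by output neurons so each part can be placed on a different node:

```python
import numpy as np

def split_fc_layer(weight, bias, num_parts):
    """Partition an FC layer by output neurons; concatenating the parts'
    outputs reproduces the full layer's output."""
    return list(zip(np.array_split(weight, num_parts, axis=0),
                    np.array_split(bias, num_parts)))

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)   # toy FC weights
b = np.zeros(1024, dtype=np.float32)
x = rng.standard_normal(1024).astype(np.float32)           # input activations

parts  = split_fc_layer(W, b, num_parts=4)                  # one part per node
full   = W @ x + b                                          # single-node output
pieces = [Wp @ x + bp for Wp, bp in parts]                  # per-node outputs
assert np.allclose(full, np.concatenate(pieces), rtol=1e-4, atol=1e-4)
```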

@ronghanghu
Member

@cque7 Thanks for the effort! One of our major goals is to coordinate and reconcile different types of parallelism (data / model / hybrid, CPU / GPU), such as #2903 and #2219.

One issue is ensuring data coordination in parallel. Currently Caffe has multiple types of data layers: some load data via random access (the best case for parallelization), others can only do sequential access, and some generate data on the fly. Also, a net can have more than one data layer, and the multiple data layers need to coordinate with each other. This is one of the major difficulties we have with parallelism (achieving the maximum possible scale in an elegant way without losing generality).

I shall try to review this PR after CVPR deadline (11/06/15).

@cque7
Author

cque7 commented Nov 6, 2015

Thanks @ronghanghu for the reply. We look forward to getting comments from the community to make it better.
One of the main objectives of this patch is to reduce communication cost when scaling caffe to multiple nodes. Some deep convolutional networks have large numbers of parameters (e.g. AlexNet has ~220 MB of parameters), and exchanging these parameters with a parameter server is costly in a distributed environment.
The patch is mainly motivated by the fact that, in CNNs, FC layers contain most of the parameters but consume only a small percentage of the time. E.g., the author of [1] reports that the FC layers in AlexNet hold ~95% of the parameters but account for only ~5% of the computation time. The optimization opportunity: suppose we have 20 nodes. Since the FC layers consume only 5% of the time, we can have a single node run all the FC layers and the remaining 19 nodes run the conv. layers. By doing this, we immediately eliminate about 95% of the communication caused by parameter exchanges. When we have more than 20 nodes, we can further split the FC layers into smaller parts; an illustrative picture follows:
[figure: model_parallel]
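As a rough sanity check of the claimed savings (a back-of-the-envelope sketch; the 220 MB and 95%/5% figures come from the comment above, everything else is illustrative):

```python
# Back-of-the-envelope estimate of parameter traffic saved by keeping
# FC layers on a dedicated node (numbers from the discussion above).
total_params_mb = 220.0      # approximate AlexNet parameter size
fc_fraction     = 0.95       # share of parameters living in FC layers
num_nodes       = 20

conv_params_mb = total_params_mb * (1.0 - fc_fraction)   # ~11 MB
fc_params_mb   = total_params_mb * fc_fraction            # ~209 MB

# Pure data parallelism: every node exchanges the full model with the
# parameter server each iteration (counted once per node here).
all_data_parallel_mb = num_nodes * total_params_mb

# Hybrid scheme: only the 19 conv nodes exchange conv weights; the single
# FC node keeps the FC weights locally.
hybrid_mb = (num_nodes - 1) * conv_params_mb

print(f"data-parallel traffic : {all_data_parallel_mb:.0f} MB / iteration")
print(f"hybrid traffic        : {hybrid_mb:.0f} MB / iteration")
print(f"reduction             : {1 - hybrid_mb / all_data_parallel_mb:.0%}")
```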

Another trick we use is called "dynamic pipelines". We maintain a pool of free solvers in each node (the solvers share parameters), and each message (pipeline) in the system is assigned a unique msg_id. When a "forward" packet arrives at a node, the node selects a free solver and associates it with the msg_id in the message. By doing this in the "forward" flow, we create a virtual pipeline that connects the sub-solvers on different nodes together. In the backward pass, we just look up the corresponding solver by its msg_id; the msg_id can also be thought of as a pipeline_id here.
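A minimal sketch of this free-solver bookkeeping, assuming a hypothetical SolverPool wrapper (the class names here are illustrative, not the ones used in the patch):

```python
import queue

class DummySolver:
    """Stand-in for a Caffe sub-solver that shares parameters with its peers."""
    def forward(self, data):  return data          # placeholder computation
    def backward(self, grad): return grad          # placeholder computation

class SolverPool:
    """Pool of free solvers on one node; msg_id doubles as a pipeline id."""
    def __init__(self, solvers):
        self.free = queue.Queue()
        for s in solvers:
            self.free.put(s)
        self.in_flight = {}                        # msg_id -> solver

    def on_forward(self, msg_id, data):
        solver = self.free.get()                   # block until a solver is free
        self.in_flight[msg_id] = solver            # bind solver to this pipeline
        return solver.forward(data)

    def on_backward(self, msg_id, grad):
        solver = self.in_flight.pop(msg_id)        # find solver by msg_id
        result = solver.backward(grad)
        self.free.put(solver)                      # solver is free again
        return result

pool = SolverPool([DummySolver() for _ in range(4)])
pool.on_forward(msg_id=7, data="activations")      # 'forward' packet arrives
pool.on_backward(msg_id=7, grad="gradients")       # matching 'backward' packet
```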

[1] Alex Krizhevsky, "One weird trick for parallelizing convolutional neural networks".

@bhack
Contributor

bhack commented Nov 8, 2015

@cque7 Are you doing this for Intel?

@cque7
Author

cque7 commented Nov 8, 2015

@bhack yes, it's part of our open source effort to make deep learning frameworks such as caffe work well on Intel platforms.

@bhack
Contributor

bhack commented Nov 8, 2015

@cque7 Nvidia, AMD, and Intel are all around here. What do you think of DSLs like the Halide language?

@bhack
Contributor

bhack commented Nov 8, 2015

/cc @longjon Any plans for Halide for single-node vectorization, multi-backend optimization, and scheduling?

@cque7
Author

cque7 commented Nov 9, 2015

@bhack scheduling is an interesting problem; in this patch I try to deal with it at the system level. Each node in the system consists of a polling thread and several worker threads. Worker threads can run in a CPU or GPU context. The polling thread is responsible for scheduling: it dispatches incoming messages to the different worker threads (each thread runs a different solver, and parameters are shared among these threads).

[figure: node_arch]

In nodes that run conv. layers or partial FC layers, the worker threads are only responsible for the forward and backward passes, and there is a parameter thread that communicates with the parameter server and updates the parameters.
When a worker thread finishes a backward pass, it sends a short notification message to the parameter thread. The parameter thread then communicates with the parameter server and updates both the GPU and CPU parameters.
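A rough threading sketch of this node layout (hypothetical names and queues, assuming three worker threads; the real patch uses its own message and solver classes):

```python
import threading, queue, time

incoming    = queue.Queue()                       # messages arriving at this node
work_queues = [queue.Queue() for _ in range(3)]   # one queue per worker thread
notify_q    = queue.Queue()                       # worker -> parameter thread

def polling_loop():
    while True:
        msg = incoming.get()
        # scheduling: hand the message to the least-loaded worker
        min(work_queues, key=lambda q: q.qsize()).put(msg)

def worker_loop(wq, worker_id):
    while True:
        msg = wq.get()
        # ... forward/backward with a solver that shares parameters ...
        notify_q.put((worker_id, msg["msg_id"]))  # short notification only

def parameter_loop():
    while True:
        worker_id, msg_id = notify_q.get()
        # ... push gradients to / pull weights from the parameter server,
        #     then refresh both CPU and GPU parameter copies ...

for target, args in [(polling_loop, ()), (parameter_loop, ())] + \
                    [(worker_loop, (q, i)) for i, q in enumerate(work_queues)]:
    threading.Thread(target=target, args=args, daemon=True).start()

incoming.put({"msg_id": 1, "payload": "forward activations"})
time.sleep(0.1)   # let the toy pipeline drain before the script exits
```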

@bhack
Contributor

bhack commented Nov 9, 2015

@cque7 Interesting. Have you seen the Caffe experiments in http://arxiv.org/abs/1506.08272? At the device level, what are your plans? Rely on OpenCL as in #2610? /cc @cypof @naibaf7

@cque7
Author

cque7 commented Nov 9, 2015

@bhack we are planning to test the proposal on devices like Intel's Xeon E3 servers, which have integrated GPUs. For OpenCL implementations, we are looking at pull requests like #2610; inside Intel we also have some OpenCL kernels that are optimized for Intel's platforms. I think we will test the multi-node implementation on both sets of kernels.

@bhack
Contributor

bhack commented Nov 9, 2015

@cque7 That's why I wondered whether it would be possible to define kernels in a DSL and let vendors optimize for their targets. E.g. Halide can already generate code for CUDA, OpenCL, SSE, NEON, and Android RenderScript, and is starting on an Apple Metal target. OpenCV is also going in a similar direction; see opencv/opencv#5510.

@cque7
Author

cque7 commented Dec 11, 2015

Closing this PR since we are moving it to #3441.

@cque7 cque7 closed this Dec 11, 2015