Multigpu Feature #3769

Merged 14 commits into PaddlePaddle:develop on Dec 14, 2017

Conversation

@dzhwinter (Contributor) commented Aug 30, 2017

here is better for review.
fix #3651

@dzhwinter changed the title from "Multigpu" to "Multigpu Feature" on Aug 30, 2017

- GPU Model Parallelism

every GPU has `1/n` of the training data and only holds part of the complete model in GPU memory.

@helinwang (Contributor) commented Sep 4, 2017

For model parallelism, it looks like all of the data stays on one GPU? The model is split into n parts, but in most cases won't the input layer end up on a single GPU?

@dzhwinter (Author):

you are right. fixed.


Besides, it needs interfaces to synchronize model updates with each other, and to issue/merge the model from different GPU cards.

## Implement

Contributor:

Implement -> Implementation

@dzhwinter (Author):

done.


Operators are added to the sub-graphs. Every GPU is assigned a role of `rank0`, `rank1`, etc.

- **Broadcast**. The Broadcast operator distributes the initialized parameters to all the GPUs from the GPU that owns them, e.g. from the `rank0` GPU.

Contributor:

These two operators are part of the graph; please draw the dependency more clearly.
If the dependency is clear, the reader should be able to understand what the target of graph initialization is, and what the target of each training step is.

@dzhwinter (Author):

done.
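
For illustration, a minimal sketch of the Broadcast step described above, written directly against the NCCL C API rather than as the actual PaddlePaddle operator (the device count, parameter size, and variable names here are assumptions):

```cpp
// Sketch: broadcast the initialized parameter from the rank0 GPU to all GPUs
// with the NCCL ncclBcast primitive (one communicator per device, driven from
// a single process).
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  const int n_gpus = 4;                 // assumed number of devices
  const size_t count = 1 << 20;         // assumed parameter size (floats)
  std::vector<int> devs = {0, 1, 2, 3};

  std::vector<ncclComm_t> comms(n_gpus);
  ncclCommInitAll(comms.data(), n_gpus, devs.data());

  std::vector<float*> param(n_gpus);
  std::vector<cudaStream_t> streams(n_gpus);
  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(devs[i]);
    cudaMalloc(&param[i], count * sizeof(float));
    cudaStreamCreate(&streams[i]);
    // The rank0 buffer is assumed to already hold the initialized values.
  }

  // One ncclBcast call per device inside a group; root = 0 means the rank0
  // GPU owns the source copy.
  ncclGroupStart();
  for (int i = 0; i < n_gpus; ++i) {
    ncclBcast(param[i], count, ncclFloat, /*root=*/0, comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(devs[i]);
    cudaStreamSynchronize(streams[i]);
  }
  return 0;
}
```

In the design above this call sequence is wrapped as a Broadcast operator inserted into each GPU's sub-graph, so the user never calls NCCL directly.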


These two operators need the Multi-GPU context support.

Need to notice that Allreduce operator force GPUs synchronized at that point. Every device only need runs sub-graph in a loop style forever, the whole training process in synchronizing or synchronize style depends on the Allreduce point in the graph.

Contributor:

Don't quite understand what "synchronizing or synchronize style" means :)

@dzhwinter (Author):

typo fixed.


Note that the Allreduce operator forces the GPUs to synchronize at that point. Every device only needs to run its sub-graph in a loop forever; whether the whole training process is asynchronous or synchronous depends on where the Allreduce point sits in the graph.

For the simplest implementation, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each GPU runs the optimization step individually and applies the gradient to its own `W`.

Contributor:

Move the "Implementation" section here?

@dzhwinter (Author):

I don't think so. The graph converter is also part of our implementation. To stay consistent with dist_train.md, we put it in an independent paragraph to make the document clearer.
I changed the first sentence to avoid ambiguity.

Contributor:

Got it.
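
To make the quoted AllReduce step concrete, below is a rough sketch of one data-parallel iteration: each card computes a local `dW` on its `1/n` of the batch, `ncclAllReduce` sums `dW` so every card holds the full-batch gradient, and then every card applies the update to its own copy of `W`. This uses raw NCCL/cuBLAS calls, not the actual operators; the device ids `0..n-1`, the SGD update, and all names are assumptions:

```cpp
// Sketch of one synchronous data-parallel training step across n GPUs.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

void data_parallel_step(const std::vector<ncclComm_t>& comms,
                        const std::vector<cudaStream_t>& streams,
                        const std::vector<cublasHandle_t>& blas,
                        const std::vector<float*>& W,   // per-GPU parameter copy
                        const std::vector<float*>& dW,  // per-GPU local gradient
                        size_t count, float lr) {
  const int n_gpus = static_cast<int>(comms.size());

  // 1. Sum the local gradients in place: afterwards every dW[i] holds the
  //    gradient of the full batch. This is the synchronization point that the
  //    AllReduce operator introduces.
  ncclGroupStart();
  for (int i = 0; i < n_gpus; ++i) {
    ncclAllReduce(dW[i], dW[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();

  // 2. Every GPU runs the optimizer individually: W[i] += (-lr) * dW[i].
  const float alpha = -lr;
  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(i);
    cublasSetStream(blas[i], streams[i]);
    cublasSaxpy(blas[i], static_cast<int>(count), &alpha, dW[i], 1, W[i], 1);
  }

  for (int i = 0; i < n_gpus; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
  }
}
```

Every card ends up doing the same optimizer work here, which is exactly the redundancy the per-parameter root-card variant discussed later in this thread tries to remove.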


*Broadcast and AllReduce on a single machine; Broadcast, AllReduce, Send, and Recv across multiple machines*

<img src="images/multigpu_before_convert.png" width="300"/>

Contributor:

Picture missing?

@dzhwinter (Author):

That's weird. Fixed.


2. Control operators between GPUs will be inserted into the graph.

*Broadcast and AllReduce on a single machine; Broadcast, AllReduce, Send, and Recv across multiple machines*

Contributor:

Since you mentioned "Send, Recv", can you please add a reference link to these design docs?

@dzhwinter (Author):

Thanks for the reminder! Done.

### Benefits

- The optimize sub-graph can easily be moved to a parameter server, so the multi-GPU feature remains compatible with the distributed training design.
- It is easy to plug in the NCCL2 library.

Contributor:

Reference the NCCL library URL, please.

@dzhwinter (Author):

Done.

@helinwang (Contributor) commented Sep 13, 2017

  • We need a "To Be Decided" section describing "Explicit between send / recv vs. implicit copy on use".

  • At the beginning of training, the framework needs to issue the same sub-graph to every GPU in Data Parallelism, or different sub-graph in Model Parallelism.

    "same sub-graph" is not necessarily true here. Maybe change to "At the beginning of training, The framework will issue a sub-graph to every GPU"

  • and issue/merge model from

    what does "issue" mean?

  • These two operators need the Multi-GPU context support.

    Do we want to allow an OP to stay on different devices?

  • Every device only need runs sub-graph in a loop style forever

    This depends on what the user's eval target is: if it is a while OP, it will loop forever; otherwise it will not.

  • In fact, in the way of every GPU optimized full batch of data, wasted (n-1) GPU compute resources. We will enhance it in the next stage.

    Do we need to enhance it? I think it wastes computation, but saves one round of communication.

@typhoonzero (Contributor):

Same questions as @helinwang:

We mentioned both GPU data parallelism and model parallelism, and it seems that we are going to implement GPU data parallelism first. Should we point this out?

@dzhwinter (Author):

It should be an NCCL-based design doc only. Thank you for the review, guys!

@dzhwinter (Author):

The confusing part about data parallelism vs. model parallelism has been removed, and the Allreduce section details have been added.


As shown in the picture, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each GPU runs the optimization step individually and applies the gradient to its own `W`.

- **AllReduce2**

Contributor:

I think we need to decide on one all-reduce OP. Supporting two different OPs for the same purpose is just too much labor.

I am leaning more towards implementing our own AllReduce, since AllReduce2 adds one more dependency, NCCL2, and NCCL2 is closed source.

@dzhwinter (Author):

AllReduce2 is a composed operator written by hand. We only use the Reduce operator to implement AllReduce2.

We have already switched to NCCL2 in Paddle, so it is not an additional dependency.

Contributor:

I see, thanks.

Since there is already AllReduce, do we need another AllReduce2? For the reasons mentioned above.

@dzhwinter (Author):

Yeah, we only need AllReduce2, actually. I wrote down AllReduce2 just to avoid people confusing it with the NCCL built-in AllReduce.

Should I remove the AllReduce description and keep only AllReduce2?

@helinwang (Contributor) commented Dec 12, 2017

What do you think about calling it AllReduce? It's a PaddlePaddle OP, and there is no AllReduce1, so we probably should not name it AllReduce2.

@dzhwinter (Author):

Done.

- **AllReduce2**
If we use the NCCL2 AllReduce primitive, every GPU runs the optimizer on the full batch's gradients, wasting (n-1) GPUs' worth of compute. In addition, AllReduce only uses the communication resources during synchronization, and the gradient update becomes a separate phase. In fact, we can amortize the gradient-update cost into the communication phase.
- Every parameter has its root card. That card calls the **Reduce** operator to collect the gradients from the GPUs.
- The whole model's parameters are hashed to different root cards to ensure load balance between the GPUs.

Contributor:

Just a personal question: should we add a device_id field to Var in the protobuf, or would NCCL handle this by itself?

@dzhwinter (Author):

It's still a controversial topic in our design, and it's not determined by NCCL, so we can leave that discussion to the multi-device parallelism topic.


- **AllReduce**
Note that our AllReduce operator is a ring-based AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU runs the optimizer on the full batch's gradients, wasting (n-1) GPUs' worth of compute. In addition, the NCCL2 built-in AllReduce only uses the communication resources during synchronization, and the gradient update becomes a subsequent phase. In fact, we can amortize the gradient-update cost into the communication phase. The process is:
1. Every parameter has its root card. That card is responsible for aggregating the gradients from the GPUs.

Contributor:

Maybe we could describe how the parameters are distributed (round-robin, hash, or user-specified)?

@dzhwinter (Author):

No, that's another problem, coupled with parallel.do; @tonyyang-svail is working on it.
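
As an illustration of the root-card scheme described in this thread, here is a rough sketch (raw NCCL again, with a hypothetical hash-based assignment rather than PaddlePaddle's real placement logic): each parameter's gradients are reduced onto its root GPU, the root applies the update, and the updated value is broadcast back to the other cards.

```cpp
// Sketch of the per-parameter root-card scheme: hash each parameter to a root
// GPU, ncclReduce its gradients onto that GPU, update there, then ncclBcast
// the new value back. Types, names, and the hash choice are assumptions.
#include <cuda_runtime.h>
#include <nccl.h>
#include <functional>
#include <string>
#include <vector>

struct ParamShard {
  std::string name;
  std::vector<float*> W;   // per-GPU copies of the parameter
  std::vector<float*> dW;  // per-GPU local gradients
  size_t count;
};

// Hash each parameter name to a root card so the update work is spread
// (load-balanced) across the GPUs.
int root_card(const std::string& name, int n_gpus) {
  return static_cast<int>(std::hash<std::string>()(name) % n_gpus);
}

void reduce_update_broadcast(const std::vector<ncclComm_t>& comms,
                             const std::vector<cudaStream_t>& streams,
                             std::vector<ParamShard>& params) {
  const int n_gpus = static_cast<int>(comms.size());
  for (auto& p : params) {
    const int root = root_card(p.name, n_gpus);

    // 1. Collect (sum) the gradients on the root card only.
    ncclGroupStart();
    for (int i = 0; i < n_gpus; ++i) {
      ncclReduce(p.dW[i], p.dW[root], p.count, ncclFloat, ncclSum, root,
                 comms[i], streams[i]);
    }
    ncclGroupEnd();

    // 2. Only the root card runs the optimizer, e.g. W[root] -= lr * dW[root]
    //    on streams[root] (the update kernel is omitted in this sketch).

    // 3. Broadcast the updated parameter from the root back to every GPU.
    ncclGroupStart();
    for (int i = 0; i < n_gpus; ++i) {
      ncclBcast(p.W[i], p.count, ncclFloat, root, comms[i], streams[i]);
    }
    ncclGroupEnd();
  }
}
```

Compared with the plain AllReduce version, the optimizer runs once per parameter instead of once per GPU, and the update work is interleaved with the communication for the other parameters.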


## Motivation

NCCL is an NVIDIA library that supports multi-GPU communication and is optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, which can achieve high bandwidth over PCIe and NVLink high-speed interconnects. [NCCL](https://developer.nvidia.com/nccl). With the NCCL library, we can easily accelerate training in parallel.

Contributor:

... PCIe and NVLink high-speed interconnects. NCCL. With the NCCL library, we can easily accelerate training in parallel.

Maybe move the link to the front of the sentence?

NCCL is an NVIDIA library that supports multi-GPU communication and is optimized for NVIDIA GPUs.

@dzhwinter (Author):

Done.


### Graph Converter

To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user-defined operation graph into sub-graphs to be executed on different devices.

Member:

graph converter => transpiler

@dzhwinter (Author):

Done.
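
As a very rough illustration of what such a transpiler pass could do, the sketch below clones the user's ops onto every device, keeps parameter initialization only on `rank0` followed by a Broadcast, and appends an AllReduce after every gradient op. The `OpDesc`/`GraphDesc` structs are hypothetical stand-ins, not PaddlePaddle's real ProgramDesc/transpiler API:

```cpp
// Hypothetical transpiler sketch: convert a single-device op list into n
// per-GPU sub-graphs with Broadcast/AllReduce control operators inserted.
#include <string>
#include <vector>

struct OpDesc {
  std::string type;                 // e.g. "fc", "fc_grad", "sgd", "broadcast"
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  bool is_param_init;               // fills a parameter, e.g. "uniform_random"
  bool is_grad;                     // produces a gradient such as "W@GRAD"
};

struct GraphDesc {
  int device_id;
  std::vector<OpDesc> ops;
};

std::vector<GraphDesc> transpile(const std::vector<OpDesc>& graph, int n_gpus) {
  std::vector<GraphDesc> sub_graphs(n_gpus);
  for (int dev = 0; dev < n_gpus; ++dev) {
    sub_graphs[dev].device_id = dev;
    for (const OpDesc& op : graph) {
      if (op.is_param_init) {
        // Only rank0 really initializes; every device then takes part in the
        // collective Broadcast that distributes the value from rank0.
        if (dev == 0) sub_graphs[dev].ops.push_back(op);
        sub_graphs[dev].ops.push_back(
            {"broadcast", op.outputs, op.outputs, false, false});
        continue;
      }
      sub_graphs[dev].ops.push_back(op);
      if (op.is_grad) {
        // Synchronize the gradient across devices right after it is produced,
        // so each card optimizes with the full-batch gradient.
        sub_graphs[dev].ops.push_back(
            {"allreduce", op.outputs, op.outputs, false, false});
      }
    }
  }
  return sub_graphs;
}
```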

As shown in the picture, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each GPU runs the optimization step individually and applies the gradient to its own `W`.

- **AllReduce**
Note that our AllReduce operator is a ring-based AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU runs the optimizer on the full batch's gradients, wasting (n-1) GPUs' worth of compute. In addition, the NCCL2 built-in AllReduce only uses the communication resources during synchronization, and the gradient update becomes a subsequent phase. In fact, we can amortize the gradient-update cost into the communication phase. The process is:

@jacquesqiao (Member) commented Dec 14, 2017

NCCL2 also supports ring-based AllReduce; see https://github.com/PaddlePaddle/Paddle/wiki/NCCL2-Survey

@dzhwinter (Author):

It's not the same thing. What we need is not just a ring-based AllReduce: NCCL2's AllReduce only supports simple operations like sum and max, and we need to do our own optimization inside it.

@Yancey1989 (Contributor) left a comment

LGTM, and maybe @helinwang would like to review this PR again.

@helinwang merged commit c52a0bd into PaddlePaddle:develop on Dec 14, 2017