Multigpu Feature #3769
Conversation
paddle/framework/multigpu.md
Outdated
- GPU Model Parallelism

  every GPU have `1/n` part of training data, and only have part of a complete model in GPU memory.
For model parallelism, it seems that all the data stays on a single GPU? The model is split into n parts, and in most cases the input layer would end up entirely on one GPU?
you are right. fixed.
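For illustration, here is a minimal Python sketch of the distinction discussed above (the shapes, layer names, and dict layout are all made up for this example): under data parallelism every GPU holds a full copy of the model but only `1/n` of the batch, while under model parallelism every GPU holds only a slice of the layers and the data is not split.

```python
import numpy as np

n_gpus = 4
batch = np.random.rand(128, 32)           # full mini-batch: 128 samples, 32 features
layers = ["fc1", "fc2", "fc3", "fc4"]     # stand-ins for the model's layers

# Data parallelism: every GPU holds the whole model but only 1/n of the batch.
data_parallel = {
    gpu: {"data": np.array_split(batch, n_gpus)[gpu], "layers": list(layers)}
    for gpu in range(n_gpus)
}

# Model parallelism: every GPU holds only 1/n of the layers; activations (and
# usually the whole input batch) flow between GPUs instead of being split.
model_parallel = {
    gpu: {"data": batch, "layers": [layers[gpu]]}
    for gpu in range(n_gpus)
}

for gpu in range(n_gpus):
    print(gpu, data_parallel[gpu]["data"].shape, model_parallel[gpu]["layers"])
```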
paddle/framework/multigpu.md
Outdated
Besides, it needs interfaces to synchronize model update with each other, and issue/merge model from different GPU Cards.

## Implement
Implement -> Implementation
done.
paddle/framework/multigpu.md
Outdated
Operators are added to the sub-graphs. Every GPU assigned a role of `rank0`, `rank1` etc.

- **Broadcast**. Broadcast operator distribute initialized parameter to all the GPUs from the GPU who owns it. e.g. from `rank0` GPU.
These two operators are part of the graph, please draw the dependency more clearly.
If the dependency is clear, reader should be able to understand what is the target for the graph initialization, and what is the target for each training step.
done.
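As a rough sketch of the dependency being asked about, using a hypothetical list-of-dicts sub-graph description (the op names, variable names, and two-target split are illustrative assumptions, not the real PaddlePaddle graph format): the graph-initialization target depends only on the `Broadcast` of `rank0`'s parameters, while the per-step training target depends on the gradient `AllReduce`.

```python
# Hypothetical sub-graph for one GPU rank; everything here is illustrative only.
def build_subgraph(rank, n_gpus):
    # Target for graph initialization: parameters are broadcast from rank0.
    init_target = [
        {"op": "Broadcast", "root": 0, "inputs": ["W"], "outputs": ["W"]},
    ]
    # Target for each training step: local compute, then a synchronizing AllReduce.
    step_target = [
        {"op": "FC",        "inputs": ["data", "W"], "outputs": ["loss"]},
        {"op": "FCGrad",    "inputs": ["loss", "W"], "outputs": ["dW"]},
        {"op": "AllReduce", "inputs": ["dW"],        "outputs": ["dW"]},
        {"op": "SGD",       "inputs": ["W", "dW"],   "outputs": ["W"]},
    ]
    return init_target, step_target

init, step = build_subgraph(rank=1, n_gpus=4)
```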
paddle/framework/multigpu.md
Outdated
These two operators need the Multi-GPU context support.

Need to notice that Allreduce operator force GPUs synchronized at that point. Every device only need runs sub-graph in a loop style forever, the whole training process in synchronizing or synchronize style depends on the Allreduce point in the graph.
Don't quite understand what does "synchronizing or synchronize style" mean :)
typo fixed.
paddle/framework/multigpu.md
Outdated
Need to notice that Allreduce operator force GPUs synchronized at that point. Every device only need runs sub-graph in a loop style forever, the whole training process in asynchronous or synchronous mode depends on the Allreduce point in the graph.

For the simplest implement, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.
Move the "Implementation" section here?
I don't think so. The graph converter is also part of our implementation. To be consistent with dist_train.md, we put it in an independent paragraph to make the document clearer.
I changed the first sentence to avoid ambiguity.
Got it.
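As a purely illustrative numpy simulation of that simple scheme, with a made-up linear least-squares loss: each simulated GPU computes `dW` on its `1/n` shard, the all-reduce sums the shard gradients into the full-batch gradient, and every GPU then runs the optimizer step on its own copy of `W`, so all copies stay identical.

```python
import numpy as np

n_gpus, lr = 4, 0.1
np.random.seed(0)
X = np.random.rand(64, 8)                      # full batch
y = np.random.rand(64, 1)
W = [np.zeros((8, 1)) for _ in range(n_gpus)]  # every GPU keeps its own copy of W

shards_X = np.array_split(X, n_gpus)
shards_y = np.array_split(y, n_gpus)

# 1. Each "GPU" computes the gradient of a linear least-squares loss on its shard.
dW_local = [
    2.0 * shards_X[i].T @ (shards_X[i] @ W[i] - shards_y[i]) / len(X)
    for i in range(n_gpus)
]

# 2. AllReduce: sum the shard gradients so every GPU holds the full-batch dW.
dW_full = sum(dW_local)

# 3. Every GPU applies the optimizer step to its own W; all copies stay identical.
W = [w - lr * dW_full for w in W]
assert all(np.allclose(W[0], w) for w in W)
```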
paddle/framework/multigpu.md
Outdated
*Broadcast, AllReduce in a single machine. And Broadcast, AllReduce, Send, Recv in multiple machines*

<img src="images/multigpu_before_convert.png" width="300"/>
Picture missing?
It's so weird. Fixed.
paddle/framework/multigpu.md
Outdated
2. Control operators between GPUs will be inserted into the graph.

*Broadcast, AllReduce in a single machine. And Broadcast, AllReduce, Send, Recv in multiple machines*
Since you mentioned "Send, Recv" can you please add a reference link to these design docs?
Thanks for the reminder! Done.
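As a sketch of the kind of pass the converter performs, over a toy list-of-dicts program (the op names and graph format are made up, not the real PaddlePaddle IR): walk the single-device op list and insert an `AllReduce` control operator after every gradient op; in the multi-machine case the same kind of pass would insert `Send`/`Recv` pairs.

```python
# Illustrative transpiler-style pass over a toy op list (not the real graph IR):
# insert an AllReduce after every gradient op so each rank's sub-graph
# synchronizes gradients before the optimizer runs.
def insert_control_ops(ops):
    converted = []
    for op in ops:
        converted.append(op)
        if op["type"].endswith("_grad"):
            grad_var = op["outputs"][0]
            converted.append({
                "type": "allreduce",      # sums grad_var across all GPUs, in place
                "inputs": [grad_var],
                "outputs": [grad_var],
            })
    return converted

single_device_ops = [
    {"type": "fc",      "inputs": ["x", "W"],      "outputs": ["out"]},
    {"type": "fc_grad", "inputs": ["out", "W"],    "outputs": ["W_grad"]},
    {"type": "sgd",     "inputs": ["W", "W_grad"], "outputs": ["W"]},
]
for op in insert_control_ops(single_device_ops):
    print(op["type"], op["inputs"], "->", op["outputs"])
```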
paddle/framework/multigpu.md
Outdated
### Benefits

- can easily move the optimize sub-graph to parameter server, multi-GPU feature can be compatible with distributed support design.
- easily plug-in with NCCL2 library.
Reference the NCCL library URL, please.
Done.
Same questions as @helinwang: we mentioned both GPU data parallelism and model parallelism, and it seems that we are going to implement GPU data parallelism first. Should we point this out?
It should be an NCCL-based design doc only. Thank you for the review, guys!
The confusing part about data parallelism and model parallelism has been removed, and the AllReduce section details have been added.
doc/design/paddle_nccl.md
Outdated
As it shown in the picture, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.

- **AllReduce2**
I think we need to decide on one AllReduce OP; supporting two different OPs for the same purpose is just too much labor.
I am more leaning towards implementing our own `AllReduce`, since `AllReduce2` adds one more dependency: NCCL2, and NCCL2 is closed sourced.
`AllReduce2` is a composed operator written by hand. We only use the `Reduce` operator to implement `AllReduce2`.
We have already changed to NCCL2 in Paddle, so it is not one more dependency.
I see, thanks.
Since there is already `AllReduce`, do we need another `AllReduce2`? For the reasons mentioned above.
Yeah, we only need `AllReduce2`, actually. I wrote down `AllReduce2` just to avoid people confusing it with the NCCL built-in `AllReduce`.
Should I remove the `AllReduce` description and leave only `AllReduce2`?
What do you think about calling it AllReduce? It's a PaddlePaddle OP, and there is no AllReduce1, so we probably should not name it AllReduce2.
Done.
doc/design/paddle_nccl.md
Outdated
- **AllReduce2**
  If we use the NCCL2 AllReduce primitive, every GPU optimized full batch of data, wasted (n-1) GPU compute resources. In addition, AllReduce will only utilize the communicate resource during synchronization, then update the gradient will be a seperated phase. In fact, we can amortize the update gradient time cost into the communicating phase.
  - Every parameter has its root card. That card will call **Reduce** operator and collect the gradients from GPUs.
  - The whole model's parameter will be hashed to different root card, ensure the load balance between GPUs.
Just a personal question: should we add a `device_id` field in `Var` in the protobuf, or would NCCL do this by itself?
It's still a controversial topic in our design, and it's not determined by NCCL. So we can leave that discussion to the parallel-with-multi-device topic.
- **AllReduce**
  Need to note that our AllReduce operator is a ring-base AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU optimized full batch of data, wasted (n-1) GPU compute resources. In addition, NCCL2 built-in AllReduce will only utilize the communicating resource during synchronization, then update the gradient will be a subsequent phase. In fact, we can amortize the update gradient time cost into the communicating phase. The process is
  1. Every parameter has its root card. That card will responsible for aggregating the gradients from GPUs.
Maybe we could introduce how to distribute the parameters (round-robin, hash, or user-specified)?
No, that's another problem, coupled with `parallel.do`; @tonyyang-svail is working on it.
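A small numpy simulation of the composed, Reduce-based AllReduce described in the quoted text, under one possible reading of it (the parameter names, toy hash, and fused update are illustrative assumptions): each parameter is assigned a root card, the root card sums the gradients from all GPUs with `Reduce`, and the optimizer update can be applied there before the result is broadcast back, which is how the update cost gets amortized into the communication phase.

```python
import numpy as np

n_gpus, lr = 4, 0.1
params = {"fc1.W": np.ones(4), "fc1.b": np.ones(1), "fc2.W": np.ones(4)}

# Each parameter gets a root card via a toy deterministic hash of its name,
# spreading the reduction work across GPUs (the load-balance point above).
root_of = {name: sum(map(ord, name)) % n_gpus for name in params}

# Simulated per-GPU gradients for every parameter.
grads = {name: [np.full_like(p, rank + 1.0) for rank in range(n_gpus)]
         for name, p in params.items()}

updated = {}
for name, p in params.items():
    # Reduce: the root card collects and sums the gradients from all GPUs ...
    dW = sum(grads[name])
    # ... and applies the optimizer update right there, before broadcasting the
    # new parameter value back to the other cards.
    updated[name] = p - lr * dW

for name in params:
    print(name, "root card:", root_of[name], "updated:", updated[name])
```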
doc/design/paddle_nccl.md
Outdated
## Motivation

NCCL is a NVIDIA library support Multi-GPU communicating and optimized for NVIDIA GPUs, it provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter, that can achieve high bandwidth over PCIe and NVLink high-speed interconnect. [NCCL](https://developer.nvidia.com/nccl). With NCCL library, we can easily accelerate the training in parallel.
Done.
doc/design/paddle_nccl.md
Outdated
### Graph Converter

To be compatible with [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user defined operation graph into sub-graphs to be executed on different devices.
graph converter => transpiler
Done.
As it shown in the picture, when each GPU compute the gradient of `W`, followed with a `AllReduce` operator, accumulate the `dW` to full batch of data, then run the optimize process individually and apply the gradient to its `W`.

- **AllReduce**
  Need to note that our AllReduce operator is a ring-base AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU optimized full batch of data, wasted (n-1) GPU compute resources. In addition, NCCL2 built-in AllReduce will only utilize the communicating resource during synchronization, then update the gradient will be a subsequent phase. In fact, we can amortize the update gradient time cost into the communicating phase. The process is
NCCL2 also supports ring-based AllReduce; see https://github.com/PaddlePaddle/Paddle/wiki/NCCL2-Survey
This is not quite the same. What we need is not only a ring-based AllReduce: the NCCL2 AllReduce only supports simple operations such as sum and max, and we need to do our own optimization inside it.
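For reference, a tiny numpy simulation of the ring pattern being discussed (purely illustrative, not the operator's actual implementation): in the reduce-scatter phase each of the `n` ranks ends up owning the sum of one chunk, and in the all-gather phase the summed chunks travel around the ring until every rank holds the full reduced tensor.

```python
import numpy as np

def ring_allreduce(tensors):
    """Toy simulation of a ring AllReduce over equal-length per-rank tensors."""
    n = len(tensors)
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1: reduce-scatter. After n-1 steps, rank r owns the full sum of
    # chunk (r + 1) % n. Sends are snapshotted to mimic simultaneous transfers.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy()) for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] += data

    # Phase 2: all-gather. The reduced chunks travel once more around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy()) for r in range(n)]
        for r, idx, data in sends:
            chunks[(r + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

grads = [np.arange(8.0) * (rank + 1) for rank in range(4)]
out = ring_allreduce(grads)
assert all(np.allclose(out[0], o) for o in out)     # every rank has the same result
assert np.allclose(out[0], sum(grads))              # and it equals the summed gradients
```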
LGTM, and maybe @helinwang would review this PR again.
here is better for review.
fix #3651