Add Fluid Compiler design doc #7178
@@ -0,0 +1,110 @@
# PaddlePaddle Fluid: Towards a Compiled Programming Language

As described in [fluid.md](fluid.md), when a Fluid application program
runs, it generates a `ProgramDesc` protobuf message as an intermediate
representation of itself. The C++ class `Executor` can run this
protobuf message as an interpreter. This article describes the Fluid
compiler.

![](fluid-compiler.png)

> Review comment: Do we need

## ProgramDesc

Before we go deeper into the idea of a compiled language, let us take a
look at a simple example Fluid application.

```python
import fluid  # illustrative module name, kept from the original example


def paddlepaddle():
    X = fluid.read(...)    # read operator: produces X
    W = fluid.Tensor(...)  # creates W; shows up as `assign` in the ProgramDesc
    Y = fluid.mult(X, W)   # mult operator: multiplies X by W
```

This program consists of a [block](block.md) of three operators:
`read`, `assign`, and `mult`. Its `ProgramDesc` message looks like
the following:

```protobuf
message ProgramDesc {
  block[0] = Block {
    vars = [X, W, Y],
    ops = [
      read(output = X)
      assign(input = ..., output = W)
      mult(input = {X, W}, output = Y)
    ],
  }
}
```

> Review comment: From my understanding, the `ProgramDesc` is an intermediate representation (IR). Currently we have Python as the frontend that generates the IR, and this PR discusses a C++ code backend. I think having Python as a frontend is a huge pain. In my opinion, Python has two benefits in the machine learning field, and in our case we benefit from neither of them. On top of that, we are trapped in the Python grammar. I think a better way is to invent our own language.

> Reply: I agree, and I believe that a new language is the future.
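
To make the structure of the message above more concrete, here is a minimal Python sketch that models it with plain dataclasses. The class names `OpDesc`, `BlockDesc`, and `ProgramDesc` and their fields are assumptions that mirror the listing; they are not the actual protobuf schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpDesc:
    """One operator: its type and the names of the variables it reads and writes."""
    type: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)


@dataclass
class BlockDesc:
    """A block: the variables it declares and its operators in program order."""
    vars: List[str] = field(default_factory=list)
    ops: List[OpDesc] = field(default_factory=list)


@dataclass
class ProgramDesc:
    """A program: a list of blocks (hypothetical stand-in for the protobuf message)."""
    blocks: List[BlockDesc] = field(default_factory=list)


def example_program() -> ProgramDesc:
    """Build the three-operator example: read, assign, mult."""
    block = BlockDesc(
        vars=["X", "W", "Y"],
        ops=[
            OpDesc("read", outputs=["X"]),
            OpDesc("assign", outputs=["W"]),  # the input is elided in the listing
            OpDesc("mult", inputs=["X", "W"], outputs=["Y"]),
        ],
    )
    return ProgramDesc(blocks=[block])
```

Calling `example_program()` returns a single-block program whose operators appear in the same order as in the listing.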

## Transpilers

We can write a transpiler program that takes a `ProgramDesc`, e.g.,
the one above, and outputs another `ProgramDesc`. Here are two
examples:

1. *Memory optimization transpiler*: We can write a transpiler that
   inserts `FreeMemoryOp`s into the above example `ProgramDesc` to
   free memory early, before the end of an iteration, and so keep the
   memory footprint small.

1. *Distributed training transpiler*: We can write a transpiler that
   converts a `ProgramDesc` into a distributed version consisting of two
   `ProgramDesc`s: one run by the trainer processes and the
   other by the parameter server.
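
As an illustration of the first kind of transpiler, the following Python sketch inserts a free operator after the last use of each variable. It is only a sketch under stated assumptions: operators are modeled as plain dictionaries, the op type name `free_memory` stands in for `FreeMemoryOp`, and the `keep` parameter (variables that must outlive the iteration) is hypothetical.

```python
from typing import Any, Dict, List, Set

Op = Dict[str, Any]  # {"type": str, "inputs": [names], "outputs": [names]}


def insert_free_ops(ops: List[Op], keep: Set[str]) -> List[Op]:
    """Return a new op list with a free_memory op after each variable's last use.

    Variables named in `keep` (for example parameters or fetched results)
    are never freed. The input list is left unmodified.
    """
    # Record the index of the last op that reads or writes each variable.
    last_use: Dict[str, int] = {}
    for i, op in enumerate(ops):
        for name in list(op["inputs"]) + list(op["outputs"]):
            last_use[name] = i

    result: List[Op] = []
    for i, op in enumerate(ops):
        result.append(op)
        # Variables whose last use is this op can be released right after it.
        dead = sorted(n for n, last in last_use.items() if last == i and n not in keep)
        if dead:
            result.append({"type": "free_memory", "inputs": dead, "outputs": []})
    return result


program = [
    {"type": "read", "inputs": [], "outputs": ["X"]},
    {"type": "assign", "inputs": [], "outputs": ["W"]},
    {"type": "mult", "inputs": ["X", "W"], "outputs": ["Y"]},
]
optimized = insert_free_ops(program, keep={"W", "Y"})
```

In this three-operator program every variable is still in use at the final `mult`, so the only inserted op frees `X` after it. In a longer program the same pass would place frees earlier, right after each variable's last consumer.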

In the rest of this article, we talk about a special kind of
transpiler, the *native code generator*, which takes a `ProgramDesc` and
generates a `.cu` (or `.cc`) file that C++ compilers (gcc, nvcc, icc)
can build into binaries.

## Native Code Generator

For the above example, the native code generator transpiler, say, the
CUDA code generator, should generate a `main` function:

> Review comment: In most cases, users may need a library such as a

> Reply: Good point! I prefer that our transpiler generates source code only, without reusing .a/.so files, so as to simplify the build process. To be precise, if the transpiler generates source code only, the general workflow would be

> Otherwise, if we try to reuse the .a/.so files

> It is error-prone because there is a chance we are using different compilers for

> It is true that the generated code might depend on third-party libraries, so our transpiler might also need to generate build commands, including dependencies.

```c++
int main() {
  auto X = fluid_cuda_read(...);
  auto W = fluid_cuda_create_tensor(...);
  auto Y = fluid_cuda_mult(X, W);
  return 0;
}
```

and the definitions of functions `fluid_cuda_read`,
`fluid_cuda_create_tensor`, and `fluid_cuda_mult`. Note that each
function can simply construct a C++ instance of an operator and
run it. For example:

```c++
paddle::Tensor fluid_cuda_read(...) {
  paddle::Tensor t;
  // `operator` is a C++ keyword, so the namespace is spelled `operators`.
  paddle::operators::Read r(&t, ...);
  r.Run();
  return t;
}
```

> Review comment: Should we copy the

> Reply: I think we need to copy the source files instead of reusing .a/.so files, for the reasons in #7178 (comment).

> Reply: Oh, never mind, I mixed up the two things. I think that after this code generation, the executor will indeed run these as usual.

For computational operators that have multiple *kernels*, each for a
specific hardware platform, for example, the `mult` operator, the
generated code should call its CUDA kernel:

```c++
paddle::Tensor fluid_cuda_mult(const paddle::Tensor& a,
                               const paddle::Tensor& b) {
  paddle::Tensor t;
  paddle::operators::Mult m(a, b, ...);
  m.Run(cuda_context);  // call Run on the instance `m`, not on the class
  return t;
}
```

where `cuda_context` could be a global variable of type
`paddle::CUDADeviceContext`.

> Review comment: Should we also consider the possibility of having a default fallback device in case some operator cannot run on a device? For example, we might have some CPU-only operators. In that case, our transpiler should make sure that it generates CPU code for that op even though the rest of the native code might be CUDA code.

> Reply: @abhinavarora See Paddle/paddle/framework/data_transform.h, lines 33 to 40 in eee6264. If the target machine does not have the target device, it will try to use a naive kernel implementation (say, the CPU kernel) instead of terminating. In the Fluid overview design, a runtime .a library is linked into the target program. In my view, the runtime library, not the transpiler, should solve the fallback problem.
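
The generation step itself can be sketched as a small Python function that walks the operators of a block and emits the C++ source of `main`. This is a sketch under assumptions, not the proposed implementation: operators are modeled as dictionaries, the table `CUDA_FUNCTIONS` is hypothetical (its function names are the ones used in this article), and the arguments that the article elides with `...` are simply omitted.

```python
from typing import Any, Dict, List

# Hypothetical mapping from op type to the generated C++ function name.
CUDA_FUNCTIONS = {
    "read": "fluid_cuda_read",
    "assign": "fluid_cuda_create_tensor",
    "mult": "fluid_cuda_mult",
}


def generate_main(ops: List[Dict[str, Any]]) -> str:
    """Emit the C++ source of a main function that calls one function per op."""
    lines = ["int main() {"]
    for op in ops:
        func = CUDA_FUNCTIONS[op["type"]]  # raises KeyError for unsupported ops
        args = ", ".join(op["inputs"])
        # This sketch assumes every op has exactly one output variable.
        (out,) = op["outputs"]
        lines.append(f"  auto {out} = {func}({args});")
    lines.append("  return 0;")
    lines.append("}")
    return "\n".join(lines)


program = [
    {"type": "read", "inputs": [], "outputs": ["X"]},
    {"type": "assign", "inputs": [], "outputs": ["W"]},
    {"type": "mult", "inputs": ["X", "W"], "outputs": ["Y"]},
]
source = generate_main(program)
```

For the example program, `source` holds a `main` function with one call per operator, ending in `auto Y = fluid_cuda_mult(X, W);`. A fuller generator would also emit the function definitions and, as the review discussion suggests, possibly the build commands.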

## Multi-Block Code Generation

Most Fluid application programs have more than one block. To
execute them, we need to trace [scopes](scope.md).
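
One way to picture scope tracing is a chain of scopes in which a lookup falls back to the enclosing scope. The Python sketch below assumes that each block runs in its own scope whose parent is the scope of the enclosing block; the class `Scope` and the methods `new_var` and `find_var` are hypothetical names for illustration, not the Paddle C++ API.

```python
from typing import Any, Dict, Optional


class Scope:
    """A table of variables with an optional link to the enclosing scope."""

    def __init__(self, parent: Optional["Scope"] = None) -> None:
        self.parent = parent
        self.vars: Dict[str, Any] = {}

    def new_var(self, name: str, value: Any = None) -> None:
        """Create (or overwrite) a variable in this scope only."""
        self.vars[name] = value

    def find_var(self, name: str) -> Any:
        """Look the name up here first, then in each enclosing scope."""
        scope: Optional[Scope] = self
        while scope is not None:
            if name in scope.vars:
                return scope.vars[name]
            scope = scope.parent
        raise KeyError(name)


# Block 0 owns W; a nested block creates its own X but can still see W.
global_scope = Scope()
global_scope.new_var("W", "weights")
block_scope = Scope(parent=global_scope)
block_scope.new_var("X", "input")
```

Here `block_scope.find_var("W")` resolves through the parent link, while `global_scope.find_var("X")` raises `KeyError` because an outer scope cannot see variables of an inner one. Generated code for multiple blocks would need to preserve this kind of lookup order.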

> Review comment: explain -> explained