Design Doc: Distributed Training Architecture #3811
Conversation
doc/design/fully_static_graph.md (outdated)
```python
loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
train_op = optimizer.minimize(loss)
for i in range(10000):
    paddle.eval(train_op)
```
I am not sure that our OPs will be enough to handle all possible Python logic. It would be perfectly fine if `paddle.eval(train_op)` could be distributed.
Thanks for the comment! After more discussion with @wangkuiyi and @reyoung, we think we currently don't have to support all possible Python logic (we could gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.
Thanks! Updated the design doc: users can do the training in Python, and the `session.eval(train_op)` call gets distributed.

I like this idea. It would be great if the graph spec could be fully described as a programming-language-independent schema, which would open up lots of possibilities.
doc/design/fully_static_graph.md (outdated)
## Background

There are two paradigms for expressing the computation graph: dynamic …
Here I cannot get the difference between dynamic and static nets. Do you mean static nets are those that don't change with iterations, but dynamic nets do?
Thanks! I have modified the design doc; this confusion no longer exists.
doc/design/fully_static_graph.md (outdated)
## Abstract

We propose the *fully static graph* rule: training and inference must …
What is the relationship between "fully specified nets" and "static nets"?
Thanks! I have modified the design doc; this confusion no longer exists.
doc/design/fully_static_graph.md (outdated)
The user can still use Python to achieve the same result for convenience when experimenting locally, but the distributed training will not support Python.
Does this mean that users cannot program distributed PaddlePaddle jobs in Python? I am afraid that most users might not agree.
Thanks! I have modified the design doc; now the user can use Python.
doc/design/fully_static_graph.md (outdated)
There are two paradigms for expressing the computation graph: dynamic and static. The dynamic paradigm constructs the graph on the fly: every time `eval` is called, a new graph is created. The static …
I didn't know what `eval` was until I read the Python code snippet a few paragraphs below.
Thanks! I have modified the design doc; this confusion no longer exists.
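For readers following along, here is a minimal sketch of the contrast. The dynamic-style API below is hypothetical and only for illustration; the static half reuses the doc's own snippet:

```python
# Dynamic paradigm (hypothetical eager-style API): the graph is rebuilt
# on every iteration, so arbitrary Python control flow can interleave
# with graph construction.
for i in range(10000):
    loss = pd.square(pd.sub(pd.add(pd.mul(x, w), b), y))  # a fresh graph each time
    pd.eval(loss)

# Static paradigm (the doc's snippet): the graph is declared once, so
# its full structure is known before any computation happens.
loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
train_op = optimizer.minimize(loss)
for i in range(10000):
    paddle.eval(train_op)  # re-evaluates the same fixed graph
```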
doc/design/fully_static_graph.md (outdated)
```python
paddle.eval(train_op)
```

The above code can only run locally.
I see this comparison makes sense when we allow users to write the train loop. Otherwise, as long as each worker node implements the loop and users provide the `loss` as in this example, it would run both locally and distributedly.

Do we need to allow users to write the train loop?
Thanks! I have modified the design doc: we allow users to write the train loop for flexibility, but once we provide a `while` OP, users will not need to write the train loop in Python.
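As a rough illustration of that direction, here is a hedged sketch of what such a `while` OP might look like; every name below is an assumption, not the final API:

```python
# Hypothetical sketch of a `while` OP. The train loop becomes part of
# the graph itself, so a single session.eval can ship the whole loop
# to remote workers instead of driving it from Python.
i = pd.var(0)
cond = pd.op.less_than(i, 10000)        # loop condition as a graph node

with pd.op.while_loop(cond) as loop:    # loop body is a sub-block of the graph
    minibatch = pd.op.data_reader()     # data reading is an OP, too
    loss = build_network(minibatch)     # `build_network` defines the user's model
    optimizer.minimize(loss)
    pd.op.increment(i)

session.eval(loop)                      # one eval runs the entire train loop
```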
doc/design/fully_static_graph.md (outdated)

# Design Doc: Fully Static Graph
After discussing this design together with @reyoung, I noticed that this PR doesn't provide sufficient background information for readers to understand the problem.

As a design doc, this PR doesn't tell what the problem is. It goes directly to a conclusion. This is like firing a missile with no target.

It looks to me like this PR is trying to address a choice: should we allow users to write the train loop, in the hope that a PaddlePaddle program doesn't change much between running locally and running distributedly?

To help readers understand this problem better, we can start from what a DL program looks like:

```python
reader = new_reader("some_files")
for i in range(num_iterations):
    minibatch = reader.next()
    train(a_network, minibatch)
```

If we allow users to write the train loop, the users should also write the creation of the reader, because the loop uses it. This means that users specify the reading and feeding of data. However, it doesn't make sense for all workers to run this loop over the same data; what we want is for each worker to load part of the data.

It is true that most DL frameworks make the reader depend on a 0-based, successive "worker id" (e.g., distributed PyTorch), and this approach works on HPC clusters. However, in a fault-tolerant distributed computing environment, there is no such worker id.

This is the reason we might need to restrict the user from writing the train loop. And if we cannot make this restriction, how should we solve the above problem? It seems that this document presents a solution: wrap the train loop into a PaddlePaddle operator rather than a Python loop.
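To make the worker-id point concrete, here is a minimal sketch of the common sharding idiom that presumes such an id; the names are invented for illustration:

```python
# Hypothetical sketch: shard a dataset by a stable, 0-based worker id,
# as distributed PyTorch-style setups do on HPC clusters. In a
# fault-tolerant environment, `worker_id` may not exist or may not stay
# stable across restarts, which is exactly the problem described above.
def shard(all_records, worker_id, num_workers):
    return [r for i, r in enumerate(all_records) if i % num_workers == worker_id]
```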
Thanks for taking the time to read the PR and write this feedback! It's very valuable to me. I will improve this PR to provide detailed background and the reasoning behind the solution it proposes.
Thanks again for the feedback! I provided detailed background and reasoning in the updated PR.
doc/design/fully_static_graph.md (outdated)
The dynamic graph has the advantage of being flexible but is highly dependent on the host language (most commonly Python). The static graph is not as flexible, but more optimization can be done since the graph is known before computing happens. PaddlePaddle is using the …
I agree with you that fault recovery is easier if we always use a serializable computational graph, and I also agree that `dist-paddle` should never know about Python or how that computational graph was generated.

But I prefer that `dist-paddle` just run that graph as a whole and not care whether the graph contains a `for`-loop or not. That should be determined by the client side, because:

- It is clean and straightforward. `dist-paddle` just runs a graph. The user program just generates the graph and invokes `dist-paddle`.
- It is flexible. The user can create one whole graph or many graphs for one training process (think of a pre-training network and a real network running in the same training process).

If we implement our framework like that, we should make the user experience in local mode the same as in distributed mode. We should consider the following things:

- In local mode, the running logic should be the same. Instead of sending the computation graph to multiple distributed Paddle instances, we just send it to the local Paddle directly.
- In local mode, we can `find` or `create` a variable in Python. In distributed Paddle, where does the user `find` a variable? Maybe in a `distributed scope`?
- Maybe we should not let the user define a `data reader` in Python. Users can only read their data through a `data_reader` operator, which converts files into mini-batches. The difference between the local and distributed modes of the `data_reader` operator is whether it reads from a local filesystem or from a distributed filesystem. One more thing about the distributed `data_reader` operator: the nodes together should read the whole epoch of data, but each node should read a different part of it. (See the sketch after this list.)
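A minimal sketch of what using such a `data_reader` operator could look like; the operator and attribute names are assumptions, not a real API:

```python
# Hypothetical sketch of a data_reader OP. Because the reader is a graph
# node, the framework (not user-side Python) decides which part of the
# dataset each worker reads, in local and distributed mode alike.
minibatch = pd.op.data_reader(
    path="/fs/train-*.recordio",  # local or distributed filesystem path
    batch_size=128,
)
loss = build_network(minibatch)   # `build_network` defines the user's model
```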
Thanks for the very valuable suggestion! I will reconsider the solutions and discuss with you offline.
Thanks! Updated.
Currently we don't have to support all possible Python logic (we can gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.

Re-opened, because making everything an OP is a viable solution; we just don't have to support it in the first release. Instead, we can gradually add more OPs.
Absolutely agree with the basic ideas of this doc. I'd like to discuss some details, including some things we discussed today in the meeting.

People generally have 3 things to configure to run: …
Yes! The user should not need to deal with it; that's why having a graph representation is important. It's an important feature that makes PaddlePaddle much easier to use than TensorFlow when doing multi-GPU or multi-node training.
Agreed. This PR proposes submitting the job from the command line, but after discussing with @reyoung and seeing your comment, I agree it makes a lot of sense to be able to submit the job from Python. One very interesting approach, from @reyoung's idea, is very similar to …
Yes, the graph (one proto message) can be serialized. Besides, …
Exactly. In remote training, the data feeder must be an OP; otherwise it's too inefficient.
Maybe we can do something similar to TensorFlow:

```
sess = tf.remoteSession(settings...)
sess.run(...)
```
For more information about `Session`, please see [Design Doc: Session](./session.md).

### PaddlePaddle Converter
Should the `converter` be an independent process or a function in C++?

```cpp
class BlockConverter {
 public:
  virtual Block* Convert(const Block& block) const = 0;
};
```
Also, in general, `Backward` and `Optimize` are converters, too.
I think the converter should be a library; on top of it we can wrap a gRPC service. On the cloud it could be a standalone process.
> Also, in general, `Backward` and `Optimize` are converters, too.

Do you mean that the backward and optimize ops need to be added to the graph, and that this should be done by the converter? I think it makes a lot of sense. That way the converter's input is exactly what the user has defined, i.e., the unprocessed raw graph, which gives the converter the most flexibility.
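A hedged sketch of that idea; the function names below are assumptions, not a real API:

```python
# Hypothetical sketch: the converter takes the raw, user-defined forward
# graph and appends the backward and optimization ops before any
# distributed placement or partitioning happens.
def convert(raw_graph, optimizer):
    graph = add_backward_ops(raw_graph)         # append gradient computation
    graph = add_optimize_ops(graph, optimizer)  # append parameter updates
    return graph
```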
```python
optimizer = paddle.optimizer.SGD(cost, learning_rate=0.01)
session = paddle.session.NewRemote(num_trainer=3, num_ps=2, GPU_per_trainer=1)
for i in range(1000):
    _, cost_val = session.eval(target=[cost, optimizer])
```
`targets`
`_, cost_val = session.eval(targets=[optimizer, cost])`
Thanks! Will change when I come back. Now getting online with mobile phone...
Done.
PaddlePaddle should be able to modify the neural network computation definition to support model parallelism automatically. However, the computation is only specified in Python code, and PaddlePaddle cannot modify Python code.
How can Paddle support model parallelism automatically? Do users need to place some operators on particular devices? The topology is described by Python code and then converted to a proto message; Paddle can modify the proto message.
Great question. I think the converter should be able to make device and node placement automatically. Optionally, the user can specify device constraints that the automatic placement should satisfy.

I think we should not allow the user to specify node constraints, because that is too tedious, and if we want the computation to be scalable, the model should not constrain itself to running on a fixed number of machines. Furthermore, I suspect that even if we allowed the user to specify node constraints, there would still be a lot of work to do. We may just take some more time to do the placement automatically.

General placement (model parallelism) is not easy; we need more time to do it. I think we should not support model parallelism at the beginning, since PaddlePaddle 0.10.0 does not support it either.
For more information about `Session`, please see [Design Doc: Session](./session.md).

### PaddlePaddle Converter
In my mind, the converter is a module that runs multiple passes over Blocks without allocating real memory. For example, in TensorFlow, the first pass eliminates some links and rewrites the graph, and the second pass inserts the necessary pairs of send/recv operators and partitions the graph into sub-graphs.
Yes, you are correct; the converter will traverse the block in multiple passes.
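A sketch of that multi-pass structure, with hypothetical pass names modeled on the TensorFlow-style passes described above:

```python
# Hypothetical sketch: each pass rewrites the block without allocating
# real memory, and the final pass partitions the graph into per-node
# sub-graphs connected by send/recv operator pairs.
def run_converter(block, placement):
    block = eliminate_dead_links(block)         # pass 1: simplify the graph
    block = insert_send_recv(block, placement)  # pass 2: add cross-node edges
    return partition(block, placement)          # pass 3: one sub-graph per node
```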
#### session.eval

As shown in the graph, `session.eval` sends the IR and the evaluation …
The Python session is only a wrapper of the C++ session module, which exports the session's interfaces.
Done, added:

> The Python `session` is a wrapper of the C++ `Session` class. For more information about `Session`, please see [Design Doc: Session](./session.md).
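That wrapper pattern might look like the following minimal sketch; the binding module and method names are assumptions:

```python
# Hypothetical sketch of the Python-side wrapper. It holds a handle to
# the C++ Session and forwards every call to it.
from paddle import core  # assumed pybind-style bindings to the C++ core

class Session(object):
    def __init__(self, *args, **kwargs):
        self._cpp_session = core.Session(*args, **kwargs)  # C++ object

    def eval(self, targets):
        # Delegate graph execution to the C++ Session.
        return self._cpp_session.eval(targets)
```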
LGTM currently. I think we can merge this and #3993 ASAP, so we can start detailed designs.
It is easier to review here.
The "Design Doc: Distributed Training Architecture" replaces "Design Doc: Fully Static Graph".
The refactor gives us a great opportunity to build a better distributed training architecture; here is the design doc.