
Design Doc: Distributed Training Architecture #3811

Merged (3 commits into PaddlePaddle:develop) on Sep 26, 2017

Conversation

@helinwang (Contributor) commented Sep 1, 2017

Here is a version that is easier to review.

The "Design Doc: Distributed Training Architecture" replaces "Design Doc: Fully Static Graph".

The refactor gives us a great opportunity to build a better distributed training architecture; here is the design doc.

```python
# Build the loss of a linear model, (x*w + b - y)^2, and create the train op.
loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
train_op = optimizer.minimize(loss)
# Run the train op for 10000 iterations.
for i in range(10000):
    paddle.eval(train_op)
```

Collaborator

I am not sure that our OPs will be enough to handle all possible Python logic. It would be perfectly fine if `paddle.eval(train_op)` could be distributed.

@helinwang (Contributor Author) Sep 7, 2017

Thanks for the comment! After more discussion with @wangkuiyi and @reyoung, we think we currently don't have to support all possible Python logic (we could gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.

Contributor Author

Thanks! I updated the design doc: the user can do the training in Python, and `session.eval(train_op)` gets distributed.

@putcn (Contributor) commented Sep 2, 2017

I like this idea. It would be great if the graph spec could be fully described as a programming-language-independent schema, which opens up lots of possibilities.


## Background

There are two paradigms for expressing the computation graph: dynamic

Collaborator

Here I can't tell the difference between dynamic and static nets. Do you mean static nets are those that don't change across iterations, while dynamic nets do?

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.


## Abstract

We propose the *fully static graph* rule: training and inference must

Collaborator

What is the relationship between "fully specified nets" and "static nets"?

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.


The user can still use Python to achieve the same result for
convenience when experimenting locally, but the distributed training
will not support Python.

Collaborator

Does this mean that users cannot program distributed PaddlePaddle jobs in Python? I am afraid that most users might not agree.

Contributor Author

Thanks! I have modified the design doc; now the user can use Python.


There are two paradigms for expressing the computation graph: dynamic
and static. The dynamic paradigm constructs the graph on the fly:
every time `eval` is called, a new graph is created. The static

Collaborator

I didn't know what `eval` was until I read the Python code snippet a few paragraphs below.

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.

```python
paddle.eval(train_op)
```

The above code can only run locally.

Collaborator

I see this comparison makes sense when we allow users to write the train loop. Otherwise, as long as each worker node implements the loop and users provide the loss as in this example, it would run both locally and distributedly.

Do we need to allow users to write the train loop?

Contributor Author

Thanks! I have modified the design doc: we allow the user to write the train loop for flexibility, but once we provide a while OP, the user does not need to write the train loop in Python.
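
To make this concrete, here is a self-contained toy sketch of the two options. Nothing below is PaddlePaddle API; `WhileNode`, `eval_node`, and `train_op` are invented purely to illustrate why a loop expressed as an op can be shipped to a runtime in a single eval call.

```python
class WhileNode(object):
    def __init__(self, cond, body):
        self.cond, self.body = cond, body

def eval_node(node, state):
    """Toy evaluator: a WhileNode loops inside the 'graph', anything else is a plain op."""
    if isinstance(node, WhileNode):
        while node.cond(state):
            node.body(state)
    else:
        node(state)

def train_op(state):
    # stand-in for one SGD step
    state["w"] -= 0.1 * state["w"]
    state["step"] += 1

state = {"w": 1.0, "step": 0}

# 1) Train loop written in Python: one eval call (one round-trip) per step.
for _ in range(3):
    eval_node(train_op, state)

# 2) Train loop expressed as an op: a single eval call runs the whole loop,
#    which is what makes it easy to hand to a remote runtime.
eval_node(WhileNode(cond=lambda s: s["step"] < 10, body=train_op), state)
print(state["step"])  # 10
```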

@@ -0,0 +1,122 @@
# Design Doc: Fully Static Graph

Collaborator

After discussing this design together with @reyoung, I noticed that this PR doesn't provide sufficient background information for readers to understand the problem.

As a design doc, this PR doesn't tell what the problem is. It goes directly to a conclusion. This is like firing a missile with no target.

It looks to me like this PR is trying to address the choice -- should we allow users to write the train loop? -- in the hope that the PaddlePaddle program doesn't change much when running locally and distributedly.

To help readers understand this problem better, we can start from what a DL program looks like:

```python
reader = new_reader("some_files")
for i in range(num_iterations):
    minibatch = reader.next()
    train(a_network, minibatch)
```

If we allow users to write the train loop, the users should also write the creation of the reader, because the loop uses it. This means that users specify the reading and feeding of data. However, it doesn't make sense for all workers to run this loop over the same data -- what we want is for each worker to load part of the data.

It is true that most DL frameworks make the reader depend on a 0-based, successive "worker id" (e.g., distributed PyTorch), and this approach works on HPC clusters. However, in a fault-tolerant distributed computing environment, there is no such worker id.

This is the reason we might need to restrict the user from writing the train loop. And if we cannot make this restriction, how should we solve the above problem? It seems that this document presents a solution -- wrap the train loop into a PaddlePaddle operator rather than a Python loop.
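
For reference, a minimal self-contained sketch of the worker-id sharding scheme mentioned above; the function and file names are made up for illustration.

```python
def shard(files, worker_id, num_workers):
    """Each worker takes every num_workers-th file, starting at worker_id."""
    return files[worker_id::num_workers]

files = ["part-%03d" % i for i in range(10)]
print(shard(files, worker_id=0, num_workers=3))  # ['part-000', 'part-003', 'part-006', 'part-009']
print(shard(files, worker_id=1, num_workers=3))  # ['part-001', 'part-004', 'part-007']

# The scheme assumes a stable, 0-based, successive worker id. In a
# fault-tolerant setting a restarted worker may not get the same id back,
# so the shards no longer cover the epoch exactly once -- which is the
# problem described above.
```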

Contributor Author

Thanks for taking the time to read the PR and write this feedback! It's very valuable to me. I will improve this PR to provide detailed background and the reasoning behind the solution it proposes.

Contributor Author

Thanks again for the feedback! I provided detailed background and reasoning in the updated PR.

The dynamic graph has the advantage of being flexible but is highly
dependent on the host language (most commonly Python). The static
graph is not as flexible, but more optimization can be done since the
graph is known before computing happens. PaddlePaddle is using the

Collaborator

I agree with you that fault recovery is easier if we always use a serializable computational graph, and I also agree that dist-paddle should never know about Python or how that computational graph was generated.

But I prefer that dist-paddle just run that graph at once and not care whether the graph contains a for-loop or not. That should be determined by the client side, because:

  1. It is clean and straightforward. Dist-paddle just runs a graph. The user program just generates the graph and invokes dist-paddle.
  2. It is flexible. The user can create one whole graph or many graphs for one training process (think of a pre-train network and a real network run in the same training process).

If we implement our framework like that, we should make the user experience in local mode the same as in distributed mode. We should consider the following things:

  1. In local mode, the running logic should be the same. Instead of sending the computation graph to multiple distributed Paddle instances, we just send it to the local Paddle directly.
  2. In local mode, we can find or create a variable in Python. For distributed Paddle, where does the user find a variable? Maybe a distributed scope?
  3. Maybe we should not let the user define a data reader in Python. Users would only read their data through a data_reader operator, which converts files into mini-batches. The difference between local mode and distributed mode is whether the data_reader operator reads from a local filesystem or a distributed filesystem. One more thing for the distributed data_reader operator: together the nodes should read the whole epoch of data, but each node should read a different part of it (see the sketch after this list).
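
A small self-contained sketch of point 3: the same reader abstraction with and without a sharding function. `DataReaderOp` and its arguments are invented for illustration, not a proposed API.

```python
class DataReaderOp(object):
    def __init__(self, file_list, shard_fn=None):
        # shard_fn picks this node's share of the files; None means
        # "read everything" (local mode).
        self.files = shard_fn(file_list) if shard_fn else file_list

    def minibatches(self, batch_size=2):
        batch = []
        for f in self.files:  # each file stands in for a stream of records
            batch.append(f)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

files = ["part-%02d" % i for i in range(6)]
local_reader = DataReaderOp(files)                               # local mode: all files
dist_reader = DataReaderOp(files, shard_fn=lambda fs: fs[1::3])  # one node's share
print(list(local_reader.minibatches()))
print(list(dist_reader.minibatches()))
```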

Contributor Author

Thanks for the very valuable suggestion! I will reconsider the solutions and discuss with you offline.

Contributor Author

Thanks! Updated.

@helinwang helinwang closed this Sep 7, 2017

@helinwang (Contributor Author) commented Sep 7, 2017

Currently we don't have to support all possible Python logic (we could gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.

There are many valuable comments in this PR; we can come back and look at them any time in the future.

@helinwang (Contributor Author)

Re-opened, because making everything an Op is a viable solution; we just don't have to support it fully in the first release. Instead, we can gradually add more Ops.

@helinwang helinwang reopened this Sep 8, 2017

@typhoonzero (Contributor)

I absolutely agree with the basic ideas of this doc. I'd like to discuss some details, including some things we discussed in today's meeting.

  • Representing a graph as a protobuf message. The graph the user designs could be different for local training, multi-GPU training, and distributed multi-GPU training, but we shouldn't make users care about how. We must "optimize" and restructure the graph before it can be run.
  • Having different code for distributed training and local training would be fine, I think, and probably simpler. For local training we run `pd.eval(graph, local_settings, ...)` and for the cluster we run `pd.submit(graph, cluster_settings, ...)`. The graphs are the same, but the settings vary. Then we can automatically add "send" and "recv" ops to the graph for cluster training according to the settings.

People generally have 3 things to configure to run:

  1. graph, which can be defined in Python and serialized (see the sketch after this list).
  2. data feeder, which can also be an "op".
  3. settings, including:
    • hyper parameters
    • devices to use
    • resources (memory, GPU memory, threads, num_nodes, communication ports, etc.)
    • graph optimization strategies
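
A toy sketch of the "same graph, different settings" idea, assuming hypothetical `pd_eval` / `pd_submit` entry points and using JSON as a stand-in for the protobuf message; only the shape of the calls matters.

```python
import json

# The graph the user defines in Python, serialized to one message
# (a stand-in for the protobuf message discussed above).
graph = {"ops": ["mul", "add", "square"], "vars": ["x", "w", "b", "y"]}
graph_msg = json.dumps(graph)

local_settings = {"devices": ["cpu"], "threads": 4}
cluster_settings = {"num_nodes": 8, "gpus_per_node": 2}

def pd_eval(msg, settings):    # local: run the graph directly
    return "run %d ops on %s" % (len(json.loads(msg)["ops"]), settings["devices"])

def pd_submit(msg, settings):  # cluster: hand the same graph to the cluster
    return "submit %d ops to %d nodes" % (len(json.loads(msg)["ops"]),
                                          settings["num_nodes"])

print(pd_eval(graph_msg, local_settings))
print(pd_submit(graph_msg, cluster_settings))
```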

@helinwang (Contributor Author) commented Sep 8, 2017

> Representing a graph as a protobuf message. The graph the user designs could be different for local training, multi-GPU training, and distributed multi-GPU training, but we shouldn't make users care about how. We must "optimize" and restructure the graph before it can be run.

Yes! The user should not need to deal with it; that's why having a graph representation is important. It's a feature that makes PaddlePaddle much easier to use than TensorFlow for multi-GPU or multi-node training.

> Having different code for distributed training and local training would be fine, I think, and probably simpler. For local training we run `pd.eval(graph, local_settings, ...)` and for the cluster we run `pd.submit(graph, cluster_settings, ...)`. The graphs are the same, but the settings vary. Then we can automatically add "send" and "recv" ops to the graph for cluster training according to the settings.

Agreed. This PR proposes submitting the job from the command line, but after discussing with @reyoung and seeing your comment, I agree it makes a lot of sense to be able to submit the job from Python. One very interesting idea from @reyoung is very similar to `pd.submit(graph, cluster_settings, ...)`: we can do `pd.eval(graph, cluster_settings, ...)` to send a request to Paddle Cloud (a controller process which controls all Paddle runtimes) for graph evaluation. The graph and parameters are cached on Paddle Cloud, so multiple `pd.eval(graph, cluster_settings, ...)` calls can be very efficient. In this way the "for OP" is not a requirement but an improvement (the user can still use a Python for loop with `pd.eval` as the loop body, but using the "for OP" would be more efficient).
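
A self-contained sketch of that caching idea: the client ships the serialized graph once, keyed by a content hash, and later eval requests only reference the id. `ToyController` stands in for Paddle Cloud; all names are invented for illustration.

```python
import hashlib

class ToyController(object):          # stands in for "Paddle Cloud"
    def __init__(self):
        self.cached = {}

    def eval(self, graph_id, graph_msg=None):
        if graph_id not in self.cached:   # only the first call ships the graph
            assert graph_msg is not None
            self.cached[graph_id] = graph_msg
        return "evaluated %s..." % graph_id[:8]

controller = ToyController()
graph_msg = b"serialized graph proto"
graph_id = hashlib.sha1(graph_msg).hexdigest()

controller.eval(graph_id, graph_msg)  # pays the upload cost once
for _ in range(3):
    controller.eval(graph_id)         # later evals only send the id
```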

> graph, which can be defined in Python and serialized.

Yes, the graph (one proto message) can be serialized. Besides, `pd.submit(graph, cluster_settings, ...)` or `pd.eval(graph, cluster_settings, ...)` should be able to take input tensors as arguments and return tensors, so tensors need to be serializable as well.

> data feeder, which can also be an "op".

Exactly; in remote training, the data feeder must be an OP, otherwise it's too inefficient.

> settings, including: ...

Maybe we can do things similar to TensorFlow:

```python
sess = tf.remoteSession(settings...)
sess.run(...)
```

@helinwang changed the title from "Design Doc: Fully Static Graph" to "Design Doc: Distributed Training Architecture" on Sep 15, 2017
For more information about `Session`, please
see [Design Doc: Session](./session.md).

### PaddlePaddle Converter

Collaborator

Should the converter be an independent process or a function in C++?

```cpp
class BlockConverter {
 public:
  virtual Block* Convert(const Block& block) const = 0;
};
```

Collaborator

Also, in general, Backward and Optimize are converters, too.

Contributor Author

I think the converter should be a library. On top of it we can wrap a gRPC service; on the cloud it could be a standalone process.

Contributor Author

> Also, in general, Backward and Optimize are converters, too.

Do you mean the backward and optimizer ops need to be added to the graph, and that this should be done by the converter? I think that makes a lot of sense. In this way the converter's input is what the user has defined, i.e., the unprocessed raw graph, giving the converter the most flexibility.
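
A toy sketch of that flow: the converter takes the raw forward graph the user defined and appends backward and optimizer ops as passes. The list-of-op-names representation is invented for illustration.

```python
def backward_pass(ops):
    # append a gradient op for every forward op, in reverse order
    return ops + ["%s_grad" % op for op in reversed(ops)]

def optimizer_pass(ops, params):
    # append one update op per trainable parameter
    return ops + ["sgd(%s)" % p for p in params]

forward = ["mul", "add", "square"]
converted = optimizer_pass(backward_pass(forward), params=["w", "b"])
print(converted)
# ['mul', 'add', 'square', 'square_grad', 'add_grad', 'mul_grad', 'sgd(w)', 'sgd(b)']
```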

```python
optimizer = paddle.optimizer.SGD(cost, learning_rate=0.01)
session = paddle.session.NewRemote(num_trainer=3, num_ps=2, GPU_per_trainer=1)
for i in range(1000):
    _, cost_val = session.eval(target=[cost, optimizer])
```

Contributor

`targets`

Contributor

```python
_, cost_val = session.eval(targets=[optimizer, cost])
```

Contributor Author

Thanks! Will change when I come back. Now getting online with mobile phone...

Contributor Author

Done.

PaddlePaddle should be able to modify the neural network computation
definition to support model parallelism automatically. However, the
computation is only specified in Python code, and PaddlePaddle cannot
modify Python code.

Member

How can Paddle support model parallelism automatically? Do users need to place some operators on a corresponding device? The topology is described by Python code and then converted to a proto message; Paddle can modify the proto message.

Contributor Author

Great question. I think the converter should be able to do device and node placement automatically. Optionally, the user can specify device constraints that the auto placement should satisfy.

I think we should not allow the user to specify node constraints, because it's too tedious, and if we want the computation to be scalable, the model should not constrain itself to run on a fixed number of machines. Furthermore, I suspect that even if we allow the user to specify node constraints, there is still a lot of work to do. We may as well take some more time and do the placement automatically.

General placement (model parallelism) is not easy; we need more time to do it. I think we should not support model parallelism at the beginning, since PaddlePaddle 0.10.0 does not support it either.
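
A toy sketch of "automatic placement with optional user constraints": the placer assigns a default device to every op unless the user pinned it. The function and hint format are invented for illustration.

```python
def place(ops, constraints=None, default_device="gpu:0"):
    # each op gets the user's pinned device if given, else the default
    constraints = constraints or {}
    return {op: constraints.get(op, default_device) for op in ops}

ops = ["read", "fc1", "fc2", "loss"]
print(place(ops))                               # fully automatic placement
print(place(ops, constraints={"read": "cpu"}))  # user pins one op to the CPU
```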

For more information about `Session`, please
see [Design Doc: Session](./session.md).

### PaddlePaddle Converter

Contributor

In my mind, the converter is a module that runs multiple passes over Blocks without allocating real memory. For example, in TensorFlow, the first pass eliminates some links and rewrites the graph; the second pass inserts the necessary pairs of send/recv operators and partitions the graph into sub-graphs.

Contributor Author

Yes, you are correct; the converter will traverse the block in multiple passes.
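
A toy sketch of one such pass: given ops already assigned to nodes, insert send/recv pairs on edges that cross nodes and collect per-node sub-graphs. The data structures are invented for illustration; a real pass would work on Blocks.

```python
ops = [("read", 0), ("fc", 1), ("loss", 1)]   # (op name, assigned node)

def insert_send_recv(ops):
    subgraphs = {}
    prev_node = None
    for name, node in ops:
        if prev_node is not None and node != prev_node:
            # the edge crosses nodes: add a send on one side, a recv on the other
            subgraphs.setdefault(prev_node, []).append("send->%d" % node)
            subgraphs.setdefault(node, []).append("recv<-%d" % prev_node)
        subgraphs.setdefault(node, []).append(name)
        prev_node = node
    return subgraphs

print(insert_send_recv(ops))
# {0: ['read', 'send->1'], 1: ['recv<-0', 'fc', 'loss']}
```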


#### session.eval

As shown in the graph, `session.eval` sends the IR and the evaluation

Contributor

The Python session is only a wrapper of the C++ session module, which exports the interfaces of the session.

@helinwang (Contributor Author) Sep 25, 2017

Done, added:

> The Python session is a wrapper of the C++ Session class. For more information about Session, please see Design Doc: Session.

@typhoonzero (Contributor) left a comment

LGTM currently. I think we can merge this and #3993 ASAP, so we can start detailed designs.

@helinwang helinwang merged commit 1c0a4c9 into PaddlePaddle:develop Sep 26, 2017
@helinwang helinwang deleted the graph_runtime branch September 26, 2017 04:47