
Design Doc: Distributed Training Architecture #3811

Merged (3 commits into PaddlePaddle:develop) on Sep 26, 2017

Conversation

@helinwang (Contributor) commented Sep 1, 2017

Here is a version that is easier to review.

The "Design Doc: Distributed Training Architecture" replaces "Design Doc: Fully Static Graph".

The refactor gives us a great opportunity to build a better distributed training architecture; here is the design doc.

```python
# Build the loss of a linear model, (x*w + b - y)^2, and create the train op.
loss = pd.op.square(pd.op.sub(pd.op.add(pd.op.mul(x, w), b), y))
train_op = optimizer.minimize(loss)
# Run the train op for 10000 iterations.
for i in range(10000):
    paddle.eval(train_op)
```

Collaborator

I am not sure that our OPs will be enough to handle all possible Python logic. It would be perfectly fine if `paddle.eval(train_op)` could be distributed.

@helinwang (Contributor Author) Sep 7, 2017

Thanks for the comment! After more discussion with @wangkuiyi and @reyoung, we think we currently don't have to support all possible Python logic (we could gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.

Contributor Author

Thanks! I updated the design doc: the user can do the training in Python, and `session.eval(train_op)` gets distributed.

@putcn (Contributor) commented Sep 2, 2017

I like this idea. It would be great if the graph spec could be fully described as a programming-language-independent schema, which opens up lots of possibilities.


## Background

There are two paradigms for expressing the computation graph: dynamic

Collaborator

Here I can't tell the difference between dynamic and static nets. Do you mean static nets are those that don't change across iterations, while dynamic nets do?

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.


## Abstract

We propose the *fully static graph* rule: training and inference must

Collaborator

What is the relationship between "fully specified nets" and "static nets"?

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.


The user can still use Python to achieve the same result for
convenience when experimenting locally, but the distributed training
will not support Python.

Collaborator

Does this mean that users cannot program distributed PaddlePaddle jobs in Python? I am afraid that most users might not agree.

Contributor Author

Thanks! I have modified the design doc; now the user can use Python.


There are two paradigms for expressing the computation graph: dynamic
and static. The dynamic paradigm constructs the graph on the fly:
every time `eval` is called, a new graph is created. The static

Collaborator

I didn't know what `eval` was until I read the Python code snippet a few paragraphs below.

Contributor Author

Thanks! I have modified the design doc; this confusion no longer exists.

```python
paddle.eval(train_op)
```

The above code can only run locally.

Collaborator

I see this comparison makes sense when we allow users to write the train loop. Otherwise, as long as each worker node implements the loop and users provide the loss as in this example, it would run both locally and distributedly.

Do we need to allow users to write the train loop?

Contributor Author

Thanks! I have modified the design doc: we allow the user to write the train loop for flexibility, but once we provide a while OP, the user does not need to write the train loop in Python.
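
To make this concrete, here is a self-contained toy sketch of the two options. Nothing below is PaddlePaddle API; `WhileNode`, `eval_node`, and `train_op` are invented purely to illustrate why a loop expressed as an op can be shipped to a runtime in a single eval call.

```python
class WhileNode(object):
    def __init__(self, cond, body):
        self.cond, self.body = cond, body

def eval_node(node, state):
    """Toy evaluator: a WhileNode loops inside the 'graph', anything else is a plain op."""
    if isinstance(node, WhileNode):
        while node.cond(state):
            node.body(state)
    else:
        node(state)

def train_op(state):
    # stand-in for one SGD step
    state["w"] -= 0.1 * state["w"]
    state["step"] += 1

state = {"w": 1.0, "step": 0}

# 1) Train loop written in Python: one eval call (one round-trip) per step.
for _ in range(3):
    eval_node(train_op, state)

# 2) Train loop expressed as an op: a single eval call runs the whole loop,
#    which is what makes it easy to hand to a remote runtime.
eval_node(WhileNode(cond=lambda s: s["step"] < 10, body=train_op), state)
print(state["step"])  # 10
```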

@@ -0,0 +1,122 @@
# Design Doc: Fully Static Graph

Collaborator

After discussing this design together with @reyoung, I noticed that this PR doesn't provide sufficient background information for readers to understand the problem.

As a design doc, this PR doesn't tell what the problem is. It goes directly to a conclusion. This is like firing a missile with no target.

It looks to me like this PR is trying to address the choice -- should we allow users to write the train loop? -- in the hope that the PaddlePaddle program doesn't change much when running locally and distributedly.

To help readers understand this problem better, we can start from what a DL program looks like:

```python
reader = new_reader("some_files")
for i in range(num_iterations):
    minibatch = reader.next()
    train(a_network, minibatch)
```

If we allow users to write the train loop, the users should also write the creation of the reader, because the loop uses it. This means that users specify the reading and feeding of data. However, it doesn't make sense for all workers to run this loop over the same data -- what we want is for each worker to load part of the data.

It is true that most DL frameworks make the reader depend on a 0-based, successive "worker id" (e.g., distributed PyTorch), and this approach works on HPC clusters. However, in a fault-tolerant distributed computing environment, there is no such worker id.

This is the reason we might need to restrict the user from writing the train loop. And if we cannot make this restriction, how should we solve the above problem? It seems that this document presents a solution -- wrap the train loop into a PaddlePaddle operator rather than a Python loop.
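
For reference, a minimal self-contained sketch of the worker-id sharding scheme mentioned above; the function and file names are made up for illustration.

```python
def shard(files, worker_id, num_workers):
    """Each worker takes every num_workers-th file, starting at worker_id."""
    return files[worker_id::num_workers]

files = ["part-%03d" % i for i in range(10)]
print(shard(files, worker_id=0, num_workers=3))  # ['part-000', 'part-003', 'part-006', 'part-009']
print(shard(files, worker_id=1, num_workers=3))  # ['part-001', 'part-004', 'part-007']

# The scheme assumes a stable, 0-based, successive worker id. In a
# fault-tolerant setting a restarted worker may not get the same id back,
# so the shards no longer cover the epoch exactly once -- which is the
# problem described above.
```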

Contributor Author

Thanks for taking the time to read the PR and write this feedback! It's very valuable to me. I will improve this PR to provide detailed background and the reasoning behind the solution it proposes.

Contributor Author

Thanks again for the feedback! I provided detailed background and reasoning in the updated PR.

The dynamic graph has the advantage of being flexible but is highly
dependent on the host language (most commonly Python). The static
graph is not as flexible, but more optimization can be done since the
graph is known before computing happens. PaddlePaddle is using the

Collaborator

I agree with you that fault recovery is easier if we always use a serializable computational graph, and I also agree that dist-paddle should never know about Python or how that computational graph was generated.

But I prefer that dist-paddle just run that graph at once and not care whether the graph contains a for-loop or not. That should be determined by the client side, because:

  1. It is clean and straightforward. Dist-paddle just runs a graph. The user program just generates the graph and invokes dist-paddle.
  2. It is flexible. The user can create one whole graph or many graphs for one training process (think of a pre-train network and a real network run in the same training process).

If we implement our framework like that, we should make the user experience in local mode the same as in distributed mode. We should consider the following things:

  1. In local mode, the running logic should be the same. Instead of sending the computation graph to multiple distributed Paddle instances, we just send it to the local Paddle directly.
  2. In local mode, we can find or create a variable in Python. For distributed Paddle, where does the user find a variable? Maybe a distributed scope?
  3. Maybe we should not let the user define a data reader in Python. Users would only read their data through a data_reader operator, which converts files into mini-batches. The difference between local mode and distributed mode is whether the data_reader operator reads from a local filesystem or a distributed filesystem. One more thing for the distributed data_reader operator: together the nodes should read the whole epoch of data, but each node should read a different part of it (see the sketch after this list).
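
A small self-contained sketch of point 3: the same reader abstraction with and without a sharding function. `DataReaderOp` and its arguments are invented for illustration, not a proposed API.

```python
class DataReaderOp(object):
    def __init__(self, file_list, shard_fn=None):
        # shard_fn picks this node's share of the files; None means
        # "read everything" (local mode).
        self.files = shard_fn(file_list) if shard_fn else file_list

    def minibatches(self, batch_size=2):
        batch = []
        for f in self.files:  # each file stands in for a stream of records
            batch.append(f)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

files = ["part-%02d" % i for i in range(6)]
local_reader = DataReaderOp(files)                               # local mode: all files
dist_reader = DataReaderOp(files, shard_fn=lambda fs: fs[1::3])  # one node's share
print(list(local_reader.minibatches()))
print(list(dist_reader.minibatches()))
```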

Contributor Author

Thanks for the very valuable suggestion! I will reconsider the solutions and discuss with you offline.

Contributor Author

Thanks! Updated.

@helinwang helinwang closed this Sep 7, 2017

@helinwang (Contributor Author) commented Sep 7, 2017

Currently we don't have to support all possible Python logic (we could gradually add more such OPs, but we don't need to support all of them in the initial release). Closing this PR now.

There are many valuable comments in this PR; we can come back and look at them any time in the future.

@helinwang (Contributor Author)

Re-opened, because making everything an Op is a viable solution; we just don't have to support it fully in the first release. Instead, we can gradually add more Ops.

@helinwang helinwang reopened this Sep 8, 2017

@typhoonzero (Contributor)

I absolutely agree with the basic ideas of this doc. I'd like to discuss some details, including some things we discussed in today's meeting.

  • Representing a graph as a protobuf message. The graph the user designs could be different for local training, multi-GPU training, and distributed multi-GPU training, but we shouldn't make users care about how. We must "optimize" and restructure the graph before it can be run.
  • Having different code for distributed training and local training would be fine, I think, and probably simpler. For local training we run `pd.eval(graph, local_settings, ...)` and for the cluster we run `pd.submit(graph, cluster_settings, ...)`. The graphs are the same, but the settings vary. Then we can automatically add "send" and "recv" ops to the graph for cluster training according to the settings.

People generally have 3 things to configure to run:

  1. graph, which can be defined in Python and serialized (see the sketch after this list).
  2. data feeder, which can also be an "op".
  3. settings, including:
    • hyper parameters
    • devices to use
    • resources (memory, GPU memory, threads, num_nodes, communication ports, etc.)
    • graph optimization strategies
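
A toy sketch of the "same graph, different settings" idea, assuming hypothetical `pd_eval` / `pd_submit` entry points and using JSON as a stand-in for the protobuf message; only the shape of the calls matters.

```python
import json

# The graph the user defines in Python, serialized to one message
# (a stand-in for the protobuf message discussed above).
graph = {"ops": ["mul", "add", "square"], "vars": ["x", "w", "b", "y"]}
graph_msg = json.dumps(graph)

local_settings = {"devices": ["cpu"], "threads": 4}
cluster_settings = {"num_nodes": 8, "gpus_per_node": 2}

def pd_eval(msg, settings):    # local: run the graph directly
    return "run %d ops on %s" % (len(json.loads(msg)["ops"]), settings["devices"])

def pd_submit(msg, settings):  # cluster: hand the same graph to the cluster
    return "submit %d ops to %d nodes" % (len(json.loads(msg)["ops"]),
                                          settings["num_nodes"])

print(pd_eval(graph_msg, local_settings))
print(pd_submit(graph_msg, cluster_settings))
```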

@helinwang (Contributor Author) commented Sep 8, 2017

> Representing a graph as a protobuf message. The graph the user designs could be different for local training, multi-GPU training, and distributed multi-GPU training, but we shouldn't make users care about how. We must "optimize" and restructure the graph before it can be run.

Yes! The user should not need to deal with it; that's why having a graph representation is important. It's a feature that makes PaddlePaddle much easier to use than TensorFlow for multi-GPU or multi-node training.

> Having different code for distributed training and local training would be fine, I think, and probably simpler. For local training we run `pd.eval(graph, local_settings, ...)` and for the cluster we run `pd.submit(graph, cluster_settings, ...)`. The graphs are the same, but the settings vary. Then we can automatically add "send" and "recv" ops to the graph for cluster training according to the settings.

Agreed. This PR proposes submitting the job from the command line, but after discussing with @reyoung and seeing your comment, I agree it makes a lot of sense to be able to submit the job from Python. One very interesting idea from @reyoung is very similar to `pd.submit(graph, cluster_settings, ...)`: we can do `pd.eval(graph, cluster_settings, ...)` to send a request to Paddle Cloud (a controller process which controls all Paddle runtimes) for graph evaluation. The graph and parameters are cached on Paddle Cloud, so multiple `pd.eval(graph, cluster_settings, ...)` calls can be very efficient. In this way the "for OP" is not a requirement but an improvement (the user can still use a Python for loop with `pd.eval` as the loop body, but using the "for OP" would be more efficient).
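
A self-contained sketch of that caching idea: the client ships the serialized graph once, keyed by a content hash, and later eval requests only reference the id. `ToyController` stands in for Paddle Cloud; all names are invented for illustration.

```python
import hashlib

class ToyController(object):          # stands in for "Paddle Cloud"
    def __init__(self):
        self.cached = {}

    def eval(self, graph_id, graph_msg=None):
        if graph_id not in self.cached:   # only the first call ships the graph
            assert graph_msg is not None
            self.cached[graph_id] = graph_msg
        return "evaluated %s..." % graph_id[:8]

controller = ToyController()
graph_msg = b"serialized graph proto"
graph_id = hashlib.sha1(graph_msg).hexdigest()

controller.eval(graph_id, graph_msg)  # pays the upload cost once
for _ in range(3):
    controller.eval(graph_id)         # later evals only send the id
```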

> graph, which can be defined in Python and serialized.

Yes, the graph (one proto message) can be serialized. Besides, `pd.submit(graph, cluster_settings, ...)` or `pd.eval(graph, cluster_settings, ...)` should be able to take input tensors as arguments and return tensors, so tensors need to be serializable as well.

> data feeder, which can also be an "op".

Exactly; in remote training, the data feeder must be an OP, otherwise it's too inefficient.

> settings, including: ...

Maybe we can do things similar to TensorFlow:

```python
sess = tf.remoteSession(settings...)
sess.run(...)
```

@helinwang changed the title from "Design Doc: Fully Static Graph" to "Design Doc: Distributed Training Architecture" on Sep 15, 2017
For more information about `Session`, please
see [Design Doc: Session](./session.md).

### PaddlePaddle Converter

Collaborator

Should the converter be an independent process or a function in C++?

```cpp
class BlockConverter {
 public:
  virtual Block* Convert(const Block& block) const = 0;
};
```

Collaborator

Also, in general, Backward and Optimize are converters, too.

Contributor Author

I think the converter should be a library. On top of it we can wrap a gRPC service; on the cloud it could be a standalone process.

Contributor Author

> Also, in general, Backward and Optimize are converters, too.

Do you mean the backward and optimizer ops need to be added to the graph, and that this should be done by the converter? I think that makes a lot of sense. In this way the converter's input is what the user has defined, i.e., the unprocessed raw graph, giving the converter the most flexibility.
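
A toy sketch of that flow: the converter takes the raw forward graph the user defined and appends backward and optimizer ops as passes. The list-of-op-names representation is invented for illustration.

```python
def backward_pass(ops):
    # append a gradient op for every forward op, in reverse order
    return ops + ["%s_grad" % op for op in reversed(ops)]

def optimizer_pass(ops, params):
    # append one update op per trainable parameter
    return ops + ["sgd(%s)" % p for p in params]

forward = ["mul", "add", "square"]
converted = optimizer_pass(backward_pass(forward), params=["w", "b"])
print(converted)
# ['mul', 'add', 'square', 'square_grad', 'add_grad', 'mul_grad', 'sgd(w)', 'sgd(b)']
```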

```python
optimizer = paddle.optimizer.SGD(cost, learning_rate=0.01)
session = paddle.session.NewRemote(num_trainer=3, num_ps=2, GPU_per_trainer=1)
for i in range(1000):
    _, cost_val = session.eval(target=[cost, optimizer])
```

Contributor

`targets`

Contributor

```python
_, cost_val = session.eval(targets=[optimizer, cost])
```

Contributor Author

Thanks! Will change when I come back. Now getting online with mobile phone...

Contributor Author

Done.

PaddlePaddle should be able to modify the neural network computation
definition to support model parallelism automatically. However, the
computation is only specified in Python code, and PaddlePaddle cannot
modify Python code.

Member

How can Paddle support model parallelism automatically? Do users need to place some operators on a corresponding device? The topology is described by Python code and then converted to a proto message; Paddle can modify the proto message.

Contributor Author

Great question. I think the converter should be able to do device and node placement automatically. Optionally, the user can specify device constraints that the auto placement should satisfy.

I think we should not allow the user to specify node constraints, because it's too tedious, and if we want the computation to be scalable, the model should not constrain itself to run on a fixed number of machines. Furthermore, I suspect that even if we allow the user to specify node constraints, there is still a lot of work to do. We may as well take some more time and do the placement automatically.

General placement (model parallelism) is not easy; we need more time to do it. I think we should not support model parallelism at the beginning, since PaddlePaddle 0.10.0 does not support it either.
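
A toy sketch of "automatic placement with optional user constraints": the placer assigns a default device to every op unless the user pinned it. The function and hint format are invented for illustration.

```python
def place(ops, constraints=None, default_device="gpu:0"):
    # each op gets the user's pinned device if given, else the default
    constraints = constraints or {}
    return {op: constraints.get(op, default_device) for op in ops}

ops = ["read", "fc1", "fc2", "loss"]
print(place(ops))                               # fully automatic placement
print(place(ops, constraints={"read": "cpu"}))  # user pins one op to the CPU
```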

For more information about `Session`, please
see [Design Doc: Session](./session.md).

### PaddlePaddle Converter

Contributor

In my mind, the converter is a module that runs multiple passes over Blocks without allocating real memory. For example, in TensorFlow, the first pass eliminates some links and rewrites the graph; the second pass inserts the necessary pairs of send/recv operators and partitions the graph into sub-graphs.

Contributor Author

Yes, you are correct; the converter will traverse the block in multiple passes.
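
A toy sketch of one such pass: given ops already assigned to nodes, insert send/recv pairs on edges that cross nodes and collect per-node sub-graphs. The data structures are invented for illustration; a real pass would work on Blocks.

```python
ops = [("read", 0), ("fc", 1), ("loss", 1)]   # (op name, assigned node)

def insert_send_recv(ops):
    subgraphs = {}
    prev_node = None
    for name, node in ops:
        if prev_node is not None and node != prev_node:
            # the edge crosses nodes: add a send on one side, a recv on the other
            subgraphs.setdefault(prev_node, []).append("send->%d" % node)
            subgraphs.setdefault(node, []).append("recv<-%d" % prev_node)
        subgraphs.setdefault(node, []).append(name)
        prev_node = node
    return subgraphs

print(insert_send_recv(ops))
# {0: ['read', 'send->1'], 1: ['recv<-0', 'fc', 'loss']}
```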


#### session.eval

As shown in the graph, `session.eval` sends the IR and the evaluation

Contributor

The Python session is only a wrapper of the C++ session module, which exports the interfaces of the session.

@helinwang (Contributor Author) Sep 25, 2017

Done, added:

> The Python session is a wrapper of the C++ Session class. For more information about Session, please see Design Doc: Session.

@typhoonzero (Contributor) left a comment

LGTM currently. I think we can merge this and #3993 ASAP, so we can start detailed designs.

@helinwang helinwang merged commit 1c0a4c9 into PaddlePaddle:develop Sep 26, 2017
@helinwang helinwang deleted the graph_runtime branch September 26, 2017 04:47