Design Doc: The Client Library of Parameter Server #2075

Merged 15 commits on May 13, 2017

Contributor

@helinwang commented May 10, 2017

Maybe this is a better place for review.

// to all parameter servers.
func (*PServerClient) Save(path string) error
```
Please see [master server design doc](master_server.md) for the definition of `master.Task`.
Contributor

It's a dead link

Contributor Author

@helinwang May 10, 2017

The file is still in a PR. The link will work once that PR is merged.


// GetTask gets a new task by telling the master server the finished task.
// Use nil as the finished task when getting the task for the first time.
func (*MasterClient) GetTask(finished master.Task) (master.Task, error)
Contributor

Would it be better to split "finish a task" and "get a new task" into two functions?

Contributor Author

I removed the master part. I will send another PR later, and it will be split into two functions there.

// Parameter is a piece of data to sync with the parameter server.
type Parameter struct {
Name string
ElementType ElementType
Contributor

Should the field name and its type name be distinguished from each other?

Contributor Author

That does not cause a problem: as a field name, ElementType is accessed as Parameter.ElementType, so it does not conflict with the ElementType in the global scope.


The Go interface is the basic abstraction of communications with the master server and parameter servers. We will add another layer on top (add retry logic, polish interface with C idiom) before exposing the library with a [C interface](#c-interface).

```go
Contributor

Do these interfaces duplicate the ones in #1964? The communication interfaces are exported by the server and will be called remotely by the client, so the server and the client need to share the same interface definition code.

Contributor Author

@helinwang May 10, 2017

The client library is not just a simple wrapper on top of RPC. For example, ParamInitChan will use an etcd lock to decide who initializes the parameters (it does not call the parameter server RPC).

Indeed, the API here depends heavily on the parameter server RPC API, so this PR and #1964 depend on each other. This is the API I have in mind. I will discuss it with @dzhwinter and everyone else to converge on a single API, borrowing ideas from both PRs.
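
The "who initializes the parameters" decision can be illustrated with a small sketch. It is only an analogy under stated assumptions: a process-local C11 `atomic_flag` stands in for the etcd lock, and `try_become_initializer` is a hypothetical name that does not appear in the design doc.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the etcd lock. The real design would use a distributed
 * lock shared by all trainers, not a process-local flag (assumption
 * made purely for illustration). */
static atomic_flag init_lock = ATOMIC_FLAG_INIT;

/* Returns true for exactly one caller: the one that should initialize
 * the parameters. Every later caller gets false. */
bool try_become_initializer(void) {
    /* test_and_set returns the previous value; false means we were first. */
    return !atomic_flag_test_and_set(&init_lock);
}
```

The first caller gets `true` and would initialize the parameters; later callers get `false` and would wait for initialization to finish instead of repeating it.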

Contributor

@typhoonzero commented May 10, 2017

Well, in general, there are two ways to implement a parameter server:

  1. Predefine all the possible optimization algorithms and implement them inside the parameter server. The client then just calls the "algorithm-id" with gradients and some arguments.
  2. Define basic ops inside the parameter server, such as "tensor add" or "tensor multiply". The client then sends the gradients and an expression describing the optimization algorithm, e.g. error := SendGrad("momentum(param, grads, lita, ...)", grads)

This document takes the first approach. Is the second one possible? All trainers will use the same code, so passing a different expression to the parameter server does not seem to be a problem.
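
As a rough illustration of the first approach, the sketch below dispatches on an algorithm id inside the server. The `algo_id` values and the `apply_update` function are hypothetical names chosen for this example, and only plain SGD is implemented.

```c
#include <stddef.h>

/* Hypothetical algorithm ids; the real set is not specified in this thread. */
typedef enum { ALGO_SGD = 0 } algo_id;

/* Approach 1: the server applies a predefined update selected by id.
 * Returns 0 on success, -1 if the id is not a known algorithm. */
int apply_update(algo_id id, double *param, const double *grad,
                 size_t n, double learning_rate) {
    switch (id) {
    case ALGO_SGD:
        /* Plain SGD: param <- param - learning_rate * grad */
        for (size_t i = 0; i < n; i++) {
            param[i] -= learning_rate * grad[i];
        }
        return 0;
    default:
        return -1;
    }
}
```

The client only names the algorithm and supplies its arguments; all update logic stays on the server.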


@typhoonzero The second way is possible to implement, but it's hard for me to see its benefit:

I think what the user wants is to choose an update algorithm, so implementing 1 keeps the logic simple: the user just needs to specify a description of the predefined algorithm.
Implementing 2 is complex, and the expressiveness of a single string is limited. For example, how can one express an update rule that is not stateless (e.g., some state is saved between gradient updates) with a single string? To me, it feels like 2 makes the system a lot more complex, and I cannot see a clear benefit (maybe I am wrong).
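
To make the point about state concrete, here is a minimal sketch of classical momentum. The velocity buffer has to persist between gradient updates, which is the kind of state a single expression string has no obvious place for. `momentum_state` and `momentum_update` are hypothetical names used only for this example.

```c
#include <stddef.h>

/* Optimizer state that must survive between gradient updates. */
typedef struct {
    double *velocity; /* one entry per element, caller-allocated, zeroed */
    double momentum;  /* momentum coefficient, e.g. 0.9 */
    size_t n;         /* number of elements */
} momentum_state;

/* Classical momentum: v <- momentum * v - lr * g; param <- param + v. */
void momentum_update(momentum_state *s, double *param,
                     const double *grad, double learning_rate) {
    for (size_t i = 0; i < s->n; i++) {
        s->velocity[i] = s->momentum * s->velocity[i] - learning_rate * grad[i];
        param[i] += s->velocity[i];
    }
}
```

Calling `momentum_update` twice with the same gradient produces two different steps, because the second call sees the velocity left behind by the first.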

Helin


@helinwang @typhoonzero
We are moving to an Operator/kernel implementation at the same time, so we need to take that into account in the trainer process.
Dong zhihong


@@ -0,0 +1,82 @@
# Design Doc: Trainer Communication Library
Collaborator

Design Doc: Trainer Communication Library

==>

Design Doc: The Client Library of Parameter Server

Contributor Author

Done.

@@ -0,0 +1,82 @@
# Design Doc: Trainer Communication Library

For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the trainer's communication library, which will manage communication with parameter servers and the [master server](master_server.md). The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.
Collaborator

The client library should focus on talking to the parameter servers, not to the master process, because it is the client library of the parameter servers.

Contributor Author

Agreed. I removed the master communication part.


For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the trainer's communication library, which will manage communication with parameter servers and the [master server](master_server.md). The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.

## Go Interface
Collaborator

Go Interface

==>

Go Implementation

Contributor Author

I replaced the Go API with the C API. The Go implementation can be discussed later.


## Go Interface

The Go interface is the basic abstraction of communications with the master server and parameter servers. We will add another layer on top (add retry logic, polish interface with C idiom) before exposing the library with a [C interface](#c-interface).
Collaborator

We write this client library in Go because the parameter server is written in Go. To make the client library callable from trainers, which are written in C/C++, we need a C wrapper for this Go library.

Contributor Author

I replaced the Go API with the C API.

type Parameter struct {
Name string
ElementType ElementType
Buffer []byte
Collaborator

Buffer => Content ?

Contributor Author

Changed to content (in the C API; the Go API has been removed).

@jacquesqiao self-requested a review May 10, 2017 23:16
@helinwang changed the title from "Trainer Communication Library design doc, the Go interface part" to "Design Doc: The Client Library of Parameter Server" May 11, 2017
## C Interface

```c
#define PADDLE_ELEMENT_TYPE_INT32 0
Contributor

define => enum?

Contributor Author

Thanks! Done.
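
A sketch of what the enum form might look like. Only the three values visible in the quoted hunks are listed; the values between 0 and 4 are not shown in this thread and are left out rather than guessed. The type name `paddle_element_type` and the helper `element_size_bytes` are assumptions made for illustration.

```c
#include <stddef.h>

/* Enum replacing the #define constants. Only values visible in this
 * thread are listed; the remaining element types are omitted here. */
typedef enum {
    PADDLE_ELEMENT_TYPE_INT32 = 0,
    PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
    PADDLE_ELEMENT_TYPE_FLOAT64 = 5
} paddle_element_type;

/* Hypothetical helper: width in bytes of one element of the given type,
 * or 0 if the type is not one of the values listed above. */
size_t element_size_bytes(paddle_element_type t) {
    switch (t) {
    case PADDLE_ELEMENT_TYPE_INT32:
        return 4;
    case PADDLE_ELEMENT_TYPE_FLOAT32:
        return 4;
    case PADDLE_ELEMENT_TYPE_FLOAT64:
        return 8;
    default:
        return 0;
    }
}
```

Compared with the `#define` form, the enum gives the constants a type that function signatures can name and that a compiler can check in `switch` statements.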

#define PADDLE_ELEMENT_TYPE_FLOAT32 4
#define PADDLE_ELEMENT_TYPE_FLOAT64 5

typedef struct {
Contributor

Do we need to specify a namespace?

Contributor Author

C does not seem to have namespaces. A prefix is generally used to avoid conflicts at link time; for example, we use paddle_.

* @param learning_rate the learning rate for the gradients.
* @return 0 if successful, otherwise -1.
*/
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int total, double learning_rate);
Contributor

Maybe it would be better to use the same interface as "set_parameter"? Otherwise, set_parameter is only used once.

Contributor Author

@helinwang May 12, 2017

Removed the set_parameter interface, keeping only paddle_send_grads.
Here are the reasons:
As far as I know, set_parameter is only used for saving the learning rate (or other optimizer-related state) so that the next time the user starts the training job, the learning rate can resume from the value it had when the model was saved. I don't think this use case justifies the complication it adds to the system:

  1. What should we do when saving parameters? Do we save these parameters as well?
  2. It makes the interface harder to understand.

If the user wants to use a different learning rate the next time they start a training job, they can print the learning rate to the screen and change the Python code accordingly.

If we think this use case is important, maybe we can discuss it and add it later? For now, I think we should choose the simple way.

Maybe there are other use cases for set_parameter that I am unaware of?
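
A sketch of the simpler flow described in this reply, in which the trainer owns the learning rate and passes it on every call, so nothing has to be stored through a separate set_parameter call. The struct layouts, the stub body of `paddle_send_grads`, and `run_steps` are hypothetical; this thread shows only the function signature and its 0 / -1 return convention.

```c
#include <stddef.h>

/* Hypothetical stand-ins: the thread does not show these layouts. */
typedef struct {
    int connected;  /* nonzero once the client is usable */
    double last_lr; /* recorded by the stub so its effect is observable */
} paddle_pserver_client;

typedef struct {
    const char *name;
    const double *content;
    int content_len;
} paddle_gradient;

/* Stub with the signature from the design doc. A real implementation
 * would send the gradients to the parameter servers; this one only
 * validates its arguments and records the learning rate. */
int paddle_send_grads(paddle_pserver_client *client, const paddle_gradient *grads,
                      int total, double learning_rate) {
    if (client == NULL || !client->connected || grads == NULL || total <= 0) {
        return -1;
    }
    client->last_lr = learning_rate;
    return 0;
}

/* The trainer computes the (decayed) learning rate itself on each step. */
int run_steps(paddle_pserver_client *client, const paddle_gradient *grads,
              int total, double base_lr, double decay, int steps) {
    for (int step = 0; step < steps; step++) {
        double lr = base_lr / (1.0 + decay * step);
        if (paddle_send_grads(client, grads, total, lr) != 0) {
            return -1; /* propagate the failure to the caller */
        }
    }
    return 0;
}
```

Because the learning rate travels with every call, resuming a job with a different rate only requires changing the value the trainer passes in.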

* @return 0 if successful, otherwise -1.
*/
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int total, double learning_rate);

Contributor

In fact, I want to know where the Updater/optimizer should live.
The trainer needs to calculate the gradients before sending them, and the pserver also needs an optimizer to store the training state.

Contributor Author

Actually, this call sends only the learning rate, not the update method. The update method is sent here using Protocol Buffers.

Collaborator

@reyoung commented May 11, 2017

Does the design in this PR take sparse updates and regularization into account? Both are fairly complex to implement on the PServer side.

Contributor Author

@helinwang

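@reyoung Thanks for the reminder! I have added how to support sparse updates and regularization to the design doc; please see the updated version.
I also wrote up a summary here: #2106 (comment)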


## Parameter Partition

Each parameter will be partitioned into parameter chunks to make the parameters evenly distributed on parameter servers. The partition is done automatically by the client library. The *sparse parameter* require a little different treatment:
Contributor

Hmm, we should unify the naming of parameter partitions with the ParameterServer. I wrote ParameterBlock, which is the name Paddle already uses.

Contributor Author

Done. Renamed parameter chunks to parameter blocks.
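
The partitioning of a parameter into blocks spread evenly over the parameter servers can be sketched as follows. The fixed block size, the round-robin placement, and all function names are assumptions for illustration; the thread does not specify how blocks are sized or assigned, and sparse parameters are not covered.

```c
#include <stddef.h>

/* Number of blocks needed for n_elems elements; block_size must be > 0. */
size_t num_blocks(size_t n_elems, size_t block_size) {
    return (n_elems + block_size - 1) / block_size;
}

/* Length of block block_idx. The last block may be shorter, and an
 * out-of-range index yields 0. */
size_t block_len(size_t n_elems, size_t block_size, size_t block_idx) {
    size_t start = block_idx * block_size;
    if (start >= n_elems) {
        return 0;
    }
    size_t remaining = n_elems - start;
    return remaining < block_size ? remaining : block_size;
}

/* Round-robin placement of blocks on servers (an assumption);
 * n_servers must be > 0. */
size_t server_for_block(size_t block_idx, size_t n_servers) {
    return block_idx % n_servers;
}
```

For example, a 10-element parameter with a block size of 4 yields three blocks of lengths 4, 4, and 2; with two servers, blocks 0 and 2 land on server 0 and block 1 on server 1.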

Contributor

@typhoonzero left a comment

LGTM

Contributor

@gongweibao left a comment

LGTM

Contributor

@Yancey1989 left a comment

LGTM!

@helinwang merged commit 1ba8206 into PaddlePaddle:develop May 13, 2017
@helinwang deleted the trainer_design branch May 13, 2017 05:50
7 participants