-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design Doc: The Client Library of Parameter Server #2075
Conversation
doc/design/cluster_train/trainer.md
Outdated
// to all parameter servers. | ||
func (*PServerClient) Save(path string) error | ||
``` | ||
Please see [master server design doc](master_server.md) for the definition of `master.Task`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a dead link
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file is in a PR. Will be a good link after the PR being merged.
doc/design/cluster_train/trainer.md
Outdated
|
||
// GetTask gets a new task by telling the master server the finished task. | ||
// Use nil as the finished task when getting the task for the first time. | ||
func (*MasterClient) GetTask(finished master.Task) (master.Task, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish a task
和Get a new task
是否分成两个函数比较好?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我把Master部分删了,下次再发一个PR,会分成两个函数。
doc/design/cluster_train/trainer.md
Outdated
// Parameter is a piece of data to sync with the parameter server. | ||
type Parameter struct { | ||
Name string | ||
ElementType ElementType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
变量名和变量的类型是否需要区分一下?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这倒不影响,因为ElementType作为变量名的调用方式是Parameter.ElementType,不会与全局scope的ElementType冲突。
doc/design/cluster_train/trainer.md
Outdated
|
||
The Go interface is the basic abstraction of communications with the master server and parameter servers. We will add another layer on top (add retry logic, polish interface with C idiom) before exposing the library with a [C interface](#c-interface). | ||
|
||
```go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these interfaces kind of duplicate to #1964? The communication interfaces are export by the server and will be called remotely by the client. So the server and the client need to use the same piece of interface definition code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client library is not just a simple wrapper on top of RPC. For example ParamInitChan
will use an etcd lock to figure out who will init the parameters (does not call parameter server RPC).
Indeed, the API here highly depends on the parameter server RPC API. So this PR and #1964 depends on each other. This in the API in my mind, I will be discussing with @dzhwinter and everyone to form the same API, borrowing ideas from both PRs.
Well, in general, there are two ways of implement a parameter server:
It's the first way in this document. Is the second choice possible? All trainers will use the same code, so passing a different expression to parameter server seems not a problem? @typhoonzero The second way is possible to implement, but it's hard for me to see the benefit of it: I think what user wants is to choose an update algorithm, Helin @helinwang @typhoonzero |
doc/design/cluster_train/trainer.md
Outdated
@@ -0,0 +1,82 @@ | |||
# Design Doc: Trainer Communication Library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Design Doc: Trainer Communication Library
==>
Design Doc: The Client Library of Parameter Server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
doc/design/cluster_train/trainer.md
Outdated
@@ -0,0 +1,82 @@ | |||
# Design Doc: Trainer Communication Library | |||
|
|||
For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the trainer's communication library, which will manage communication with parameter servers and the [master server](master_server.md). The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The client library should focus on talking to parameter server, not to the master process, because it is the client library of the parameter servers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. I removed the master communication part.
doc/design/cluster_train/trainer.md
Outdated
|
||
For an overview of trainer's role, please refer to [distributed training design doc](README.md). In this design doc, we will discuss the trainer's communication library, which will manage communication with parameter servers and the [master server](master_server.md). The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file. | ||
|
||
## Go Interface |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go Interface
==>
Go Implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the Go API with C API. Go Implementation can be discussed later.
doc/design/cluster_train/trainer.md
Outdated
|
||
## Go Interface | ||
|
||
The Go interface is the basic abstraction of communications with the master server and parameter servers. We will add another layer on top (add retry logic, polish interface with C idiom) before exposing the library with a [C interface](#c-interface). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We write this client library in Go, because the parameter server is written in Go. To make the client library callable from trainers, which is written using C/C++, we need a C wrapper of this Go library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced the Go API with C API.
doc/design/cluster_train/trainer.md
Outdated
type Parameter struct { | ||
Name string | ||
ElementType ElementType | ||
Buffer []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer => Content ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to content (in C API, Go API has been removed.)
## C Interface | ||
|
||
```c | ||
#define PADDLE_ELEMENT_TYPE_INT32 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define => enum?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Done.
#define PADDLE_ELEMENT_TYPE_FLOAT32 4 | ||
#define PADDLE_ELEMENT_TYPE_FLOAT64 5 | ||
|
||
typedef struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to specify a namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c里面貌似没有namespace,一般用前缀来防止link时候的冲突,比如我们用了paddle_
。
* @param learning_rate the learning rate for the gradients. | ||
* @return 0 if successful, otherwise -1. | ||
*/ | ||
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int total, double learning_rate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use the same interface with "set_parameter" better? otherwise, the set_parameters only used once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the set_parameter interface, only keeping paddle_send_grads.
Here are the reasons:
As far as I know, set_parameter is only used for saving the learning rate (or optimizer related stuff) so that next the user starts the training job, the learning rate can be the value when the model is saved. I think such use case does not justify the complication added to the system:
- What should we do when saving parameters, do we saves these parameters as well?
- Make the interface harder to understand.
If the user wants to use a different learning rate the next time he starts a training job, he can print the learning rate to screen and change the python code accordingly.
If we think this use case is important, maybe we can discuss and add later? For now I think we need to choose the simple way.
Maybe there are other use cases for set_parameter that I am unaware of?
* @return 0 if successful, otherwise -1. | ||
*/ | ||
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int total, double learning_rate); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, I want to know where the Updater/optimizer should be.
The trainer need to calculate gradients before send grads, and the pserver also need optimizer to store the training state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually here it's only sending the learning rate, not the update method. The update method is sent here using Proto Buffer.
请问下这个PR的设计里是否考虑了稀疏更新和正则化的问题?这两点在PServer端实现会比较复杂。 |
@reyoung 谢谢提醒!如何支持Sparse更新以及正则化的支持我都加到design doc里了,请看更新版本。 |
|
||
## Parameter Partition | ||
|
||
Each parameter will be partitioned into parameter chunks to make the parameters evenly distributed on parameter servers. The partition is done automatically by the client library. The *sparse parameter* require a little different treatment: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we should unify the Parameter Partition name with ParameterServer, I just wrote ParameterBlock
as the Paddle already have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Renamed parameter chunks to parameter blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Maybe here is better for review.