ParameterServer Design Doc #1964
.ycm_extra_conf.py
Outdated
@@ -0,0 +1,133 @@ | |||
import os, sys |
What is this file for?
Already deleted, done.
It seems this file has not been deleted.
fix it
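- parameters: the parameters of a neural network, including weights w and biases b. A neural network model consists of a large number of parameters.
- shard: a slice; usually one of the parts obtained by splitting a whole into several pieces.
- parameter block: multiple parameter blocks make up one model shard (the existing model-parallel strategy is parameter-block based and is kept in the new architecture)
- Single point of failure: at any moment at most one server is assumed to be down. Because the probability that two machines in the cluster fail at the same time is extremely low ((mean failure rate * mean time to repair)^2), disaster recovery for two or more simultaneous failures is considered only for special online systems.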
This is fine in the common case, but a node may go down for reasons other than a fault, such as maintenance (for example, a rolling kernel update). Is handling two or more simultaneous failures any different from handling one? If not, I think this item can be removed.
Done.
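For reference, the estimate quoted in the single-point-of-failure item, (mean failure rate * mean time to repair)^2, can be worked through with a small sketch. It assumes the two machines fail independently and that the product of failure rate and repair time is much smaller than 1. The function names and the sample numbers are hypothetical and do not come from the PR.

```cpp
#include <cassert>

// Approximate fraction of time one machine is down:
// failures per hour multiplied by the hours needed to repair one failure.
// Only a reasonable approximation when the product is much smaller than 1.
double DownFraction(double failures_per_hour, double repair_hours) {
  assert(failures_per_hour >= 0.0 && repair_hours >= 0.0);
  return failures_per_hour * repair_hours;
}

// Probability that two independent machines are down at the same moment.
double BothDownProbability(double failures_per_hour, double repair_hours) {
  const double p = DownFraction(failures_per_hour, repair_hours);
  return p * p;
}
```

With one failure per 1000 hours and a 10-hour repair, `DownFraction` is about 0.01 and `BothDownProbability` is about 0.0001. Planned maintenance such as a rolling kernel update is not covered by this estimate, which is the reviewer's point above.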
Optimization algorithms the Optimizer needs to support

L-BFGS, owlqn, ftrl. TODO: in Paddle, owlqn and similar algorithms update parameters differently; can they share the same interface?
Does Paddle (the non-fault-tolerant version) currently support these optimization algorithms? What do the interfaces look like?
Data and commands are both sent and received through the rpc interface, e.g. golang rpc

```c++ | ||
class Evaluator; |
Is this large block of code an excerpt of Paddle's current code, or the interface of the parameter server library being designed?
Is PClient's functionality already included in the trainer? PClient is responsible for the parameter balancer and for packing rpc requests and forwarding them to the PServer.

```c++ | ||
class ParameterPartitioner; |
How does Paddle currently do parameter partitioning?
What does the ParameterPartitioner proposed here operate on? Does it partition per matrix in each layer?
static createRequest(RPCRequest*, RPCResponse); | ||
static AsyncRPCServer& singleton(); | ||
// send rpc call asynchronize | ||
void send_async(); |
Why are `send_async` and `send_sync` member functions of AsyncRPCServer rather than of RPCRequest?
- Startup and runtime parameters include:

`/PS_DESIRED`:, the number of PServer instances to start; etcd storage format `/PS_DESIRED:3`
Once scaling is supported, PS_DESIRED will be variable, so it may need to be read from etcd rather than passed as a command-line argument when the parameter server starts.
Both reading from etcd and reading from the command line are supported. The `etcd storage format` specified below means the value is stored in etcd at runtime and is used on restart or re-scaling.
```c++ | ||
// Used for request, package up request into binary | ||
struct RpcRequest { | ||
uint64_t _request_id; |
Please document what each field is for; it is hard to fully understand from reading the code alone.
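PServer is responsible for:

model storage, model updates, registering the service and listening for port events, dynamically scaling the number of PServers up and down, and serializing data for transfer.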
The current version should not yet consider dynamically scaling the number of PServers. Supporting it would require re-hashing parameters and other data whenever PServers are added or removed.
This is already considered in the PClient design section; it belongs to the slicer partitioning module.
} | ||
``` | ||
|
||
<img src="src/hashring.png" width="300"/> |
This figure has no text explaining what it means.
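Assuming the figure depicts a consistent-hashing ring (the file name `hashring.png` suggests so, but the PR text does not say), the idea can be sketched as follows. `HashRing`, its methods, the FNV-1a hash, and the virtual-node count are all hypothetical choices for illustration, not the interface proposed in this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical consistent-hash ring mapping parameter keys to server names.
class HashRing {
 public:
  explicit HashRing(int virtual_nodes = 16) : virtual_nodes_(virtual_nodes) {
    assert(virtual_nodes > 0);
  }

  // Places several virtual nodes for the server on the ring.
  void AddServer(const std::string& name) {
    for (int i = 0; i < virtual_nodes_; ++i) {
      ring_[Hash(name + "#" + std::to_string(i))] = name;
    }
  }

  // Removes the server's virtual nodes; other servers' nodes stay in place.
  void RemoveServer(const std::string& name) {
    for (int i = 0; i < virtual_nodes_; ++i) {
      ring_.erase(Hash(name + "#" + std::to_string(i)));
    }
  }

  // Returns the owner of the key: the first virtual node at or after the
  // key's hash, wrapping around. Returns "" when the ring is empty.
  std::string Lookup(const std::string& key) const {
    if (ring_.empty()) return "";
    auto it = ring_.lower_bound(Hash(key));
    if (it == ring_.end()) it = ring_.begin();
    return it->second;
  }

 private:
  // 64-bit FNV-1a, chosen only because it is deterministic across platforms.
  static std::uint64_t Hash(const std::string& s) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : s) {
      h ^= c;
      h *= 1099511628211ULL;
    }
    return h;
  }

  int virtual_nodes_;
  std::map<std::uint64_t, std::string> ring_;
};
```

The property that matters for re-hashing is that removing a server only moves the keys that server owned; keys owned by the remaining servers keep their owner.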
|
||
sgd (momentum, adagram, adadelta, adam),pass based | ||
|
||
async-sgd |
Do the optimization algorithms above all use the same communication API, or does each algorithm use a different API to communicate with the trainer?
Because of implementation differences, algorithms such as owlqn should have a different API. For now only sgd-family optimization algorithms are supported, which simplifies this interface.
## PClient

Is PClient's functionality already included in the trainer? PClient is responsible for the parameter balancer and for packing rpc requests and forwarding them to the PServer.
A design doc should not contain question marks; everything should be written as declarative statements. Open questions can be raised as issues and discussed to reach a conclusion.
Fixed, done.
etcd storage format `ROOT_PORT:8000, /PS/0:8000,/PS/1:8001 `

`/CHECKPOINT_PERIOD`: the interval at which a running PServer saves snapshots, default filled
This setting should not need to live in etcd; passing it in from the command line is enough.
CHECKPOINT_PERIOD arrives through an RPC call: to provide flexibility, the user's event_handler has to specify a checkpoint time to save the model of a given pass_id. Since CHECKPOINT_PERIOD is part of PServerConfig, it is stored in etcd; storing this parameter separately or not is fine either way.
If the goal is to let the client set when the pserver saves a checkpoint, why not have the trainer trigger a checkpoint directly through an RPC call? As I understand it, there is also no need to change CHECKPOINT_PERIOD during training.
[reply] Why not have the trainer trigger a checkpoint directly through an RPC call?
That is already how it works: the trainer triggers a checkpoint through an RPC call. More precisely, the user's Python program defines the event_handler and sets CHECKPOINT_DIR and CHECKPOINT_PERIOD, which are passed to the trainer, and the trainer passes them to the PServer via RPC.
[reply] As I understand it, there is no need to change CHECKPOINT_PERIOD during training.
The case I had in mind was that a recovered or newly added PServer would need this parameter...
delete it
- PServer: Parameter Server server
- PClient: Parameter Server Client
- PServerController: the PServer administrator; starts servers and handles dynamic scaling, disaster recovery, etc.
Since this version does not consider dynamic scaling, should we leave it out for now?
This version will not implement that feature for now; the design doc writes it down to evaluate how functionality is divided.
## Terminology

- PServer: Parameter Server server
- PClient: Parameter Server Client
Let's make these consistent: Parameter Server server and Parameter Server client?
fixed
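- PServerController: the PServer administrator; starts servers and handles dynamic scaling, disaster recovery, etc.
- model: all the parameters obtained after deep learning training; this neural network can be used to make predictions on new data
- parameters: the parameters of a neural network, including weights w and biases b. A neural network model consists of a large number of parameters.
- shard: a slice; usually one of the parts obtained by splitting a whole into several pieces.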
Are the shard here and the model shard on the next line the same thing?
fixed
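## PServerController

Creates and manages PServer instances according to the ParameterServer parameters; reads parameters from the command line and from etcd; at startup, stores the configuration of the live PServer instances in etcd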
Is this a separate process? Is its relationship to PServer 1:1 or 1:n?
The relationship is 1:n, and it is a separate process. We have not yet considered what happens if it goes down...
- Startup and runtime parameters include:

`/PS_DESIRED`:, the number of PServer instances to start; etcd storage format `/PS_DESIRED:3`
There is an extra ",".
fixed
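`/PS_DESIRED`:, the number of PServer instances to start; etcd storage format `/PS_DESIRED:3`

`/ROOT_PORT`: explicitly specifies the root port. PServer ports start from PORT+1 until an available port is found; for example, with ROOT_PORT=8000, PServer0_Port=8000, PServer0_Port=8001, …. The configuration of the PServer instances that currently exist is whatever etcd holds in real time.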
Starting from PORT+1 => starting from PORT and incrementing by 1? Also, if it runs in a container there should be no port conflicts.
*PServer0_Port=8000,PServer0_Port=8001* => *PServer0_Port=8000,PServer1_Port=8001*
@Yancey1989 Good question! In last night's discussion with Zhihong and Longfei we decided to use only one port.
@helinwang Got it 👍
https://github.com/dzhwinter/Paddle/blob/develop/doc/design/cluster_train/PServer_design_doc.md may be clearer to review.
@@ -0,0 +1,261 @@ | |||
# Design Doc: Parameter Server Process |
We are not designing a process, but designing a server program. Therefore the title should be
Design Doc: Parameter Server
fix it
@@ -0,0 +1,261 @@ | |||
# Design Doc: Parameter Server Process | |||
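The Parameter Server process is the Paddle component responsible for model storage, updates, and consistency of model shards. For its role in the overall system, see the [distributed training design doc](./README.md). This document covers PServer, PClient, PServerController, etc.; all configuration parameters involved are written in uppercase letters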
A server is a program. A server instance is a process.
fix it
@@ -0,0 +1,261 @@ | |||
# Design Doc: Parameter Server Process | |||
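The Parameter Server process is the Paddle component responsible for model storage, updates, and consistency of model shards. For its role in the overall system, see the [distributed training design doc](./README.md). This document covers PServer, PClient, PServerController, etc.; all configuration parameters involved are written in uppercase letters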
"Storage", "update", and "sharding" are not parallel concepts and should not be listed side by side like this.
The sole purpose of a parameter server is to update the model during distributed training. Parameter-server-based distributed update algorithms require a copy of the model to be stored on the parameter server. We shard so that trainers can access this copy efficiently. In fact, users can choose not to shard at all by starting just one parameter server instance.
Yes, written this way the concept is indeed inaccurate; it does not hold for a single parameter server instance. I wrote it that way to be consistent with the design later in the doc. fix it
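- PServer: Parameter Server Server; responsible for model storage, invoking distributed updates, and responding to PClient requests
- PClient: Parameter Server Client; responsible for balancing PServer requests and for packing and forwarding RPC requests
- PServerController: responsible for starting servers, dynamic scaling, disaster recovery, etc.
- model: all the parameters obtained after deep learning training; this neural network can be used to make predictions on new data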
model = topology + parameters, model != parameters
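1. model storage, 2. registering the service and listening for port events, 3. dynamically scaling the number of PServers up and down, 4. serializing data for transfer.

All send and receive calls use the rpc interface (see RPCServer below), for example with the corresponding interface implemented using golang rpc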
In English, abbreviations should be written in all capitals. RPC = remote procedure call
Sorry, I will fix these typos.
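1. model storage, 2. registering the service and listening for port events, 3. dynamically scaling the number of PServers up and down, 4. serializing data for transfer.

All send and receive calls use the rpc interface (see RPCServer below), for example with the corresponding interface implemented using golang rpc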
golang ==> Go
golang is only a domain name (golang.org), because go.org was already registered. The language is called Go.
fix it
All send and receive calls use the rpc interface (see RPCServer below), for example with the corresponding interface implemented using golang rpc
|
||
```c++ |
It was mentioned earlier that we are going to write the parameter server in Go, so why is C++ code pasted here?
This only describes the interface, for discussion. I was not yet familiar with Go when writing the first draft.
class SparseParameterUpdater { | ||
|
||
} | ||
/* Currently supports sgd-family algorithms; owlqn, L-BFGS, etc. are not supported
sgd ==> SGD (Stochastic Gradient Descent)
fix it
class SparseParameterUpdater { | ||
|
||
} | ||
/* Currently supports sgd-family algorithms; owlqn, L-BFGS, etc. are not supported
owlqn ==> OWLQN (Orthant-Wise Limited-memory Quasi-Newton)
fix it
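- PClient: Parameter Server Client; responsible for balancing PServer requests and for packing and forwarding RPC requests
- PServerController: responsible for starting servers, dynamic scaling, disaster recovery, etc.
- model: all the parameters obtained after deep learning training; this neural network can be used to make predictions on new data
- Tensor: an NDArray structure; the basic data structure exchanged among Trainer, PServer, and PClient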
Paddle needs to cut a Tensor into small pieces and send them to the PServer, so the basic unit should be an array of float32 / int32. (The parameter block in the previous code does exactly this.)
For the ParameterServer it is indeed an array of float32 / int32; to unify the concept it is written as Tensor.
With a typedef, a 1-d Tensor is just an array of float32 / int32.
All send and receive calls use the rpc interface (see RPCServer below), for example with the corresponding interface implemented using golang rpc

```c++ | ||
class Evaluator; |
A design doc only needs to describe the interface. Here even the private variables and methods are written out. First, the implementation will almost certainly not follow the exposed details (the RPC server will be implemented in Go); second, spelling out this many details overwhelms the reader. A design doc should be concise.
/* part 3.a. Trainer/worker auto scaling insert or remove during training */ | ||
unordered_map<string/*trainer name*/, Trainer*> | ||
/* part 3.b. PServer auto scaling during training */ | ||
rehash key based on Pserver, see PClient Part |
Should this line be a comment?
fix it
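- shard: the local slice of the full model held on a given PServer; usually one of the parts obtained by splitting a whole Model into several pieces.
- parameter block: multiple parameter blocks make up one model shard (the existing model-parallel strategy is parameter-block based and is kept in the new architecture)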
|
||
## PServer |
Do PServer, PClient, RPCServer, and PServerController refer to multiple modules or classes in the program?
Yes. The names were easy to confuse, so they have been renamed.
`/CHECKPOINT_DIR`: the path where snapshots are saved, default filled

- Interface for creating a PServer
If you want to paste interface code, I suggest opening a separate ISSUE for the interface design, as @helinwang did for the Trainer Library.
👌, good idea
|
||
RWLock lock; | ||
/* ParameterServer_id used by checkpoint */ | ||
int32_t ParameterServer_id; |
Following the camel case convention, ParameterServer_id => ParameterServerID
fix it
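## Terminology

- ParameterServer: ParameterServer Server; responsible for model storage, invoking distributed updates, and responding to ParameterClient requests
- ParameterClient: ParameterServer Client; responsible for balancing ParameterServer requests and for packing and forwarding RPC requests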
I don't quite understand what balancing ParameterServer requests means. Does it mean the ParameterServer has to handle concurrent requests from multiple Trainers?
"Balancing requests" is misleading wording; it has nothing to do with the Trainer. It means splitting the requested Tensor and updating the pieces on different ParameterServer instances, so that the load on each instance is roughly equal.
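One possible reading of this splitting step, as a minimal sketch: cut a flat Tensor into fixed-size blocks and hand the blocks to the ParameterServer instances in round-robin order. `BlockAssignment`, `PartitionTensor`, the fixed block size, and the round-robin policy are assumptions made for illustration; the PR does not specify how the slicer actually assigns blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: elements [begin, end) of the flat Tensor go to
// the ParameterServer instance with index `server`.
struct BlockAssignment {
  std::size_t server;
  std::size_t begin;
  std::size_t end;
};

// Splits a Tensor of num_elements values into blocks of at most block_size
// elements and assigns block i to instance (i mod num_servers), so the
// number of blocks per instance differs by at most one.
std::vector<BlockAssignment> PartitionTensor(std::size_t num_elements,
                                             std::size_t block_size,
                                             std::size_t num_servers) {
  assert(block_size > 0 && num_servers > 0);
  std::vector<BlockAssignment> blocks;
  std::size_t index = 0;
  for (std::size_t begin = 0; begin < num_elements;
       begin += block_size, ++index) {
    const std::size_t end = std::min(begin + block_size, num_elements);
    blocks.push_back({index % num_servers, begin, end});
  }
  return blocks;
}
```

For example, `PartitionTensor(10, 4, 3)` yields three blocks, [0, 4), [4, 8), and [8, 10), assigned to instances 0, 1, and 2. Round-robin balances block counts only; balancing by update cost would need a different policy.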
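- ParameterServer: ParameterServer Server; responsible for model storage, invoking distributed updates, and responding to ParameterClient requests
- ParameterClient: ParameterServer Client; responsible for balancing ParameterServer requests and for packing and forwarding RPC requests
- ParameterServerController: responsible for starting servers, dynamic scaling, disaster recovery, etc.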
- I am not sure where this part should be done, because starting PServer instances may depend on the cluster scheduling mechanism: for example, Kubernetes takes a submitted YAML, while MPI requires submitting an MPI job.
- Should operations such as updating etcd all be done on the Master node, with PServer and Trainer communicating with the Master over RPC instead of updating etcd directly?
@typhoonzero @helinwang please take a look as well, thanks!
ParameterServerController is a standalone module whose purpose is precisely to hide the differences between cluster scheduling mechanisms.
@helinwang How about implementing the second point on the Master node?
@Yancey1989 @dzhwinter @typhoonzero Result of last night's discussion with Zhihong: starting the servers should be done by the cluster management system (e.g. Kubernetes). This version does not consider dynamically adjusting the number of pservers, so let's not design how etcd is updated for now.
Got it, 👍