*Draft about Python Package Design (PR #3569)*
# Paddle Python Module Design

![DOT](http://api.paddlepaddle.org/graphviz?dot=https://raw.githubusercontent.com/reyoung/graphviz_dots/master/refactor/python_arch.dot)
This design doc discusses how to design the Python modules after the Paddle refactor, i.e. the red modules in the figure above. The topmost, user-facing layer is our `paddle.v2` API, while the lowest layer, facing C++, consists of the refactored `Operator`, `Scope`, etc.

After discussion and some initial drafts, the team reached a preliminary consensus on the Python module design:

![design](https://raw.githubusercontent.com/reyoung/graphviz_dots/master/refactor/graph.png)
## Model

Model aggregates the information needed to train a neural network. It records:
1. A stack of topologies (a stack, so that RNN and IfElseOp can be implemented)
2. A stack of Scopes
3. The parameter-initialization network
4. The parameters, and the mapping from parameters to their gradients
5. Device information

> **Review comment:** In tf, only one session is needed; `session.run()` builds the new topology. If the Model here is analogous to `tf.Session`, then only one should be needed:
>
> ```python
> tf.device('/cpu'):
>     a = tf.tensor(xxx)
>     b = tf.tensor(xxz)
> tf.device('/gpu:0'):
>     a0 = tf.tensor(xxx)
>     b0 = tf.tensor(xxx)
> # device bind
> tf.device('/gpu:1'):
>     aa = tf.add(a, a0)
>     bb = tf.add(b, b0)
> session = tf.Session()
> # create two subgraphs easily
> aa_tensor = session.run([aa])
> bb_tensor = session.run([bb])
> ```
>
> Written like this, a Model feels unnatural to use.
>
> **Reply:** Because `device` is currently a member variable of Model, a Model can only support a single device. Building across devices would require multiple Models.
>
> **Reply:** @Superjom We are not considering the multi-device case for now.

The corresponding type in Caffe2 is called `model_helper`, and in TensorFlow it is `session`. In Chunwei's PR, the similar concept is named [`session`](https://github.com/PaddlePaddle/Paddle/pull/3566/files#diff-b6e4bb9095a126ed31ee1cdae03af483R9).

Note that Model does not implement individual layers; `fc_layer`, for example, is implemented as a global function, not inside the Model class.

Also, Model's device information is empty by default. The user can set the device at the start, or at any time before running. Each Model binds exactly one device. Once a Model has been run (i.e. `init_param()` or `run()` has been called), its device can no longer be changed.
> **Review comment:** Setting the device place should be a fairly simple operation; why set the device at the Model level? A net should be able to hold ops on different devices, and setting the device per op is more flexible. Compare:
>
> ```python
> # sub-graph on gpu 0
> tf.device('/gpu:0'):
>     a = tf.tensor()
>     b = tf.tensor()
>     c = tf.add(a, b)
> # subgraph
> tf.device('/gpu:1'):
>     d = tf.add(a, c)
> # subgraph
> tf.device('/gpu:2'):
>     e = tf.add(c, d)
> session = tf.Session()
> # a subgraph executed on 3 devices: gpu0, gpu1, gpu2
> e = session.run([e])
> ```
>
> This sets devices flexibly; the model need not manage the topology (it can be generated automatically) nor set a device (each op can set its own).
>
> **Reply:** @Superjom We are not considering the multi-device case for now.
>
> **Reply:** Setting the device place is not simple at all. In particular, moving a Tensor or Op whose device is already set to another device is very painful. And how communication between devices should be handled still needs more thought.
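Collecting the five items above, the Model state could be sketched as the following skeleton. This is a hypothetical illustration only; the field and method names (`net_stack`, `scope_stack`, `init_net`, `param_names`, `param_grad_map`, `device`) are assumptions inferred from identifiers used later in this document, not the actual implementation.

```python
class Model:
    """Hypothetical sketch of the Model state described above."""

    def __init__(self):
        self.net_stack = [[]]     # 1. stack of topologies; bottom is the root net
        self.scope_stack = []     # 2. stack of Scopes
        self.init_net = []        # 3. parameter-initialization network
        self.param_names = set()  # 4. parameters ...
        self.param_grad_map = {}  #    ... and the param -> grad mapping
        self.device = None        # 5. device info; empty by default

    def cur_net(self):
        # Top of the topology stack: where new ops are appended.
        return self.net_stack[-1]

    def root_net(self):
        return self.net_stack[0]
```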

`Model.run()` can take terminal `Expression`s as arguments; Paddle then runs only the sub-graph of the model that is needed to compute those Expressions. If none are given, the whole Model is run. The extracted sub-graph can be cached. The extraction algorithm is sketched below:
```python
def extract_subnet_from(net, need_vars, start=-1):
    # Walk the net backwards from `start` (default: the last op). Keep
    # every op that produces a needed variable, and mark that op's
    # inputs as needed too.
    if start < 0:
        start = len(net) - 1
    subnet = []
    for pos in range(start, -1, -1):
        op = net[pos]
        if any(output in need_vars for output in op.outputs):
            subnet.append(op)
            need_vars.update(op.inputs)
    return list(reversed(subnet))
```
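To sanity-check the extraction rule, here is a self-contained rendering of the same logic with a stub `Op` type; the namedtuple and the three-op net are invented purely for illustration:

```python
from collections import namedtuple

# Stub op: just named inputs and outputs.
Op = namedtuple("Op", ["name", "inputs", "outputs"])

def extract_subnet_from(net, need_vars, start=-1):
    if start < 0:
        start = len(net) - 1
    subnet = []
    for pos in range(start, -1, -1):
        op = net[pos]
        if any(output in need_vars for output in op.outputs):
            subnet.append(op)
            need_vars.update(op.inputs)
    return list(reversed(subnet))

net = [
    Op("mul", inputs={"x", "w"}, outputs={"xw"}),
    Op("add", inputs={"xw", "b"}, outputs={"fc_out"}),
    Op("softmax", inputs={"other"}, outputs={"prob"}),  # not needed for fc_out
]

sub = extract_subnet_from(net, need_vars={"fc_out"})
print([op.name for op in sub])  # -> ['mul', 'add']
```

The `softmax` op is pruned because nothing in `need_vars` depends on its outputs, while `mul` is kept transitively through `add`'s input `xw`.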

Users may create any number of Models, and different Models can share the same root Scope. Paddle provides one global Model by default:

```python
g_model = Model()
```
## Model Methods

We use global functions (or global classes) to modify a Model. These global functions differ in arguments and implementation, but all satisfy the following:

* They take a `model` argument, i.e. `fc(..., model=None)`. If the user does not set `model`, the default global `g_model` is used. (This provides compatibility with the v2 API: the v2 API is equivalent to always using the global `model`.)
* They accept the outputs of other Model Methods as inputs; i.e. if an argument is another layer's output, its type is Expression.
* Every model method returns an Expression, a list of Expressions, or None.
* Model Methods modify the topology inside the Model, by default obtaining the top-of-stack network via `model.cur_net()`.
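The `model=None` fallback described in the first bullet is what the `default_model()` helper used in later code samples implements. A minimal sketch (the empty `Model` stub here is a stand-in; only the fallback logic is the point):

```python
class Model:
    pass  # stand-in for the real Model class

g_model = Model()  # the default global model

def default_model(model=None):
    # Fall back to the global g_model when no model is passed,
    # which is how v2-API-style global configuration is supported.
    return g_model if model is None else model
```

So a call such as `fc(...)` without a `model` argument modifies `g_model`, while `fc(..., model=m)` modifies `m`.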

Broadly, Model Methods fall into three categories:

1. Model methods such as `data_layer`/`parameter` that do not modify the topology
2. Model methods such as `fc`/`conv` that modify the topology according to user configuration
3. Model methods such as `sgd`/`backward` that modify the topology based on the Model's own state

### Model methods that do not modify the topology

To unify the input and output types of Model Methods, `data_layer` also returns an Expression. A DataLayer can be implemented as:
```python
def data_layer(name, shape, model=None):
    if model is None:
        model = g_model

    model.root_scope().new_var(name).get_tensor().resize(shape)
    return Expression(name=name, model=model, op_pos=-1)
```

Similarly, model parameters also need to be converted into Expressions, via the `parameter` function:
```python
def parameter(name, dim, attr, model=None):
    if model is None:
        model = g_model

    # Params are always created in the root scope.
    if model.root_scope().find_var(name) is not None:
        # The param has been created before; do not create it again.
        return Expression(name=name, model=model, op_pos=-1)
    model.root_scope().new_var(name).get_tensor()

    # This line could be changed depending on attr; not only
    # uniform_random is supported. Just a demo here.
    model.init_net.create_and_add_op("uniform_random", **attr)

    model.param_names.add(name)
    return Expression(name=name, model=model, op_pos=-1)
```

> **Review comment:** Shouldn't the name of the parameter being initialized be specified here, i.e. the `name` argument?
>
> **Reply:** Right; those parameter names would go into `attr`.

### Model methods that modify the topology according to user configuration

Model Methods that modify the topology according to user configuration must take care to:

1. Use `model.cur_net()` and `model.cur_scope()` to obtain the current network and scope.
2. Use the `parameter` function if parameters need to be created.

> **Review comment:** If `model.run([targets])` is supported, the net should not need to be exposed at all. `tf.session.run` has no concept like net, and users never need to know the topology details; they only care about which target they want. To get target0's tensor value, just …

Why must new topology be created with `cur_net()`, i.e. the network at the top of the stack? The reason is as follows.

`NetOp` in Paddle is an array of ops, while RNNOp and similar ops such as IfElseOp each hold one or more internal networks:
```c++
class NetOp {
  std::vector<Op*> ops_;
};

class RNNOp {
  NetOp step_net_;
};
```

The process of building a network that contains an RNNOp is shown below:

![stack_of_op](https://raw.githubusercontent.com/reyoung/graphviz_dots/master/refactor/stack_of_op.png)
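The stacking behavior in the figure can be mimicked with a small stub: entering an RNN step block pushes a fresh net, layer functions append ops to `cur_net()`, and leaving the block pops the step net back into the enclosing network as a single op. All names here are hypothetical illustration, not the real API:

```python
class NetStack:
    """Toy model of the topology stack shown in the figure above."""

    def __init__(self):
        self.stack = [[]]  # bottom entry is the root net

    def cur_net(self):
        return self.stack[-1]

    def push_net(self):
        # Entering e.g. an RNN step block: new ops now go to the step net.
        self.stack.append([])

    def pop_net(self):
        # Leaving the block: the finished step net is nested into the
        # enclosing net as a single (RNN) op.
        step_net = self.stack.pop()
        self.cur_net().append(("rnn", step_net))
        return step_net

stack = NetStack()
stack.cur_net().append("fc1")      # appended to the root net
stack.push_net()
stack.cur_net().append("step_fc")  # appended to the RNN step net
stack.pop_net()
stack.cur_net().append("fc2")      # back to the root net
```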

An example `fc` layer function:
```python
def fc(input, size, param_attr=None, bias_attr=False, act="sigmoid", model=None):
    model = default_model(model)
    dim = input.tensor().get_dims()
    w = parameter("w", dim=[dim[1], size], attr=param_attr)
    tmp = model.cur_net().create_and_add_op("mul", X=input, Y=w)
    if bias_attr:
        b = parameter("b", dim=[size], attr=bias_attr)
        tmp = model.cur_net().create_and_add_op("rowwise_add", X=tmp, Y=b)

    if act:
        tmp = model.cur_net().create_and_add_op(act, X=tmp)

    return Expression(tmp, model)
```

### Model methods that modify the topology based on the Model's own state

This mainly means functions such as `backward` and `sgd`. Although they modify the model, they do not add a single Op or a fixed number of Ops; instead, they add Ops according to the model's current state. For example, the `sgd` method:
```python
def sgd(learning_rate=1e-3, model=None):
    model = default_model(model)
    ret_val = []
    for param_name, grad_name in model.param_grad_map.items():
        model.root_net().create_and_add_op(
            "sgd", learning_rate=learning_rate,
            X=param_name, Y=grad_name, Out=param_name)
        ret_val.append(Expression(param_name, model))

    return ret_val
```
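The `sgd` op added above performs, for each parameter, the in-place update `param = param - learning_rate * grad` (X = param, Y = grad, Out = param). A dependency-free sketch of that update rule, illustrative only and not the actual C++ op:

```python
def sgd_update(param, grad, learning_rate=1e-3):
    # Out = X - learning_rate * Y, written back into X (element-wise
    # over plain Python lists to stay dependency-free).
    for i in range(len(param)):
        param[i] -= learning_rate * grad[i]
    return param

w = [1.0, 2.0]
g = [10.0, 10.0]
sgd_update(w, g, learning_rate=0.1)
print(w)  # -> [0.0, 1.0]
```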

`backward`, on the other hand, mainly calls the C++ `Backward` function and appends the generated network to `model.root_net()`.
## Expression

Expression is the unified input/output type of Model Methods. Chunwei's PR names this concept [`Var`](https://github.com/PaddlePaddle/Paddle/pull/3566/files#diff-c00711137ba1e93c609c893c649d302c). After discussion, this concept needs two operations:
```python
class Expression():
    def name(self):
        """Return the final variable name of this expression (a str)."""

    def value(self):
        """Return the variable's value after computation (a numpy.ndarray)."""
```

In an Expression are stored:

1. A pointer to the Model
2. The position, within the Model, of the Op that computes this variable (for an RNN, the position within the sub-graph)
3. The variable's name
With these, `Expression.value()` can fetch the computed result of a variable. If the variable has not been computed yet (i.e. the data was updated but the network has not run), it can be computed on demand.

> **Review comment:** The Net should then keep an index recording which op the network has currently run up to.
>
> **Reply:** It could, indeed. But this index can also be kept on the Python side instead of inside Net, and it can be reset every time data is fed.
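The on-demand computation plus the run index discussed in the review can be sketched as follows. Everything here (`LazyNet`, the callable "ops") is a hypothetical illustration of the bookkeeping, not Paddle code:

```python
class LazyNet:
    """Toy sketch: run ops only up to what a value() request needs."""

    def __init__(self, ops):
        self.ops = ops      # each op is a zero-arg callable
        self.run_upto = 0   # index of the first op not yet executed

    def run_to(self, pos):
        # Execute only the ops that have not run yet.
        for i in range(self.run_upto, pos + 1):
            self.ops[i]()
        self.run_upto = max(self.run_upto, pos + 1)

    def feed(self):
        # New data invalidates previous results: reset the counter.
        self.run_upto = 0

executed = []
net = LazyNet([lambda: executed.append("op0"),
               lambda: executed.append("op1"),
               lambda: executed.append("op2")])

net.run_to(1)   # computing an Expression produced by op1 runs op0 and op1
net.run_to(1)   # already computed: nothing runs again
net.feed()      # new data fed: everything must be recomputed
net.run_to(0)
```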

Meanwhile, `Model.run()` can also take several `Expression`s to compute. This is described in the `Model` section of this document.

> **Review comment** (on the introductory paragraph): That sentence reads a bit awkwardly. Suggested rewording: "This design doc discusses the design of the Python modules after the Paddle refactor, corresponding to the red modules in the figure above. We need to wrap the refactored low-level C++ concepts `Operator`, `Scope`, etc. sensibly, while ensuring compatibility with the user-facing high-level `paddle.v2` API."