Draft about Python Package Design #3569
Conversation
![DOT](http://api.paddlepaddle.org/graphviz?dot=https://raw.githubusercontent.com/reyoung/graphviz_dots/master/refactor/python_arch.dot)

> This design document discusses how to design the Python modules after the Paddle refactoring. The highest layer, facing users, is our `paddle.v2` API, while the lowest layer, facing C++, is the refactored `Operator`, `Scope`, etc., i.e. the red modules in the figure above.
This sentence reads a bit awkwardly. Suggested rewrite: "This design document discusses the design of the Python modules after the Paddle refactoring, corresponding to the red modules in the figure above. We need to wrap the refactored low-level C++ concepts such as `Operator` and `Scope` in a reasonable way, while staying compatible with the user-facing high-level `paddle.v2` API."
> Also, the device information in a Model defaults to empty. The user can set the device at the very beginning, or set it right before running. Each Model is bound to exactly one device. Once a Model has been run (i.e. `init_param()` or `run()` has been called), its device can no longer be changed.

> `Model.run()` can accept an ending `Expression`; in that case Paddle runs the subgraph of the model that is needed to compute this Expression. If none is given, the whole Model is run. The subgraph extraction can be cached. The extraction algorithm is described as follows:
It would be better to first explain what an `Expression` is.
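For illustration only, here is a minimal sketch of what such an `Expression` handle might look like, based purely on the fields the design text lists further down (the position of the producing Op and the variable name); the class layout and method names are assumptions, not the proposed API:

```python
class Expression(object):
    """Hypothetical handle to one variable created while building the topology.

    It records where the value will come from, not the value itself:
    the owning model, the position of the producing Op in the model's
    topology, and the variable's name in the scope.
    """

    def __init__(self, model, op_index, var_name):
        self.model = model
        self.op_index = op_index
        self.var_name = var_name

    def value(self):
        # Delegate to the model, which may lazily run the ops needed to
        # produce this variable (see the later discussion of run indices).
        return self.model.get_var_value(self)
```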
> A Model is a summary of the information needed to train a neural network. It records:

> 1. A stack of topologies (the stack is needed to implement RNN or IfElseOp).
In TF, only one session is needed. `session.run()` builds a new topology on demand: `session.run([cost0])` builds the subgraph ending at `cost0` and runs it, and `session.run([cost1])` builds the subgraph ending at `cost1` and runs it. If the Model here is analogous to `tf.Session`, then only one should be needed:

```python
import tensorflow as tf

with tf.device('/cpu:0'):
    a = tf.constant(1.0)
    b = tf.constant(2.0)
with tf.device('/gpu:0'):
    a0 = tf.constant(3.0)
    b0 = tf.constant(4.0)
# device bind
with tf.device('/gpu:1'):
    aa = tf.add(a, a0)
    bb = tf.add(b, b0)

session = tf.Session()
# create two subgraphs easily
aa_tensor = session.run([aa])
bb_tensor = session.run([bb])
```

Compared with this kind of usage, writing the same thing against Model feels unnatural.
Because the device is currently a member variable of Model, a Model can only support a single device. If you want to build across devices, you need multiple Models.
@Superjom We are not considering the multi-device case for now.
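As a side note on item 1 above (the stack of topologies), here is a minimal sketch of why a stack helps for RNN/IfElseOp; `Model`, `NetOp`, and the push/pop methods are assumptions for illustration only:

```python
class NetOp(object):
    """Stand-in for the C++ NetOp; here it only holds a list of ops."""
    def __init__(self):
        self.ops = []


class Model(object):
    def __init__(self):
        self.net_stack = [NetOp()]     # bottom of the stack: the root net

    def cur_net(self):
        return self.net_stack[-1]      # layer functions always add ops here

    def push_net(self, sub_net):
        # Called when entering an RNN step net or an IfElseOp branch, so
        # that subsequently created ops land in the sub-net.
        self.net_stack.append(sub_net)

    def pop_net(self):
        # Called when the sub-block has finished being configured.
        return self.net_stack.pop()
```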
> For Model methods that modify the topology according to user configuration, note that:

> 1. Use `model.cur_net()` and `model.cur_scope()` to obtain the network.
If `model.run([targets])` is supported, it feels like `net` does not need to be exposed at all. `tf.session.run` has no concept like `net`; users do not need to know any topology details, they only care about which target they want. If I want the tensor value of `target0`, I just call `tf.session.run([target0])`, without caring about the topology or the Model.
`model.cur_net()` is exposed to layer developers.
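To make the layer-developer view concrete, here is a rough sketch of how a global layer function might use `model.cur_net()` and `model.cur_scope()`; `fc_layer`, the helper methods, and the op/attribute names are all hypothetical:

```python
def fc_layer(input, size, model=None):
    """Hypothetical global layer function; Model does not implement layers itself."""
    if model is None:
        model = default_model()              # assumed global default model
    net = model.cur_net()                    # current (possibly nested) net
    scope = model.cur_scope()                # current scope for variables

    w = parameter("fc.w", dim=[size], attr={}, model=model)
    out_name = scope.new_var_name("fc.out")  # assumed helper for unique names
    net.create_and_add_op("mul",
                          X=input.var_name, Y=w.var_name, Out=out_name)
    return Expression(model, op_index=len(net.ops) - 1, var_name=out_name)
```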
> 2. The position in the Model of the Op that computes this variable (for an RNN, the position in the subgraph).
> 3. The name of the variable.

> With these, we can use `Expression.value()` to obtain the computed result of a variable. If the variable has not been computed yet (i.e. the Data has been updated but the network has not been run), it can be computed on the spot.
Typo: "记算" should be "计算".
Then the Net should keep an index recording which op the network has run up to. When evaluating an Expression, we can check whether it has already been computed.
> The Net should keep an index recording which op the network has run up to. When evaluating an Expression, we can check whether it has already been computed.

Indeed. But this index could also be kept on the Python side instead of inside the Net, and the counter can be reset every time data is fed.
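A small sketch of the Python-side bookkeeping proposed in this thread: the model tracks how far the current topology has been run, resets the counter on every feed, and `Expression.value()` only triggers the missing part. All names are illustrative assumptions:

```python
class Model(object):
    def __init__(self):
        self.ops = []             # flattened topology, for this sketch only
        self.next_op_to_run = 0   # Python-side index, reset on every feed

    def feed(self, data):
        self.copy_input_to_scope(data)   # assumed helper
        self.next_op_to_run = 0          # new data invalidates old results

    def run_until(self, op_index):
        # Run only the ops that have not yet been executed for this data.
        while self.next_op_to_run <= op_index:
            self.ops[self.next_op_to_run].run()
            self.next_op_to_run += 1

    def get_var_value(self, expr):
        if self.next_op_to_run <= expr.op_index:      # not computed yet
            self.run_until(expr.op_index)
        return self.find_var_in_scope(expr.var_name)  # assumed helper
```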
> Note that Model does not implement any particular layer; for example, `fc_layer` is implemented as a global function, not inside the Model class.

> Also, the device information in a Model defaults to empty. The user can set the device at the very beginning, or set it right before running. Each Model is bound to exactly one device. Once a Model has been run (i.e. `init_param()` or `run()` has been called), its device can no longer be changed.
Setting the device place should be a fairly simple operation. Why set the device at the Model level? A net should be able to contain ops on different devices, and setting the device per op should be more flexible. See `tf.device`:

```python
import tensorflow as tf

# sub-graph on gpu 0
with tf.device('/gpu:0'):
    a = tf.constant(1.0)
    b = tf.constant(2.0)
    c = tf.add(a, b)
# sub-graph on gpu 1
with tf.device('/gpu:1'):
    d = tf.add(a, c)
# sub-graph on gpu 2
with tf.device('/gpu:2'):
    e = tf.add(c, d)

session = tf.Session()
# a subgraph executed across 3 devices: gpu0, gpu1, gpu2
e = session.run([e])
```

This way the device can be set flexibly; the Model does not need to manage the topology (it can be generated automatically), nor does it need to set a device (each op can set its own device independently).
@Superjom We are not considering the multi-device case for now.
Even if we did, we probably would not design it the way TF does. It could be split into two cases:
- Data parallelism: wrap a MultiDeviceModel on top of Model, with one independent Model per device (a rough sketch follows at the end of this thread).
- Model parallelism: each part of the model is an independent Model, with a ModelParalleler on top to run them.
Setting the device place is not a simple operation. In particular, moving a Tensor or Op whose device has already been set to another device is very troublesome. Also, we still need to think about how to handle communication between devices.
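A rough sketch of the data-parallel option mentioned above (one single-device Model per replica, split the batch; gradient aggregation is omitted). None of these classes or helpers exist yet; they are assumptions for illustration:

```python
class MultiDeviceModel(object):
    """Hypothetical wrapper: one single-device Model per replica."""

    def __init__(self, build_fn, devices):
        # build_fn configures the same topology into a fresh Model;
        # each replica is bound to exactly one device, as the design requires.
        self.models = []
        for dev in devices:
            m = Model(device=dev)   # assumed constructor taking a device
            build_fn(m)
            self.models.append(m)

    def run(self, batch):
        # Split the mini-batch evenly and run each shard on its own replica.
        n = len(self.models)
        shards = [batch[i::n] for i in range(n)]
        return [m.run(data=shard) for m, shard in zip(self.models, shards)]
```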
> Also, the device information in a Model defaults to empty. The user can set the device at the very beginning, or set it right before running. Each Model is bound to exactly one device. Once a Model has been run (i.e. `init_param()` or `run()` has been called), its device can no longer be changed.

> `Model.run()` can accept an ending `Expression`; in that case Paddle runs the subgraph of the model that is needed to compute this Expression. If none is given, the whole Model is run. The subgraph extraction can be cached. The extraction algorithm is described as follows:
"In that case Paddle runs the subgraph of the model that is needed to compute this Expression"
This would read more smoothly as:
"Paddle looks up all dependencies of this Expression, forms a subgraph from them, and runs that subgraph."
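To pin down the quoted extraction step, a small sketch using the reviewer's wording: walk backwards from the op that produces the target Expression, collect every op it depends on, and cache the result per target. The data structures (`model.ops`, `op.inputs()`, `op.outputs()`) are assumptions for illustration:

```python
_SUBGRAPH_CACHE = {}   # (model id, target variable) -> list of needed ops


def extract_subgraph(model, target_expr):
    key = (id(model), target_expr.var_name)
    if key in _SUBGRAPH_CACHE:            # the extraction can be cached
        return _SUBGRAPH_CACHE[key]

    needed_vars = {target_expr.var_name}
    needed_ops = []
    # Walk the op list backwards, starting from the op producing the target.
    for op in reversed(model.ops[:target_expr.op_index + 1]):
        if needed_vars & set(op.outputs()):     # produces something we need
            needed_ops.append(op)
            needed_vars |= set(op.inputs())     # so we also need its inputs
    needed_ops.reverse()                        # restore execution order

    _SUBGRAPH_CACHE[key] = needed_ops
    return needed_ops
```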
> Similarly, the model's parameters also need to be converted into Expressions via `parameter`. The approach is as follows:

```python
def parameter(name, dim, attr, model=None):
```
It seems `dim` here is not used at all?
```python
    # This line could be changed by attr, not only uniform is supported.
    # just a demo here
    model.init_net.create_and_add_op("uniform_random", **attr)
```
Shouldn't the name of the parameter being initialized be specified here, i.e. the `name` argument?
Right, these parameter names should be put into `attr`.
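Putting the two points from this thread together (`dim` should actually be used, and the parameter name goes into `attr`), the snippet might end up roughly like the sketch below; the surrounding Model/Expression API is still hypothetical:

```python
def parameter(name, dim, attr, model=None):
    if model is None:
        model = default_model()     # assumed global default model
    attr = dict(attr or {})
    attr["name"] = name             # per the reply above: name goes into attr
    attr["dims"] = dim              # per the comment above: dim must be used
    # Only uniform_random is shown; attr could select another initializer.
    model.init_net.create_and_add_op("uniform_random", **attr)
    # The parameter is created by init_net, so there is no op index in the
    # main topology; the Expression just names the variable.
    return Expression(model, op_index=-1, var_name=name)
```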