fluid distribute doc #9288

seiriosPlus · 2018-03-21T07:11:39Z

Yancey1989 · 2018-03-21T07:14:08Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+exit(1)
+```
+
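+We created a simple fully connected neural network program and ran it for 100 iterations with fluid's Executor. Now we need to update this non-distributed version of the program into a distributed version.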


non-distributed version

==>

single-machine version

Yancey1989 · 2018-03-21T07:15:48Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

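+Startup order: start all PSERVERs (Parameter Servers) first, then start the TRAINERs (Trainers).
+**Note: training_role distinguishes the role of the service being started and is used inside the training program; users may define it as needed. The other parameters are required by the transpile function of fluid.DistributeTranspiler and must be defined before that function is called. How they are passed in from the external environment is up to the user.**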
+
+### DEMO


DEMO

=>

Demo

Yancey1989 · 2018-03-21T07:16:11Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

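+**Note: training_role distinguishes the role of the service being started and is used inside the training program; users may define it as needed. The other parameters are required by the transpile function of fluid.DistributeTranspiler and must be defined before that function is called. How they are passed in from the external environment is up to the user.**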
+
+### DEMO
+The complete demo code is located in [book](https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_fit_a_line.py) under fluid's test directory.


fluid => Fluid

Yancey1989 · 2018-03-21T07:18:37Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+```python
+optimize_ops, params_grads = sgd_optimizer.minimize(avg_cost) 
+```
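+Putting the Distributed Transpiler, the optimization operators  and the gradient functions together in one piece of code looks like this:


the optimization operators and the gradient functions together

Remove the extra space.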

Yancey1989 · 2018-03-21T07:20:54Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+        exe.run(t.get_trainer_program())
+```
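+### Running the distributed training script
+Running a distributed job requires several parameters to be specified externally:


Running a distributed job requires several parameters to be specified externally

This is not very clear. Does it mean parameters passed through environment variables at run time, or the parameters that DistributeTranspiler requires?

Done. Revised, with parameter descriptions and a code usage example added.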

Yancey1989 · 2018-03-21T11:33:41Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

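+* An available cluster
+
+    A cluster containing one or more compute nodes. Every node can run PaddlePaddle training jobs and has a unique IP address, and all compute nodes in the cluster can reach each other over the network.
+* Install the PaddlePaddle Fluid with Distribute version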


PaddlePaddle Fluid with Distribute

PaddlePaddle Fluid with Distribution

Yancey1989 · 2018-03-21T11:36:50Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+```
+
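+## Updating the training script
+Here we take fit a line, the first chapter of the [Deep Learing 101](http://www.paddlepaddle.org/docs/develop/book/01.fit_a_line/index.html) course, as an example.


"... as an example" should be followed by what is actually being done, for example:
describe how to convert a single-machine training script into a version that supports cluster training.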

Yancey1989 · 2018-03-21T11:39:46Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

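+* An available cluster
+
+    A cluster containing one or more compute nodes. Every node can run PaddlePaddle training jobs and has a unique IP address, and all compute nodes in the cluster can reach each other over the network.
+* Install the PaddlePaddle Fluid with Distributed version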


Distributed
=>

Distribution

Yancey1989 · 2018-03-21T11:40:45Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+
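+We created a simple fully connected neural network program and ran it for 100 iterations with Fluid's Executor. Now we need to update this single-machine version of the program into a distributed version.
+### Introducing the Parameter Server
+In the non-distributed version of the training script there is only one role, the Trainer, which handles the regular computation as well as the computation and storage related to parameters. In distributed training, multiple Trainer nodes perform the same data computation, so a centralized node is needed to handle the storage and distribution of parameters in a unified way. In PaddlePaddle we call such a node the Parameter Server; see the [Parameter Server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/fluid/design/dist_train/parameter_server.md).


so a centralized node is needed to handle the storage and distribution of parameters in a unified way

This short description of the PServer is not quite accurate: in Fluid, the PServer is also responsible for optimizing the parameters.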
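To make the reviewer's point concrete, here is a toy sketch of a parameter server that both stores the parameters and performs the optimization step. It is not Fluid's implementation: the class `ToyParameterServer`, its `pull`/`push` methods, and the choice to average the trainers' gradients before a plain SGD step are all assumptions made for illustration.

```python
class ToyParameterServer:
    """Hypothetical stand-in: keeps parameters centrally and applies SGD updates."""

    def __init__(self, params, learning_rate=0.1):
        self.params = dict(params)
        self.learning_rate = learning_rate

    def pull(self):
        # Trainers fetch a copy of the current parameters before computing.
        return dict(self.params)

    def push(self, grads_from_trainers):
        # Assumption: average the gradients from all trainers, then take one
        # SGD step. The optimization happens on the server, not the trainers.
        n = len(grads_from_trainers)
        for name in self.params:
            avg_grad = sum(g[name] for g in grads_from_trainers) / n
            self.params[name] -= self.learning_rate * avg_grad


server = ToyParameterServer({"w": 1.0}, learning_rate=0.5)
# Two trainers computed different gradients for "w" on their own data.
server.push([{"w": 0.5}, {"w": 1.5}])
```

In this sketch the trainers only compute gradients; the update rule lives on the server, which is the responsibility the comment says the description should mention.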

Yancey1989 · 2018-03-21T11:43:43Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

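+| server_endpoint | str | IP:PORT of the service node being started | 127.0.0.1:8789 |
+| training_role | str | Node role, TRAINER/PSERVER | PSERVER |
+
+**Note: training_role distinguishes the role of the service being started and is used inside the training program; users may define it as needed. The other parameters are required by the transpile function of fluid.DistributeTranspiler and must be defined before that function is called. How they are passed in from the external environment is up to the user.**


Note: training_role distinguishes the role of the service being started. In the training program, users may define it as needed. The other parameters are those required by the Distribute Transpiler. An example follows: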
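Since training_role is defined by the user rather than by Fluid, the training script has to branch on it itself. A minimal sketch of that branching, assuming the two role names from the table above; `run_by_role` and the placeholder callables are hypothetical and not part of Fluid:

```python
def run_by_role(training_role, run_pserver, run_trainer):
    """Call the branch that matches the role of the current process."""
    if training_role == "PSERVER":
        return run_pserver()
    if training_role == "TRAINER":
        return run_trainer()
    # Fail loudly on a typo instead of silently starting the wrong role.
    raise ValueError("unknown training_role: %r" % (training_role,))


# Placeholder callables; a real script would run the transpiled programs here.
result = run_by_role(
    "PSERVER",
    run_pserver=lambda: "pserver started",
    run_trainer=lambda: "trainer started",
)
```

In a real script the two callables would wrap the programs obtained from the transpiler for the parameter server and for the trainer respectively.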

Yancey1989 · 2018-03-21T11:44:02Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+    pserver_startup = t.get_startup_program(server_endpoint, pserver_prog)
+```
+
+### Startup order


This line can be removed now; Fluid places no requirement on startup order.

Yancey1989 · 2018-03-21T11:44:30Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+```
+cd /paddle/python/paddle/fluid/tests/book
+```
+Step 1: start the Parameter Server. The command to start the Parameter Server:


Step 1: start the Parameter Server, using the following command as a reference:

Yancey1989 · 2018-03-21T11:44:58Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+cd /paddle/python/paddle/fluid/tests/book
+```
+Step 1: start the Parameter Server. The command to start the Parameter Server:
+```


Bash commands can be formatted with ```bash ....```.

Yancey1989 · 2018-03-21T11:45:38Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

+```
+PADDLE_INIT_PORT=6174 PADDLE_INIT_PSERVERS=192.168.1.2 TRAINERS=2 POD_IP=192.168.1.2 PADDLE_INIT_TRAINER_ID=1 TRAINING_ROLE=PSERVER python test_fit_a_line.py
+```
+After running the command, wait for the prompt: ```Server listening on 192.168.1.2:6174 ```


Wait for the prompt ......, which indicates that the Parameter Server has started normally.
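The launch command in this hunk passes everything through environment variables. The following sketch shows one way a training script could turn those variables into the values the transpiler needs. The variable names are taken from the command above; the helper `read_cluster_config`, the dictionary keys, and every default value are assumptions for illustration.

```python
import os


def read_cluster_config(env=os.environ):
    """Collect cluster settings from environment variables (hypothetical helper)."""
    port = env.get("PADDLE_INIT_PORT", "6174")
    # PADDLE_INIT_PSERVERS is assumed to be a comma-separated list of IPs.
    pserver_ips = env.get("PADDLE_INIT_PSERVERS", "")
    pserver_endpoints = ",".join(
        ip + ":" + port for ip in pserver_ips.split(",") if ip
    )
    return {
        "pserver_endpoints": pserver_endpoints,  # ip:port,ip:port,...
        "trainers": int(env.get("TRAINERS", "1")),
        "current_endpoint": env.get("POD_IP", "127.0.0.1") + ":" + port,
        "trainer_id": int(env.get("PADDLE_INIT_TRAINER_ID", "0")),
        "training_role": env.get("TRAINING_ROLE", "TRAINER"),
    }


# The same values as the Parameter Server launch command in the diff.
config = read_cluster_config({
    "PADDLE_INIT_PORT": "6174",
    "PADDLE_INIT_PSERVERS": "192.168.1.2",
    "TRAINERS": "2",
    "POD_IP": "192.168.1.2",
    "PADDLE_INIT_TRAINER_ID": "1",
    "TRAINING_ROLE": "PSERVER",
})
```

With these inputs, `config["current_endpoint"]` matches the address in the `Server listening on` prompt quoted in the diff.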

Yancey1989 · 2018-03-21T11:47:40Z

doc/fluid/howto/cluster/fluid_cluster_train_cn.md

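+    A cluster containing one or more compute nodes. Every node can run PaddlePaddle training jobs and has a unique IP address, and all compute nodes in the cluster can reach each other over the network.
+* Install the PaddlePaddle Fluid with Distribution version
+
+    The distributed version of PaddlePaddle must be installed on all compute nodes. Machines that use GPUs or similar devices also need the corresponding drivers and CUDA libraries installed.


A blank line is needed here; otherwise this will be rendered together with the previous line.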

Yancey1989

LGTM, thanks for porting the Chinese usage.

seiriosPlus added 7 commits March 21, 2018 11:35

fluid_cluster_train_cn_doc

e438926

fluid_cluster_train_cn_doc

85db0ae

fluid_cluster_train_cn_doc

7aa48de

fluid_cluster_train_cn_doc

34b7fc7

fluid_cluster_train_cn_doc

5d212da

fluid_cluster_train_cn_doc

b3962a9

fluid_cluster_train_cn_doc

50e8251

seiriosPlus requested a review from Yancey1989 March 21, 2018 07:11

Yancey1989 reviewed Mar 21, 2018


seiriosPlus added 4 commits March 21, 2018 16:08

fluid_cluster_train_cn_doc

529878b

fluid_cluster_train_cn_doc

a6b8496

fluid_cluster_train_cn_doc

d42187d

fluid_cluster_train_cn_doc

4ccfc04

Yancey1989 reviewed Mar 21, 2018


seiriosPlus added 4 commits March 21, 2018 19:57

fluid_cluster_train_cn_doc

89b9788

fluid_cluster_train_cn_doc

55a5583

fluid_cluster_train_cn_doc

f5eaa32

fluid_cluster_train_cn_doc

b577277

Yancey1989 approved these changes Mar 21, 2018


seiriosPlus merged commit 43fac87 into PaddlePaddle:develop Mar 21, 2018
