
Design doc: master server #1953

Merged: 5 commits merged into PaddlePaddle:develop on May 12, 2017
Conversation

@helinwang (Contributor) commented May 2, 2017

Maybe it is easier to review here.

@@ -21,7 +21,7 @@

### File Preprocessing

Before a dataset can be used for training, the files need to be converted into PaddlePaddle's internal cluster storage format (SSTable) first. We provide two conversion methods:
Before a dataset can be used for training, the files need to be converted into PaddlePaddle's internal cluster storage format (RecordIO) first. We provide two conversion methods:

- A library for local conversion is provided to users, who can write a program to perform the conversion.
Collaborator:

  1. Users convert the data locally and then upload it
  2. Users upload the data first, then run the conversion program on the cluster

Contributor (Author):

Done.

@@ -0,0 +1,89 @@
# Design Doc: Master Process

For an overview of the master process's role, please refer to the [distributed training design doc](./README.md). In this design doc we will discuss the master process in more detail. The master will be implemented in [golang](https://golang.org/).
Collaborator:

in golang ==> in Go

Contributor (Author):

Done


<img src="src/dataset.png"/>

A dataset is represented by a list of files in *RecordIO* format on the distributed filesystem, each RecordIO file consists of multiple *blocks*, and each block has multiple data instances.
Collaborator:

A dataset is represented by a list of files in RecordIO format on the distributed filesystem,

==>

A dataset is a list of files in RecordIO format.

A dataset is itself a list of files, not representationally.

Collaborator:

each RecordIO file consists of multiple blocks, and each block has multiple data instances.

==>

A RecordIO file consists of chunks, whereas each chunk consists of some records.

It is chunks, not blocks. And RecordIO files consist of records; it is us/PaddlePaddle who treat records as data instances.

Contributor (Author):

Done.
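To make the terminology above concrete, here is a minimal, hypothetical sketch (in Go, since the master is written in Go) of how chunks and tasks could be described; the type and field names are illustrative, not the design doc's actual definitions.

```go
package dataset

// Chunk identifies one chunk inside a RecordIO file on the distributed
// filesystem. Names and fields are illustrative only.
type Chunk struct {
	Path  string // path of the RecordIO file
	Index int    // index of the chunk within that file
}

// Task is the master's unit of dispatch: a set of chunks handed to one
// trainer. A dataset is simply all chunks of all its RecordIO files.
type Task struct {
	ID     int
	Chunks []Chunk
}
```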


### Task Queue Creation

1. Each trainer will make an RPC call (using [golang rpc](https://golang.org/pkg/net/rpc/)) to the master process, telling it the RecordIO files representing the dataset specified by the user. Since every trainer will tell the master process the same dataset, only the first RPC call will be honored.
Collaborator:

golang => Go's rpc package

Contributor (Author):

Done.


The RPC interface is:
```go
func (m *RPCServer) ReportDataset(Paths []string, dummy *int) error {
```
Contributor:

Can we define the "dataset" as a list of files under the same path? Then we may not need `Paths []string` here.

@Yancey1989 (Contributor) commented May 6, 2017:

How about changing `Paths []string` to `Path string`, such as `/home/random_images-*-of-*`?


@Yancey1989 please see #1953 (comment) - Helin

Contributor (Author):

The client needs to implement the parsing logic for wildcards anyway: users can train on RecordIO files locally on their own computers, and there is no master when training locally.
Since the client already implements it, I recommend that we do not implement it again in the server by changing `Paths []string` to `Path string`.
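To illustrate the "only the first RPC call will be honored" behavior with Go's net/rpc, here is a minimal sketch; the RPCServer fields, the duplicate-call check, and the listen address are assumptions for illustration, not the actual implementation.

```go
package main

import (
	"net"
	"net/rpc"
	"sync"
)

type RPCServer struct {
	mu       sync.Mutex
	reported bool
	paths    []string
}

// ReportDataset matches net/rpc's required method signature. Every trainer
// reports the same dataset, so only the first call is honored.
func (m *RPCServer) ReportDataset(Paths []string, dummy *int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.reported {
		return nil // ignore duplicate reports from other trainers
	}
	m.paths = append([]string(nil), Paths...)
	m.reported = true
	// Task queue creation from the reported RecordIO files would start here.
	return nil
}

func main() {
	if err := rpc.Register(&RPCServer{}); err != nil {
		panic(err)
	}
	l, err := net.Listen("tcp", ":8080") // illustrative address
	if err != nil {
		panic(err)
	}
	rpc.Accept(l) // serve trainer connections
}
```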


The task queues need to be persisted on [etcd](https://github.com/coreos/etcd) for fault recovery. Since the task queues only change once a task is completed or timed out, which is not very frequent, we can afford to synchronize with etcd every time the task queues change.

We will serialize the task queues data structure with [gob encoding](https://golang.org/pkg/encoding/gob/), compress with gzip, and save into etcd synchronously under key `/task_queues`.
Contributor:

Since there are two copies of the task queue data, one in etcd and one in the master process's memory, the task queue guarantees "at least once" data dispatch. That means there will be no data loss, but some data may be replayed to trainers.

This should be mentioned.

Contributor (Author):

I thought the "at least once" data dispatch is due to the possibility of a task timeout, where the master dispatches the task again for a retry. Can you explain more about why having "2 copies of task queue data" will cause "at least once" data dispatch?

@typhoonzero (Contributor) commented May 7, 2017:

Yes, you are right! Timeouts bring the "at least once" guarantee. But the "2 copies of task queue data" can also make it possible.

The master process may encounter:

```
def dispatch_task():
    task = taskQueues.Todo.dequeue()
    taskQueues.Pending[task] = new taskState
    # if master goes down here
    taskQueues.writeToEtcd()
```

If the master goes down before syncing the data to etcd, then when the master is restarted by Kubernetes and loads the queue data from etcd, it will dispatch the task again.

@helinwang (Contributor, Author) commented May 9, 2017:

@typhoonzero The trainer's RPC call will fail in that case, and it will retry after the master restarts. When the master restarts, the tasks are all in the todo queue, so none are re-dispatched. And when the trainer retries, the task will be marked done. So I think in this case the "2 copies of task queue data" problem will not happen.

Contributor:

Got it!

So are you adding a note about "at least once" at the end of the "timeout" section?

Contributor (Author):

Good idea! Done.
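As a rough sketch of the serialize-compress-save step quoted above (gob encoding, gzip, synchronous write to etcd under `/task_queues`), assuming the etcd clientv3 client and an illustrative taskQueues layout; the exact import path and struct fields depend on the etcd version and the real implementation.

```go
package persist

import (
	"bytes"
	"compress/gzip"
	"context"
	"encoding/gob"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// taskQueues is an illustrative stand-in for the master's real queue type.
type taskQueues struct {
	Todo    []int
	Pending map[int]int64 // task ID -> dispatch time (illustrative)
	Done    []int
}

// saveTaskQueues gob-encodes the queues, gzips the result, and writes it
// synchronously to etcd under the key /task_queues.
func saveTaskQueues(cli *clientv3.Client, q taskQueues) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := gob.NewEncoder(zw).Encode(q); err != nil {
		return err
	}
	if err := zw.Close(); err != nil { // flush the gzip stream
		return err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_, err := cli.Put(ctx, "/task_queues", buf.String())
	return err
}
```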


```python
# ...
reader = paddle.reader.creator.SSTable("/home/random_images-*-of-*")
reader = paddle.reader.creator.RecordIO("/home/random_images-*-of-*")
```
Contributor:

If this is a directory in the cloud, wouldn't it be clearer to change it to "pfs://home/random_images-*-of-*"?

@helinwang (Contributor, Author) commented May 7, 2017:

I think pfs:// only has to be a concept for the fileserver client, to distinguish between remote and local.
Since we always mount the user's home directory in the same place, users can just think of it as a local directory. I think there is no need to add the concept of pfs for users to understand. More concepts mean a steeper learning curve for users.

Collaborator:

I second both of you, @helinwang and @Yancey1989, that from the perspective of a user program running in a Pod, it is only I/O with the local filesystem, as

  1. the home directory should have been mapped to the Pod-local directory /home, and
  2. some shared directories, e.g., the pre-downloaded paddle.v2.dataset data, should have been mapped to the Pod-local directory /common,

and from the perspective of our client tool paddle, it has to refer to files in the distributed filesystem in a special format. But

  • I don't prefer pfs:///home/$USER/cifa/..., because if we are going to be compatible with the URL standard, we need 3 slashes in the above URL.
  • Instead, I prefer /pfs/$DATACENTER/home/$USER/cifa/..., which has been used at Google.

Contributor (Author):

@gongweibao Please take a look at @wangkuiyi's comment; /pfs/ seems better than pfs:// when using the command line tool.

Contributor:

Agree with /pfs, thanks @wangkuiyi @helinwang!

@gongweibao (Contributor) commented May 9, 2017:

There is one thing I don't quite understand:

pfs:///home/$USER/cifa/...

  • In our commands, for example mv:
    mv [OPTION]... <LocalPath> <PFSPath> or <PFSPath> <LocalPath> or <PFSPath> <PFSPath>

  If we use /pfs and a user happens to have a local directory with the same name, we cannot tell whether a path is local or remote.

@helinwang (Contributor, Author) commented May 9, 2017:

@gongweibao I think we do not need to worry about that; hardcode it so that any path starting with /pfs/ is treated as remote.
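A tiny sketch of the convention agreed on here: hardcode that any path beginning with /pfs/ refers to the distributed filesystem, everything else is local. The function name is hypothetical.

```go
package pfs

import "strings"

// isRemotePath reports whether a path refers to the distributed filesystem,
// using the hardcoded /pfs/ prefix convention discussed above.
func isRemotePath(p string) bool {
	return strings.HasPrefix(p, "/pfs/")
}

// Example: isRemotePath("/pfs/datacenter1/home/user/data") == true,
//          isRemotePath("/home/user/data") == false.
```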



### Task Retry Logic

When a task is dispatched to the trainer, the master will schedule a function for execution after the timeout duration (based on the moving average of task completion time). If the task entry is still in the pending queue, its timeout counter will increase by one, and the task will be moved to the todo queue. If the timeout counter is above the threshold, the master will log the error and discard the task.
Contributor:

For the timeout, there are at least two possible cases: a network split and a slow task. For the first case, Kubernetes will recover the trainer process; for the second, the master process should tell the trainer to stop the task, so does the trainer also need an RPC interface?

Contributor (Author):

How about just allowing the trainer to continue working on the slow task :) (we can tolerate a task being trained twice, since SGD is a stochastic algorithm).

Contributor:

If a task is trained twice, the DONE queue will have many duplicate tasks. Should the DONE queue use `map[int]TaskEntry` instead of `[]TaskEntry`?

Contributor (Author):

@Yancey1989 Good question! I think we can implement it as follows: when receiving a "task done" message, check whether the task is in the pending queue; if not, just ignore it. Then there will not be any duplicate tasks inside the done queue.

Contributor:

Got it!
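To tie the retry discussion together, here is a rough Go sketch of the timeout and "ignore if not pending" bookkeeping; the types, the fixed retry threshold, and the timeout parameter are assumptions for illustration (the real master would derive the timeout from a moving average of task completion time).

```go
package master

import (
	"log"
	"sync"
	"time"
)

type task struct {
	id       int
	timeouts int
}

type master struct {
	mu        sync.Mutex
	todo      []task
	pending   map[int]task
	done      []task
	threshold int // max allowed timeouts before a task is discarded
}

func newMaster(threshold int) *master {
	return &master{pending: make(map[int]task), threshold: threshold}
}

// dispatch moves a task from todo to pending and schedules a timeout check.
func (m *master) dispatch(timeout time.Duration) (task, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if len(m.todo) == 0 {
		return task{}, false
	}
	t := m.todo[0]
	m.todo = m.todo[1:]
	m.pending[t.id] = t
	time.AfterFunc(timeout, func() { m.onTimeout(t.id) })
	return t, true
}

// onTimeout requeues a still-pending task, or discards it after too many
// timeouts; tasks that already finished are left alone.
func (m *master) onTimeout(id int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.pending[id]
	if !ok {
		return // already finished, nothing to do
	}
	delete(m.pending, id)
	t.timeouts++
	if t.timeouts > m.threshold {
		log.Printf("task %d timed out too many times, discarding", id)
		return
	}
	m.todo = append(m.todo, t)
}

// taskFinished ignores reports for tasks that are no longer pending, so the
// done queue never holds duplicates even if a task was trained twice.
func (m *master) taskFinished(id int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	t, ok := m.pending[id]
	if !ok {
		return
	}
	delete(m.pending, id)
	m.done = append(m.done, t)
}
```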

@helinwang changed the title from "Design doc: master process" to "Design doc: master server" on May 9, 2017
@wangkuiyi (Collaborator) left a comment:

LGTM

@helinwang merged commit 4764303 into PaddlePaddle:develop on May 12, 2017
@helinwang deleted the master_design branch on May 12, 2017 18:10