-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add TaskFail interface #2719
add TaskFail interface #2719
Conversation
} | ||
|
||
type taskQueues struct { | ||
Todo []taskEntry | ||
Pending map[int]taskEntry // map from task ID to task entry | ||
Done []taskEntry | ||
Failed []Task | ||
Failed []taskEntry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
把Task
改成了taskEntry
的原因是,我觉得Failed task
应该保留进入错误队列时候的上下文状态。
go/master/service.go
Outdated
NumTimeout int | ||
Task Task | ||
NumFailed int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can remove NumTimeout
, we can treat timeout as failure. We can add NumTimeout back if we really need it. Let's keep the system simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
go/master/service.go
Outdated
@@ -34,29 +34,30 @@ type Chunk struct { | |||
// Task is the basic unit of data instances assigned to trainers. | |||
type Task struct { | |||
ID int | |||
Epoch int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Task分发给Trainer的一个要做的事情,Trainer貌似不需要知道这是第几个Epoch。
我大概明白为什么需要Epoch,为的是唯一标示一个Task,不然多个机器可能会有同一个Task的不同Epoch,伟宝考虑的周到!
这样看来的话,是不是考虑每个Task分发出去的时候就给一个新的ID就好?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我添加了一个ISSUE:#2752
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
感觉用全局唯一的ID表示Task就可以了,最终可以统计这个task被执行过几次,哪些成功哪些失败了。Epoch用来对timeout计数可以放在taskEntry
中就可以了。
任务结束时,就可以统计task的成功和失败的情况。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
在ISSUE中讨论这个问题吧。已经回了。
go/master/service.go
Outdated
@@ -91,11 +92,11 @@ func partition(chunks []Chunk, chunksPerTask int) []taskEntry { | |||
} | |||
|
|||
// NewService creates a new service. | |||
func NewService(store Store, chunksPerTask int, timeoutDur time.Duration, timeoutMax int) (*Service, error) { | |||
func NewService(store Store, chunksPerTask int, timeoutDur time.Duration, failortimeoutMax int) (*Service, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failortimeoutMax改成failureMax吧(前面有个comment提到删掉NumTimeout)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/master/service.go
Outdated
@@ -389,3 +396,23 @@ func (s *Service) TaskFinished(taskID int, dummy *int) error { | |||
} | |||
return err | |||
} | |||
|
|||
// TaskFailed tell the service that a task is failed. | |||
func (s *Service) TaskFailed(taskID int, epoch int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
一个"net/rpc"的函数需要第二个参数的类型是指针: https://golang.org/pkg/net/rpc/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Done.
go/master/service.go
Outdated
return err | ||
} | ||
|
||
s.checkTaskStatus(t, epoch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
从逻辑上来讲,上报task失败之后,再调用checkTaskStatus
(检查task状态)有点奇怪,都已经失败了,为啥还需要检查?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkTaskStatus
实际做的是如果任务超时,从pending删除加到TODO中。这里我理解如果是client调用TaskFailed
应该不需要调用checkTaskStatus
,而直接从pending删除,加到Failed队列中。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gongweibao 解释了需要重试的原因。明白了,这里的确需要调用checkTaskStatus
重试几次task后fail掉
I may not have internet connection in the following few days. Don't want to block this PR from being merged.
go/master/service.go
Outdated
return err | ||
} | ||
|
||
s.checkTaskStatus(t, epoch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkTaskStatus
实际做的是如果任务超时,从pending删除加到TODO中。这里我理解如果是client调用TaskFailed
应该不需要调用checkTaskStatus
,而直接从pending删除,加到Failed队列中。
go/master/service.go
Outdated
return err | ||
} | ||
|
||
s.checkTaskStatus(t, epoch) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gongweibao 解释了需要重试的原因。明白了,这里的确需要调用checkTaskStatus
重试几次task后fail掉
|
||
t.NumTimeout++ | ||
if t.NumTimeout > s.timeoutMax { | ||
log.Warningf("Task %v timed out %d times, discard.", t.Task, t.NumTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里可能也会被failed调用,所以不一定都是time out,可以用泛化点的描述。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -348,9 +354,9 @@ func (s *Service) GetTask(dummy int, task *Task) error { | |||
} | |||
|
|||
*task = t.Task |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete unused *task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这行是返回值,是有用的。:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
根据#2752 (comment) ,Master是不是要加一个定期检查Pending队列的Task是不是超出执行时间的机制呢,有可能由于网络原因,Trainer永远无法汇报自己的Task是不是失败,这样的话会导致Task永远在Pending队列里了。
@Yancey1989 现在已经实现这样的机制。 |
ISSUE #2752 已经回复了。 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
go/master/service.go
Outdated
@@ -257,6 +258,34 @@ func (s *Service) SetDataset(globPaths []string, dummy *int) error { | |||
return nil | |||
} | |||
|
|||
func (s *Service) procFailedTask(t taskEntry, epoch int) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
请问proc是什么意思?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/master/service.go
Outdated
@@ -34,27 +34,28 @@ type Chunk struct { | |||
// Task is the basic unit of data instances assigned to trainers. | |||
type Task struct { | |||
ID int | |||
Epoch int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change like this is more consistent with func (s *Service) TaskFailed(taskID TaskID, dummy *int) error
:
type TaskMeta struct {
ID int
Epoch int
}
type Task struct {
TaskMeta Meta
Chunks []Chunk
}
func (s *Service) TaskFailed(meta TaskMeta, dummy *int) error {
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/master/service.go
Outdated
@@ -348,9 +354,9 @@ func (s *Service) GetTask(dummy int, task *Task) error { | |||
} | |||
|
|||
*task = t.Task | |||
log.WithFields(s.logFields()).Infof("Task #%d dispatched.", task.ID) | |||
log.WithFields(s.logFields()).Infof("Task #%v dispatched.", t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A task contains []Chunks
, maybe that is too much information to print.
Perhaps change to log.WithFields(s.logFields()).Infof("Task #%v dispatched.", t.Task.Meta)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/master/service.go
Outdated
if !ok { | ||
err := errors.New("pending task not found") | ||
log.WithFields(s.logFields()).Warningln("TaskFailed:Pending task #%v not found.", taskID) | ||
return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return error here? I think it's normal if that task is no longer pending (completed by other workers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/master/client.go
Outdated
@@ -112,6 +112,11 @@ func (c *Client) taskFinished(taskID int) error { | |||
return c.conn.Call("Service.TaskFinished", taskID, nil) | |||
} | |||
|
|||
// TaskFailed tell the master server as task is failed. | |||
func (c *Client) taskFailed(taskID TaskID) error { | |||
return c.conn.Call("Service.TaskFinished", taskID, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "Service.TaskFailed" :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
汗。Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
加了测试用例。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed so many mistakes when reviewing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@typhoonzero No worries, that's why we have multiple developers reviewing, we make mistake :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
汗死。我应该加unit test
。盲目的自信是不可以有的。也不符合我们的做事方法。惭愧。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++
No description provided.