-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PServer recovery from checkpoint #2741
PServer recovery from checkpoint #2741
Conversation
go/pserver/etcd_client.go
Outdated
// GetCheckpoints gets the checkpoint information by the specified pserver id | ||
func (e *EtcdClient) GetCheckpointInfo(string idx) (string, error) { | ||
key := PsCheckpoint + idx | ||
ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(timeout)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is timeout
defined, does this code compiles? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry :), fixed it.
go/pserver/etcd_client.go
Outdated
@@ -186,3 +188,20 @@ func (e *EtcdClient) registerPserverEtcd(ctx context.Context) (int, error) { | |||
|
|||
return idx, nil | |||
} | |||
|
|||
// GetCheckpoints gets the checkpoint information by the specified pserver id | |||
func (e *EtcdClient) GetCheckpointInfo(string idx) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe rename this function to GetKey
, caller can compute and pass in the key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
UUID string `json:"uuid"` | ||
MD5 string `json:"md5"` | ||
Timestamp string `json:"timestamp"` | ||
State []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious why State and ParameterWithConfig does not have json tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkpoint
will be initialized from the JSON data in etcd, which only the the key uuid
, md5
and timestamp
, but according with your other comment, separate Checkpoint
and CheckpointInfo
is better way, thanks for @helinwang .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
MD5 string `json:"md5"` | ||
Timestamp string `json:"timestamp"` | ||
State []byte | ||
ParameterWithConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with your comment on the other PR, Checkpoint
need to contain checkpoints for all parameters on the pserver.
I was confused before. I think Checkpoint
should be renamed to CheckpointInfo
, it's actually a reference to the checkpoint file on distributed filesystem. The type of Timestamp
should be changed to int64
.
Perhaps something like:
type CheckpointInfo struct {
UUID string
Path string
MD5 string
Timestamp int64
}
Maybe we can have another type like:
type ParameterCheckpoint struct {
ParameterWithConfig
State []byte
}
type Checkpoint []ParameterCheckpoint
How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I think separate Checkpoint
and CheckpointInfo
is a good idea.I will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename ParameterCheckpoint
to CheckpointData
should be more clear. Then these two struct should be:
type CheckpointMeta struct {}
type CheckpointData struct {}
type Checkpoint struct {
meta CheckpointMeta
data CheckpointData
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, CheckpointMeta
looks like a appropriate name.
Maybe we don't need the struct Checkpoint
including meta
and data
? We can
Json.Umarshal
from etcd to initializeCheckpointMeta
.gob.Decode(...)
from the checkpoint file to initializeCheckpointData
I looks like more simple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@typhoonzero How do you think about this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed with @Yancey1989 . Do we need to separate meta
and data
in checkpoint ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree!
Do we need to separate meta and data in checkpoint ?
Maybe not, It's simple enouph to load meta and data into the struct separately.
go/pserver/service.go
Outdated
MD5 string `json:"md5"` | ||
Timestamp string `json:"timestamp"` | ||
State []byte | ||
ParameterWithConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename ParameterCheckpoint
to CheckpointData
should be more clear. Then these two struct should be:
type CheckpointMeta struct {}
type CheckpointData struct {}
type Checkpoint struct {
meta CheckpointMeta
data CheckpointData
}
go/pserver/service.go
Outdated
} | ||
|
||
// NewCheckpoint creates a new checkpoint. | ||
func NewCheckpoint(idx int, cpPath string, e *EtcdClient) (*Checkpoint, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to NewCheckpointFromFile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the offline discuss, I will add Load
and Save
method for the struct Checkponit
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
if err != nil { | ||
return nil, err | ||
} | ||
// TODO: create checkpoint from file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finish this TODO and then review.
I may not have internet connection in the following few days. Don't want to block this PR from being merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some part of the design is quite different.
Please review this PR when you in spare time.
https://github.com/PaddlePaddle/Paddle/pull/2716/files
go/pserver/service.go
Outdated
@@ -49,13 +54,95 @@ type Service struct { | |||
optMap map[string]*optimizer | |||
} | |||
|
|||
// CheckpointMeta saves the checkpoint information | |||
type CheckpointMeta struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkpointMeta is an inner type which only used by pserver, it should not be a exported type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/cmd/pserver/pserver.go
Outdated
s, err = pserver.NewService(idx) | ||
candy.Must(err) | ||
} else { | ||
s, err = pserver.NewServiceFromCheckpoint(idx, cp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unify these two interface may be better, we can pass an empty checkpoint in default. then we will have only one create entry and leave out the duplicate code.
s := &Service{
idx: idx,
}
s.optMap = make(map[string]*optimizer)
s.initialized = make(chan struct{})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thx!
go/pserver/service.go
Outdated
} | ||
|
||
// ParameterCheckpoint saves parameter checkpoint | ||
type ParameterCheckpoint struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as Meta
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
|
||
// ParameterCheckpoint saves parameter checkpoint | ||
type ParameterCheckpoint struct { | ||
ParameterWithConfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there in only type here without variable name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According with: https://golang.org/ref/spec#Struct_types, a field without a explicit field name is an anonymous field, so I can initialize the Param
filed of ParameterWithConfig
as ParameterCheckpoint.Param = xxx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get it. a brilliant feature of Go.
go/pserver/service.go
Outdated
|
||
dec := gob.NewDecoder(f) | ||
|
||
if err = dec.Decode(&cp.data); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check md5 value here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
} | ||
|
||
// Checkpoint saves all parameters' checkpoint | ||
type Checkpoint struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe these things should not be the member of Checkpoint
. They do not persist in permanent storage. I move them into Server's argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Timestamp string `json:"timestamp"` | ||
} | ||
|
||
// Checkpoint is the pserver shard persist in file | ||
type Checkpoint []ParameterCheckpoint | ||
type Checkpoint []parameterCheckpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am following @typhoonzero 's comment here for possible usage.
#2716 (comment)
But it is true, single parameterCheckpoint is useless to the user. It should be use immediately out of pserver's package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can fix this if something error for the later.
go/pserver/service.go
Outdated
} | ||
md5 := hex.EncodeToString(h.Sum(nil)) | ||
if md5 != cpMeta.MD5 { | ||
return nil, errors.New("md5 does match, load checkpoint failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to put the const error type together? see the line of L28
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/etcd_client.go
Outdated
@@ -189,6 +189,22 @@ func (e *EtcdClient) registerPserverEtcd(ctx context.Context) (int, error) { | |||
return idx, nil | |||
} | |||
|
|||
// GetKey gets the value by the specified key | |||
func (e *EtcdClient) GetKey(key string, timeout int) ([]byte, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is more clear if the type of timeout
is time.Duration
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
@@ -26,6 +27,8 @@ const ( | |||
AlreadyInitialized = "pserver already initialized" | |||
// Uninitialized is true if pserver not fully initialized | |||
Uninitialized = "pserver not fully initialized" | |||
// CheckpointUnmatched is true if the md5 of checkpoint file does matched the older | |||
CheckpointUnmatched = "checkpoint file does not matched the older" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe CheckpointMD5Failed = "checkpoint file MD5 validation failed"
would better express the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
@@ -26,6 +27,8 @@ const ( | |||
AlreadyInitialized = "pserver already initialized" | |||
// Uninitialized is true if pserver not fully initialized | |||
Uninitialized = "pserver not fully initialized" | |||
// CheckpointUnmatched is true if the md5 of checkpoint file does matched the older |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "is true" is not very appropriate in English grammar, it's string
type, "is true" is for bool
type. Since the content of these constants already express the meaning, maybe we don't need the comment?
Perhaps we can put a comment on the top of const
keyword:
// RPC error messages.
const (
AlreadyInitialized = "pserver already initialized"
Uninitialized = "pserver not fully initialized"
CheckpointUnmatched = "checkpoint file does not matched the older"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
type checkpointMeta struct { | ||
UUID string `json:"uuid"` | ||
Md5sum string `json:"md5sum"` | ||
MD5 string `json:"md5"` | ||
Timestamp string `json:"timestamp"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is not related to this PR.
Timestamp
should have type int64
, it's value should be from func (t Time) UnixNano() int64
.
Having string
here makes it not precise enough (精度不够). time.Now().String()
is only accurate to second: 2009-11-10 23:00:00 +0000 UTC
.
Can you add a TODO: change type of timestamp to int64
here? (or would be great if you could fix it 👍 ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
return nil, errors.New(CheckpointUnmatched) | ||
} | ||
|
||
dec := gob.NewDecoder(f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since io.Copy
already read through f
once, gob.NewDecoder(f)
requires to read from the beginning of the file, but f
have not been seek
to the beginning of the file. Does this actually work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's a bug...
Done.
go/pserver/service.go
Outdated
// endpoints specified. | ||
func NewService(idx int, seconds int, path string, client *EtcdClient, cp Checkpoint) (*Service, error) { | ||
// endpoints specified. It will recovery from checkpoint file if a exists a specified checkpoint. | ||
func NewService(idx int, interval int, path string, client *EtcdClient, cp *Checkpoint) (*Service, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it's not related to this PR. I know etcd uses int
to express the type of timeout, we should really use type time.Duration
for interval
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
@@ -186,13 +231,13 @@ func (s *Service) doCheckpoint() error { | |||
s.mu.Lock() | |||
defer s.mu.Unlock() | |||
|
|||
cp := make([]ParameterCheckpoint, 0, len(s.optMap)) | |||
cp := make([]parameterCheckpoint, 0, len(s.optMap)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this is not related to this PR.
This does not work. It creates a slice with length set to 0 and capacity set to len(s.optMap)
. access cp[0]
will crash the program since it's length is 0. Should use cp := make([]parameterCheckpoint, len(s.optMap))
here.
For example:
package main
func main() {
a := make([]int, 0, 10)
for i := 0; i < 10; i++ {
a[i] = i
}
}
Output:
panic: runtime error: index out of range
goroutine 1 [running]:
main.main()
/tmp/sandbox064034874/main.go:6 +0x20
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @helinwang , done.
go/cmd/pserver/pserver.go
Outdated
if err != nil { | ||
panic(err) | ||
log.Infof("Fetch checkpoint failed, %s\n", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe log.Errorf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/cmd/pserver/pserver.go
Outdated
if err != nil { | ||
panic(err) | ||
} | ||
s, err := pserver.NewService(idx, time.Duration(*checkpointInterval)*time.Second, *checkpointPath, e, cp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can change flag.Int("checkpoint-interval", 600, "save checkpoint per interval seconds")
to flag.Duration("checkpoint-interval", 600*time.Second, "save checkpoint interval")
, and not do type cast here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
|
||
cpMetajson, _ := json.Marshal(cpMeta) | ||
err = s.client.PutKey(filepath.Join(PsCheckpoint, strconv.Itoa(s.idx)), cpMetajson, 3) | ||
err = s.client.PutKey(filepath.Join(PsCheckpoint, strconv.Itoa(s.idx)), cpMetajson, time.Duration(3)*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change to 3*time.Second
. Constant in Go is well designed: https://blog.golang.org/constants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/pserver/service.go
Outdated
} | ||
} | ||
return s, nil | ||
} | ||
|
||
// InitParam initializes a parameter. | ||
func (s *Service) InitParam(paramWithConfigs ParameterWithConfig, dummy *int) error { | ||
func (s *Service) InitParam(paramWithConfigs *ParameterWithConfig, dummy *int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is passing pointer * ParameterWithConfig
necessary? Please see: https://github.com/golang/go/wiki/CodeReviewComments#pass-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, and here is also some useful documents about pass by value of pointer,
go/pserver/optimizer.go
Outdated
@@ -35,7 +35,7 @@ func cArrayToSlice(p unsafe.Pointer, len int) []byte { | |||
return (*[1 << 30]byte)(p)[:len:len] | |||
} | |||
|
|||
func newOptimizer(paramWithConfigs ParameterWithConfig, State []byte) *optimizer { | |||
func newOptimizer(paramWithConfigs *ParameterWithConfig, State []byte) *optimizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is passing pointer * ParameterWithConfig
necessary? Please see: https://github.com/golang/go/wiki/CodeReviewComments#pass-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM++ Thanks!!!
Fixed #2671
Related: Pserver save state #2716