checkpoint m2: pserver checkpoint about lookup table #11410
# Checkpoint_M2_Plan

Checkpoint directory structure: checkpoint_dir

## M1 goals

Support Save on the Trainer side; support Load on the Trainer/PServer side.

## M2 goals

Support Save/Load of the lookup table on the PServer side.

Design Doc

SAVE phase

LOAD phase
When trainer0 sends the message for all pservers to save parameters, the other trainers are still training. We need to make sure it is acceptable either for the other trainers to wait for the save (in sync mode) or for different pservers to save different versions of the parameters (in async mode).
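To make the concern concrete, here is a minimal toy simulation in plain Python. It is only a sketch under stated assumptions: the `PServer` class, its version counter, and both `checkpoint_*` functions are hypothetical and are not part of the PaddlePaddle API. It assumes that in sync mode no update lands while the pservers save, and that in async mode one update reaches every pserver between two consecutive saves.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PServer:
    """Hypothetical stand-in for a parameter server (not a Paddle class)."""
    name: str
    version: int = 0         # number of updates applied so far
    saved_version: int = -1  # version captured by the last checkpoint

    def apply_update(self) -> None:
        self.version += 1

    def save(self) -> None:
        self.saved_version = self.version


def checkpoint_sync(pservers: List[PServer]) -> Dict[str, int]:
    # Assumed sync behaviour: trainers wait, so no update interleaves
    # with the saves and every pserver captures the same version.
    for ps in pservers:
        ps.save()
    return {ps.name: ps.saved_version for ps in pservers}


def checkpoint_async(pservers: List[PServer]) -> Dict[str, int]:
    # Assumed async behaviour: the other trainers keep training, so an
    # update reaches every pserver before the next pserver saves.
    for ps in pservers:
        ps.save()
        for other in pservers:
            other.apply_update()
    return {ps.name: ps.saved_version for ps in pservers}


def make_cluster() -> List[PServer]:
    return [PServer(f"ps{i}", version=5) for i in range(3)]


sync_result = checkpoint_sync(make_cluster())
async_result = checkpoint_async(make_cluster())
print(sync_result)
print(async_result)
```

In this toy model the sync checkpoint is consistent across pservers, while the async checkpoint mixes three different parameter versions, which is the situation the comment asks us to confirm is acceptable.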
For the directory layout: if we keep only persistent variables in the folder, we can re-arrange the parameter assignments during dynamic scaling or recovery.
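As an illustration of why a flat folder of persistent variables helps, the sketch below recomputes the variable-to-pserver assignment for a different number of pservers. The function `assign_variables`, the round-robin policy, and the variable and endpoint names are all assumptions made for the example; the thread does not specify how assignments are actually computed.

```python
from typing import Dict, List


def assign_variables(var_names: List[str],
                     endpoints: List[str]) -> Dict[str, List[str]]:
    """Assign saved variables to pserver endpoints round-robin.

    Hypothetical helper: it depends only on the variable names found in
    the checkpoint folder, not on which pserver originally wrote them.
    """
    if not endpoints:
        raise ValueError("at least one pserver endpoint is required")
    assignment: Dict[str, List[str]] = {ep: [] for ep in endpoints}
    # Sort first so that every process computes the same assignment.
    for i, name in enumerate(sorted(var_names)):
        assignment[endpoints[i % len(endpoints)]].append(name)
    return assignment


# Illustrative variable names, as if listed from the checkpoint folder.
saved = ["fc_0.b_0", "fc_0.w_0", "fc_1.b_0", "fc_1.w_0"]

two = assign_variables(saved, ["ps0", "ps1"])
three = assign_variables(saved, ["ps0", "ps1", "ps2"])
print(two)
print(three)
```

Because the folder holds only the variables themselves, scaling from two to three pservers in this sketch needs nothing more than recomputing the mapping before each pserver loads its share.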
@panyx0718 |
It is an awesome suggestion. I will discuss how to design it with my colleagues.
Update Checkpoint directory structure:
@panyx0718 @seiriosPlus I think it's hard to make the pserver distributed table "elastic". When the table is large, rehashing or redistributing all keys takes too much time: the checkpoint is saved as files on a distributed filesystem, so we cannot randomly access the keys in those files, and reordering may take even longer than training on the daily incremental data. For a large distributed table, it's something of a "best practice" not to rehash it, but only to recover it when needed.
I agree. If rehashing is needed, we can do it offline with an independent tool before recovery, which makes the design much simpler. Most of the time, for a very large-scale sparse training task, the number of parameter servers is fixed.
Sounds good. I thought you wanted to support elastic scaling in the future.
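A rough idea of what such an offline tool would do is sketched below. Everything here is an assumption for illustration: the in-memory `Shard` type, the `reshard` function, and the `row_id % num_shards` placement rule are hypothetical, and a real tool would stream shard files from the distributed filesystem instead of holding them in dictionaries.

```python
from typing import Dict, List

# Hypothetical in-memory shard: lookup-table row id -> embedding row.
Shard = Dict[int, List[float]]


def reshard(old_shards: List[Shard], new_num_shards: int) -> List[Shard]:
    """Redistribute lookup-table rows across a new number of shards.

    Assumes rows are placed by `row_id % num_shards`; the actual hashing
    scheme used by the pservers is not stated in the thread.
    """
    if new_num_shards <= 0:
        raise ValueError("new_num_shards must be positive")
    new_shards: List[Shard] = [{} for _ in range(new_num_shards)]
    # One sequential pass over every old shard; no random access needed.
    for shard in old_shards:
        for row_id, row in shard.items():
            new_shards[row_id % new_num_shards][row_id] = row
    return new_shards


# Two old shards (row_id % 2) rewritten as three new shards (row_id % 3).
old = [{0: [0.1], 2: [0.2], 4: [0.4]}, {1: [0.3], 3: [0.5]}]
new = reshard(old, 3)
print([sorted(s) for s in new])
```

Running the redistribution as a separate pass keeps the recover path simple: each pserver only ever loads a shard that already matches the current pserver count.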
Checkpoint M2: Save lookup table on PServer.