"add part of trainer design doc" #2363
## Synchronize SGD
In synchronous SGD, a trainer needs to wait for the other nodes to finish training on every mini-batch, and does not go on to the next epoch if any node lags behind.
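The waiting behavior described above can be illustrated with a small, self-contained sketch. This is not Paddle's trainer code: it assumes trainers are plain Python threads, uses `threading.Barrier` as a stand-in for whatever cross-node synchronization the real system would use, and fits a toy model `y = w * x`. The names `sync_sgd`, `shards`, and `apply_update` are hypothetical.

```python
import threading


def sync_sgd(shards, lr=0.05, steps=50):
    """Fit y = w * x with synchronous SGD, one thread per data shard."""
    n = len(shards)
    state = {"w": 0.0}
    grads = [0.0] * n

    def apply_update():
        # Runs exactly once per step, after every trainer has arrived
        # at the barrier and before any of them is released.
        state["w"] -= lr * sum(grads) / n

    barrier = threading.Barrier(n, action=apply_update)

    def trainer(rank):
        data = shards[rank]
        for _ in range(steps):
            w = state["w"]
            # Gradient of the mean squared error on this trainer's shard.
            grads[rank] = sum(2 * (w * x - y) * x for x, y in data) / len(data)
            # Block until all trainers have finished this mini-batch.
            barrier.wait()

    threads = [threading.Thread(target=trainer, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["w"]
```

Because no trainer passes the barrier until all gradients are in, every trainer computes its next gradient from the same parameter value, which is the defining property of the synchronous scheme.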
Here are some terms we usually use:
- step: one forward-backward step, which computes the gradient.
- mini-batch: several data instances used in a single step.
- task: multiple mini-batches; the master server assigns tasks to trainers.
- pass: all training data, consisting of multiple tasks.
- epoch: start of a new pass.
In this line, "epoch" is used together with "mini-batch"; I think by "epoch" you actually mean "step"?
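To make the nesting of these terms concrete, here is a minimal sketch under stated assumptions: the dataset is an in-memory list, and `make_tasks`, `run_pass`, and `step_fn` are hypothetical names chosen for illustration, not part of the design under review.

```python
def make_tasks(dataset, batch_size, batches_per_task):
    """Split one pass of data into tasks; each task is a list of mini-batches."""
    batches = [dataset[i:i + batch_size]
               for i in range(0, len(dataset), batch_size)]
    return [batches[i:i + batches_per_task]
            for i in range(0, len(batches), batches_per_task)]


def run_pass(tasks, step_fn):
    """Run one pass: one step per mini-batch, over every task."""
    steps = 0
    for task in tasks:            # a task is what the master would hand out
        for mini_batch in task:   # each mini-batch drives exactly one step
            step_fn(mini_batch)
            steps += 1
    return steps
```

With ten instances, a batch size of 2, and two mini-batches per task, one pass consists of three tasks and five steps.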
<img src="src/paddle-trainer.png" width="600"/>
To wait for the other trainers in the same epoch, use `waitEpochFinish` to decide whether an epoch has finished before entering the next training epoch.
The trainer does not need to know about epochs (the start of a new pass); it just gets tasks from the master. So I think `waitEpochFinish` is not necessary.
## Event Handler
The trainer that handles Python client events is selected the same way as for parameter initialization: every trainer tries to acquire a distributed lock, and the one that succeeds is elected leader. The leader trainer keeps writing to a file or sending metric data to the evaluatorServer, and the Python client can then use that data to draw metrics in real time.
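The lock-based election described above might look roughly like the following. This is only a sketch: an in-process `threading.Lock` stands in for the distributed lock, trainers are simulated as threads, and `LeaderElection`, `try_become_leader`, and `elect` are hypothetical names rather than an existing API.

```python
import threading


class LeaderElection:
    """First trainer to take the lock becomes leader and keeps the lock."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for a distributed lock
        self.leader = None

    def try_become_leader(self, trainer_id):
        # Non-blocking acquire: exactly one caller can succeed, because
        # the winner never releases the lock in this sketch.
        if self._lock.acquire(blocking=False):
            self.leader = trainer_id
            return True
        return False


def elect(trainer_ids):
    """Let every trainer race for the lock; return the leader and who won."""
    election = LeaderElection()
    won = {}

    def run(tid):
        won[tid] = election.try_become_leader(tid)

    threads = [threading.Thread(target=run, args=(tid,)) for tid in trainer_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return election.leader, won
```

Only the trainer for which `try_become_leader` returned `True` would go on to write the metrics file or send metric data; the others continue training without reporting.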
Maybe it is too early to put the "Event Handler" section into a design doc (we have not reached consensus yet).
Please see: #2364 (comment)
No description provided.