Do we need an update API in the new pserver cclient? #2347
Comments
We do not need an `update()` interface in v1, since the parameter optimizer's job is done on the pserver side.
Great, I agree with you that for now we do not need to do the optimization locally. Here I mean: do we need an `update()` interface in the pserver cclient?
For ASGD, the pserver will immediately update the parameters once the trainer sends it the gradients. When the trainer calls get parameter, it will always get the latest model. From the behavior above, it feels like the trainer's role is just to provide gradients to the pserver, and the pserver decides when and how to update the parameters?
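To make the ASGD flow above concrete, here is a minimal trainer-side sketch. It assumes the `paddle_send_grads` call named in this issue and a `paddle_get_params`-style fetch following the naming in the pserver client design doc; the types and signatures below are illustrative stand-ins, not the real cclient API.

```c
#include <stddef.h>

/* Illustrative stand-ins for the cclient types; not the real definitions. */
typedef struct { const char* name; unsigned char* content; size_t size; } paddle_gradient;
typedef struct { const char* name; unsigned char* content; size_t size; } paddle_parameter;
typedef int paddle_pserver_client;

/* Assumed cclient entry points (paddle_send_grads is named in this issue). */
int paddle_send_grads(paddle_pserver_client c, paddle_gradient** grads, int n);
int paddle_get_params(paddle_pserver_client c, paddle_parameter** dst, int n);

/* One ASGD step from the trainer's point of view: push gradients, then pull
 * whatever the pserver currently holds. There is no explicit update() call;
 * the pserver applies the gradients immediately on its own. */
void asgd_step(paddle_pserver_client c,
               paddle_gradient** grads, int ngrads,
               paddle_parameter** params, int nparams) {
    paddle_send_grads(c, grads, ngrads);   /* pserver updates right away */
    paddle_get_params(c, params, nparams); /* always returns the latest model */
}
```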
@dzhwinter here is what I have in mind:

- In this case the trainer lacking data will just time out, and the pserver will move on once the timeout threshold has been reached (see the sketch after this list).
- It is not the responsibility of the trainer or the pserver. The master server will know when a pass or the whole training process is finished.
- Same as the first point: there will be a timeout threshold.
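A rough sketch of the timeout behaviour in the first point, using a hypothetical pserver-side helper: it only illustrates the control flow (wait for all trainers, then move on once the threshold is reached), not the actual pserver implementation.

```c
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define GRAD_TIMEOUT_SECONDS 60  /* hypothetical threshold */

/* Wait until `expected` trainers have reported gradients, or until the
 * timeout threshold is reached. A trainer that has run out of data simply
 * never reports, and the pserver moves on without it. */
bool wait_for_gradients(int expected, int (*received_count)(void)) {
    time_t start = time(NULL);
    while (received_count() < expected) {
        if (difftime(time(NULL), start) >= GRAD_TIMEOUT_SECONDS) {
            return false;  /* threshold reached: proceed with what we have */
        }
        usleep(10 * 1000); /* real code would block on a condition variable */
    }
    return true;
}
```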
I think we already agreed that our plan for the first version is to only implement trainer-side optimization: trainers will send the parameter diff to pservers, and pservers only do simple averaging.
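For the "simple averaging" part, a minimal sketch of what the pserver-side step could look like, assuming parameters and diffs are plain float arrays of equal length (an assumption for illustration only):

```c
#include <stddef.h>

/* Average the parameter diffs received from the trainers and apply the
 * result to the stored parameter: param += mean(diff_1, ..., diff_n). */
void apply_averaged_diffs(float* param, size_t len,
                          float* const* diffs, int num_trainers) {
    if (num_trainers <= 0) return;
    for (size_t i = 0; i < len; ++i) {
        float sum = 0.0f;
        for (int t = 0; t < num_trainers; ++t) {
            sum += diffs[t][i];
        }
        param[i] += sum / (float)num_trainers;
    }
}
```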
The timeout strategy seems great; it simplifies the coordination problem!
Agree with @helinwang on the timeout strategy.
The pserver's timeout is different from the master's timeout: a master timeout will mark the task as "failed". Consider a trainer that lacks batches because it is training a "small" task; the process may look like:
@typhoonzero Yes, the process is correct 👍
I think for SGD we expect all trainers to finish a mini-batch in roughly the same time, so timeouts will not happen often. However, timeouts may still happen due to network or other issues. In that case we can use a more aggressive timeout, add backup trainers, or switch to ASGD.
The API has been discussed and we have reached agreement. Closing this issue.
Original issue description:

In this design (https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md) we don't have an `update()` interface for updating parameters; I guess we want to do this immediately after calling `paddle_send_grads`. My question is: do we need to add an `update` function for updating parameters to the pserver cclient?
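For concreteness, a hedged sketch of how the two alternatives would look from the trainer's side. `paddle_update` is purely hypothetical and not part of the design; the other names follow the naming used in this issue and the design doc.

```c
/* Option A (current design): the update is implicit. The pserver applies the
 * gradients on its own right after they arrive, so the trainer only sends
 * and later fetches:
 *
 *     paddle_send_grads(client, grads, ngrads);
 *     paddle_get_params(client, params, nparams);
 *
 * Option B (what this issue asks about): an explicit, hypothetical update
 * call would let the caller decide when the pserver applies the gradients:
 *
 *     paddle_send_grads(client, grads, ngrads);
 *     paddle_update(client);                 // hypothetical, not in the design
 *     paddle_get_params(client, params, nparams);
 */
```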