do we need a update api in new pserver cclient #2347

Closed

jacquesqiao opened this issue Jun 1, 2017 · 11 comments

@jacquesqiao
Member

In this design (https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md) we don't have an update() interface for updating parameters; I guess we want the update to happen immediately after calling paddle_send_grads.

My question is: do we need to add an update function to the pserver cclient for updating parameters?
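
To make the question concrete, here is a hypothetical sketch in Go of the calls being discussed. The Go names are illustrative stand-ins for the cclient functions mentioned in this thread (paddle_send_grads, paddle_get_params); Update() is the interface in question and is not part of the design doc.

```go
// Hypothetical sketch only: these Go names are not the real pserver cclient
// API; they mirror the calls discussed in this issue.
package sketch

type Gradient struct {
	Name    string
	Content []byte
}

type Parameter struct {
	Name    string
	Content []byte
}

type PServerClient interface {
	// Corresponds to paddle_send_grads; in the current design the pserver
	// starts the update/optimize step implicitly after receiving gradients.
	SendGrads(grads []Gradient) error
	// Corresponds to paddle_get_params; blocks until the pserver has
	// finished updating the parameters.
	GetParams(names []string) ([]Parameter, error)
	// Update is the call under discussion: an explicit signal from the
	// trainer telling the pserver to apply the gradients it has received.
	Update() error
}
```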

@dzhwinter
Contributor

dzhwinter commented Jun 1, 2017

We do not need an update() interface in v1, since the parameter optimization is done on the pserver side.
I think the trainer will need to provide an update() interface to support a future trainer-side optimizer (not implemented now).

@jacquesqiao
Member Author

Great, I agree that for now we do not need to optimize locally. What I mean here is: do we need an update interface in the cclient to tell the pserver to do the update/optimize? In the current design, the update/optimize is done by the pserver implicitly.

@helinwang
Contributor

helinwang commented Jun 1, 2017

For ASGD, the pserver updates the parameters immediately once a trainer sends it a gradient. When a trainer calls get parameter, it always gets the latest model.
For SGD, the pserver waits for all trainers to report their gradients, then updates the parameters; when a trainer calls get parameter, the call blocks until the update is finished. For fault tolerance, the pserver needs to control when to update the parameters: it has a timer, and if some trainer does not send its gradient in time, the pserver performs the update anyway. The trainer does not have this information, so it should not be the one controlling when to perform the update.

From the behavior above, it feels like the trainer's role is to provide gradients to the pserver, and the pserver decides when and how to update the parameters?
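
Here is a minimal sketch of that pserver-side behavior, assuming a single dense parameter block and plain SGD with a fixed learning rate; the types, names, and timeout handling below are illustrative assumptions, not the actual pserver implementation.

```go
package sketch

import (
	"sync"
	"time"
)

// PServer sketches the update policy described above: ASGD applies each
// gradient immediately; SGD buffers gradients and updates once every trainer
// has reported or a straggler timeout fires.
type PServer struct {
	mu          sync.Mutex
	cond        *sync.Cond // wakes GetParams waiters when an update finishes
	params      []float32
	pending     [][]float32 // gradients buffered for the current SGD round
	numTrainers int
	timeout     time.Duration // how long to wait for stragglers in SGD
	asgd        bool
}

func NewPServer(params []float32, numTrainers int, timeout time.Duration, asgd bool) *PServer {
	s := &PServer{params: params, numTrainers: numTrainers, timeout: timeout, asgd: asgd}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// ReportGrad is called when a trainer sends its gradient.
func (s *PServer) ReportGrad(grad []float32) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.asgd {
		s.apply([][]float32{grad})
		return
	}
	s.pending = append(s.pending, grad)
	if len(s.pending) == 1 {
		// First gradient of this round: arm the straggler timer.
		time.AfterFunc(s.timeout, func() {
			s.mu.Lock()
			defer s.mu.Unlock()
			if len(s.pending) > 0 {
				s.apply(s.pending) // some trainer was too slow; move on
			}
		})
	}
	if len(s.pending) == s.numTrainers {
		s.apply(s.pending) // everyone reported; update now
	}
}

// GetParams blocks (in SGD mode) while an update round is still in flight,
// so trainers always read the updated model.
func (s *PServer) GetParams() []float32 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for !s.asgd && len(s.pending) > 0 {
		s.cond.Wait()
	}
	out := make([]float32, len(s.params))
	copy(out, s.params)
	return out
}

// apply averages the given gradients and takes one SGD step (caller holds mu).
func (s *PServer) apply(grads [][]float32) {
	const lr = 0.01 // assumed learning rate
	for _, g := range grads {
		for i := range s.params {
			s.params[i] -= lr * g[i] / float32(len(grads))
		}
	}
	s.pending = nil
	s.cond.Broadcast()
}
```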

@dzhwinter
Contributor

dzhwinter commented Jun 1, 2017

  • For pserver-side optimization, I completely agree with you.
    There are a few details to figure out for SGD:
    1. How does a trainer report that its training epoch is over?
    For example, with 8 parts of data and 3 trainer nodes, one node will obviously be short one batch of data. In SGD, will this machine send an empty parameter, just deregister itself from the training node map, or something else?
    2. When do we decide that the whole training process is finished? When all machines reach the same epoch count?
    3. What about the lagged-node problem: kick the node out during training?
    ...
    I think these things need to be discussed thoroughly. The synchronization part will be a fair amount of code; should we put it into the pserver?

  • Another issue: will we support trainer-side optimization? As far as I know, we would only implement this feature in the future for performance reasons?
    https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/cluster_train/pserver_client.md#model-optimization-using-gradients

@helinwang
Contributor

helinwang commented Jun 1, 2017

@dzhwinter here is what I have in mind:

For example, with 8 parts of data and 3 trainer nodes, one node will obviously be short one batch of data. In SGD, will this machine send an empty parameter, just deregister itself from the training node map, or something else?

In this case the trainer lacking data will just time out; the pserver will move on after the timeout threshold has been reached.

When do we decide that the whole training process is finished?

That is not the responsibility of the trainer or the pserver. The master server will know when a pass or the whole training process is finished.

What about the lagged-node problem: kick the node out during training?

Same as the first one: there will be a timeout threshold.

Another issue: will we support trainer-side optimization? As far as I know, we would only implement this feature in the future for performance reasons?

I think we already agreed that our plan for the first version is to only implement trainer-side optimization: trainers send the parameter diff to the pservers, and the pservers only do simple averaging.
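
A sketch of that "pserver only averages" idea, assuming each trainer sends a full dense parameter diff; the function below is illustrative, not the actual pserver code.

```go
package sketch

// applyDiffs averages the parameter diffs reported by the trainers and adds
// the average to the current parameters. Under this plan the optimizer runs
// on the trainer side, so each diff already reflects an optimization step.
func applyDiffs(params []float32, diffs [][]float32) {
	if len(diffs) == 0 {
		return
	}
	for i := range params {
		var sum float32
		for _, d := range diffs {
			sum += d[i]
		}
		params[i] += sum / float32(len(diffs))
	}
}
```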

@dzhwinter
Contributor

dzhwinter commented Jun 2, 2017

The timeout strategy seems great, it simplifies the coordination problem!
The timeout threshold would be set to retry times * send-data interval; can we afford this delay overhead? If there is one lagged node (not a dead one), then the delay will be the full timeout threshold in every send_grads/update. Can we afford that price?

@typhoonzero
Contributor

typhoonzero commented Jun 2, 2017

Agree with @helinwang on the update() interface.

In this case the trainer lacking data will just time out; the pserver will move on after the timeout threshold has been reached.

The pserver's timeout is different from the master's timeout: a master timeout marks the task as "failed". Consider the trainer that lacks batches as training a "small" task; the process may look like this (a rough sketch of the trainer loop follows the list):

  1. The trainer with the "small" task finishes its task earlier than the others.
  2. The master marks this task as "Done" and dispatches a new task to this trainer.
  3. The trainer then calls "paddle_get_params" to fetch parameters and start training on the new task. Because the pserver may still be waiting for all trainers to send gradients, the call may block until the parameters have been updated on the pserver side.
  4. The cluster goes on training, with one trainer working on a new task and the other trainers working on old tasks.
  5. Go back to 1.
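
As referenced above, here is a rough sketch of that trainer loop. GetTask, TaskDone, GetParams, and SendGrads are hypothetical stand-ins for the master-client and pserver-client calls (the thread refers to paddle_get_params on the C side), not the real API names.

```go
package sketch

type Task struct{ ID int }

// Hypothetical master and pserver client interfaces; the names are
// illustrative, not the real PaddlePaddle APIs.
type MasterClient interface {
	GetTask() (Task, error)
	TaskDone(Task) error
}

type PSClient interface {
	GetParams() ([]float32, error)
	SendGrads([]float32) error
}

// train is a placeholder for running the trainer over one task's data.
func train(t Task, params []float32) []float32 {
	return make([]float32, len(params))
}

// trainerLoop follows the numbered process above.
func trainerLoop(master MasterClient, ps PSClient) error {
	for {
		// Step 2: the master dispatches the next task to this trainer.
		task, err := master.GetTask()
		if err != nil {
			return err // e.g. no tasks left: the pass is finished
		}
		// Step 3: may block until the pserver finishes the current round of
		// updates (it could still be waiting for other trainers' gradients).
		params, err := ps.GetParams()
		if err != nil {
			return err
		}
		grads := train(task, params)
		if err := ps.SendGrads(grads); err != nil {
			return err
		}
		// Step 1: report the finished task, possibly earlier than the other
		// trainers if this was a "small" task; then go back to step 2.
		if err := master.TaskDone(task); err != nil {
			return err
		}
	}
}
```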

@jacquesqiao
Member Author

jacquesqiao commented Jun 2, 2017

@typhoonzero

Then the trainer will call "paddle_get_params" to fetch parameters to start a new training batch, because master may still be waiting for all trainers to send gradients

master may still be waiting ==> pserver may still be waiting?

@helinwang
Contributor

@typhoonzero Yes the process is correct 👍

@helinwang
Contributor

helinwang commented Jun 2, 2017

@dzhwinter

The timeout threshold would be set to retry times * send-data interval; can we afford this delay overhead? If there is one lagged node (not a dead one), then the delay will be the full timeout threshold in every send_grads/update. Can we afford that price?

I think for SGD we expect all trainers to finish a mini-batch in roughly the same amount of time, so timeouts should not happen often. However, timeouts may still happen frequently due to network or other issues; in that case we can use a more aggressive timeout, add backup trainers, or switch to ASGD.

@helinwang
Contributor

helinwang commented Aug 10, 2017

The API has been discussed and agreement has been reached. Closing this issue.
