
tfjob status not match when EnableDynamicWorker set true #1452

Closed
zw0610 opened this issue Oct 27, 2021 · 4 comments

Comments

zw0610 commented Oct 27, 2021

In the proposal Support ClusterSpec Propagation Feature, the following behaviors are expected:

  1. Worker Failover: If a worker fails (e.g., OOM) or is evicted (e.g., insufficient resources), the training continues. Later, once the failed worker restarts, it can rejoin the training job dynamically without interrupting the training process.
  2. Scale Workers Up/Down: During training, we can dynamically add/remove workers on the fly based on need. This is particularly helpful for online learning -- use more workers during peak time and fewer during off-peak time.

However, when testing a TFJob with EnableDynamicWorker set to true (controller: v1.3.0), the observed behavior does not match:

  1. deleting a worker pod (simulating a worker failure) turns the TFJob status to Failed, which in turn terminates the other pods
  2. while scaling up (increasing the worker replicas) succeeds and new worker pods are created, scaling down fails (in my view) because no worker pods are deleted or evicted (see the sketch after this list for how the replica count was changed)
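For reference, the scale test above amounts to changing the Worker replica count on the TFJob. Below is a minimal sketch of doing that with client-go's dynamic client, assuming the kubeflow.org/v1 tfjobs resource; the kubeconfig location, namespace ("default"), and job name ("my-tfjob") are placeholders, not values from this report.

```go
// Sketch: scale a TFJob's workers by patching spec.tfReplicaSpecs.Worker.replicas.
// Namespace and job name below are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config); adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// TFJob CRD coordinates for the v1 API.
	tfjobs := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1", Resource: "tfjobs"}

	// Merge-patch the Worker replica count, e.g. scale from 3 down to 2.
	patch := []byte(`{"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2}}}}`)
	_, err = client.Resource(tfjobs).Namespace("default").Patch(
		context.TODO(), "my-tfjob", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched Worker replicas to 2")
}
```
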
zw0610 commented Oct 27, 2021

Could we reach a consensus that, when EnableDynamicWorker is set to true in a TFJob:

  1. instead of being set to Failed when a Worker pod is missing, the TFJob should remain Running
  2. when the Worker replica count is changed to a smaller value, Worker pods whose index is larger than the new count should be removed (a sketch of this rule follows below)

/cc @kubeflow/wg-training-leads

Scaling down by targeting specific pods may come as an enhancement in the next stage.
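To make point 2 concrete, here is a minimal sketch of the index-based scale-down rule (illustrative only, not the controller's actual code): keep Worker indices 0..N-1 and remove the rest. How each pod's index is obtained (replica-index label, pod-name suffix, ...) is deliberately left abstract here.

```go
// Sketch of the proposed scale-down rule: when Worker replicas shrink to N,
// delete the Worker pods whose replica index is >= N and keep indices 0..N-1.
package main

import (
	"fmt"
	"sort"
)

// workerPod is a stand-in for however the controller tracks a Worker pod and
// its replica index.
type workerPod struct {
	Name  string
	Index int
}

// podsToDelete returns the pods whose index exceeds the desired replica count.
func podsToDelete(pods []workerPod, desiredReplicas int) []workerPod {
	var victims []workerPod
	for _, p := range pods {
		if p.Index >= desiredReplicas {
			victims = append(victims, p)
		}
	}
	// Remove the highest indices first so the surviving set stays contiguous.
	sort.Slice(victims, func(i, j int) bool { return victims[i].Index > victims[j].Index })
	return victims
}

func main() {
	pods := []workerPod{
		{"job-worker-0", 0}, {"job-worker-1", 1}, {"job-worker-2", 2}, {"job-worker-3", 3},
	}
	// Scale Worker from 4 down to 2: workers 3 and 2 would be removed.
	for _, p := range podsToDelete(pods, 2) {
		fmt.Println("would delete", p.Name)
	}
}
```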

@gaocegege

Should we keep such a field here?

I think we can check whether the job is elastic from its min and max replicas.

zw0610 commented Oct 27, 2021

Should we keep such a field here?

I think we can check whether the job is elastic from its min and max replicas.

I would suggest keeping this field for now, until TensorFlow supports elasticity for both ParameterServer and Worker. Otherwise, it is not worth breaking API compatibility, since TensorFlow only supports Worker elasticity so far.

Meanwhile, min and max replicas do not apply here. For a generic operator, min and max replicas do not make sense, as they do not specify a state that users expect the system to reach. (In PyTorchJob, these two fields are translated into environment variables, which has no counterpart here.)
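For illustration only, the two API shapes being weighed here look roughly like the Go sketch below; the struct and field names approximate the upstream TFJob and PyTorchJob types rather than quoting them, and isElastic is just the check suggested above, not an existing helper.

```go
// Illustrative sketch of the two API shapes discussed above; names approximate
// the upstream types and are not copied verbatim.
package main

import "fmt"

// Shape kept in TFJob today: a single switch telling the controller to
// tolerate a changing Worker set and regenerate the cluster spec accordingly.
type TFJobSpecSketch struct {
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
	// tfReplicaSpecs, runPolicy, ... omitted
}

// Shape proposed above (PyTorchJob-style): infer elasticity from a range.
// For PyTorch these bounds are passed to the elastic launcher through
// environment variables, which has no TensorFlow counterpart today.
type ElasticPolicySketch struct {
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	MaxReplicas *int32 `json:"maxReplicas,omitempty"`
}

// isElastic is the check suggested above: the job is elastic when the bounds
// describe a real range rather than a fixed size.
func isElastic(p *ElasticPolicySketch) bool {
	return p != nil && p.MinReplicas != nil && p.MaxReplicas != nil &&
		*p.MinReplicas < *p.MaxReplicas
}

func main() {
	lo, hi := int32(1), int32(4)
	fmt.Println(isElastic(&ElasticPolicySketch{MinReplicas: &lo, MaxReplicas: &hi})) // true
}
```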

zw0610 changed the title from "tfjob status and worker replica scaling not match when EnableDynamicWorker set true" to "tfjob status not match when EnableDynamicWorker set true" on Oct 28, 2021

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label on Mar 2, 2022
stale bot closed this as completed on Apr 17, 2022