
tfjob status not match when EnableDynamicWorker set true #1452

Closed
zw0610 opened this issue Oct 27, 2021 · 4 comments

Comments

zw0610 commented Oct 27, 2021

In the proposal Support ClusterSpec Propagation Feature, the following behaviors are expected:

  1. Worker Failover: If a worker fails (e.g., OOM) or is evicted (e.g., insufficient resources), the training continues. Later, once the failed worker restarts, it can rejoin the training job dynamically without interrupting the training process.
  2. Scale Workers Up/Down: During training, we can dynamically add/remove workers on the fly based on need. This is particularly helpful for online learning -- use more workers during peak time and fewer during off-peak time.

However, when testing a TFJob with EnableDynamicWorker set to true (controller: v1.3.0), the observed behavior does not match:

  1. deleting a worker pod (simulating a worker failure) turns the TFJob status to Failed, which in turn terminates the other pods
  2. while scaling up (increasing the worker replicas) succeeds and new worker pods are created, scaling down fails (in my view) because no worker pods are deleted or evicted (see the sketch after this list for how the replica count was changed)
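For reference, the scale test above amounts to changing the Worker replica count on the TFJob. Below is a minimal sketch of doing that with client-go's dynamic client, assuming the kubeflow.org/v1 tfjobs resource; the kubeconfig location, namespace ("default"), and job name ("my-tfjob") are placeholders, not values from this report.

```go
// Sketch: scale a TFJob's workers by patching spec.tfReplicaSpecs.Worker.replicas.
// Namespace and job name below are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config); adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// TFJob CRD coordinates for the v1 API.
	tfjobs := schema.GroupVersionResource{Group: "kubeflow.org", Version: "v1", Resource: "tfjobs"}

	// Merge-patch the Worker replica count, e.g. scale from 3 down to 2.
	patch := []byte(`{"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2}}}}`)
	_, err = client.Resource(tfjobs).Namespace("default").Patch(
		context.TODO(), "my-tfjob", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched Worker replicas to 2")
}
```
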
zw0610 commented Oct 27, 2021

Could we reach a consensus that, when EnableDynamicWorker is set to true in a TFJob:

  1. instead of being set to Failed when a Worker pod is missing, the TFJob should remain Running
  2. when the Worker replica count is changed to a smaller value, Worker pods whose index is larger than the new count should be removed (a sketch of this rule follows below)

/cc @kubeflow/wg-training-leads

Scaling down by targeting specific pods may come as an enhancement in the next stage.
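To make point 2 concrete, here is a minimal sketch of the index-based scale-down rule (illustrative only, not the controller's actual code): keep Worker indices 0..N-1 and remove the rest. How each pod's index is obtained (replica-index label, pod-name suffix, ...) is deliberately left abstract here.

```go
// Sketch of the proposed scale-down rule: when Worker replicas shrink to N,
// delete the Worker pods whose replica index is >= N and keep indices 0..N-1.
package main

import (
	"fmt"
	"sort"
)

// workerPod is a stand-in for however the controller tracks a Worker pod and
// its replica index.
type workerPod struct {
	Name  string
	Index int
}

// podsToDelete returns the pods whose index exceeds the desired replica count.
func podsToDelete(pods []workerPod, desiredReplicas int) []workerPod {
	var victims []workerPod
	for _, p := range pods {
		if p.Index >= desiredReplicas {
			victims = append(victims, p)
		}
	}
	// Remove the highest indices first so the surviving set stays contiguous.
	sort.Slice(victims, func(i, j int) bool { return victims[i].Index > victims[j].Index })
	return victims
}

func main() {
	pods := []workerPod{
		{"job-worker-0", 0}, {"job-worker-1", 1}, {"job-worker-2", 2}, {"job-worker-3", 3},
	}
	// Scale Worker from 4 down to 2: workers 3 and 2 would be removed.
	for _, p := range podsToDelete(pods, 2) {
		fmt.Println("would delete", p.Name)
	}
}
```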

@gaocegege

Should we keep such a field here?

I think we can check whether the job is elastic from its min and max replicas.

zw0610 commented Oct 27, 2021

Should we keep such a field here?

I think we can check whether the job is elastic from its min and max replicas.

I would suggest keeping this field for now, until TensorFlow supports elasticity for both ParameterServer and Worker. Otherwise, it is not worth breaking API compatibility, since TensorFlow only supports Worker elasticity so far.

Meanwhile, min and max replicas do not apply here. For a generic operator, min and max replicas do not make sense, as they do not specify a state that users expect the system to reach. (In PyTorchJob, these two fields are translated into environment variables, which has no counterpart here.)
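For illustration only, the two API shapes being weighed here look roughly like the Go sketch below; the struct and field names approximate the upstream TFJob and PyTorchJob types rather than quoting them, and isElastic is just the check suggested above, not an existing helper.

```go
// Illustrative sketch of the two API shapes discussed above; names approximate
// the upstream types and are not copied verbatim.
package main

import "fmt"

// Shape kept in TFJob today: a single switch telling the controller to
// tolerate a changing Worker set and regenerate the cluster spec accordingly.
type TFJobSpecSketch struct {
	EnableDynamicWorker bool `json:"enableDynamicWorker,omitempty"`
	// tfReplicaSpecs, runPolicy, ... omitted
}

// Shape proposed above (PyTorchJob-style): infer elasticity from a range.
// For PyTorch these bounds are passed to the elastic launcher through
// environment variables, which has no TensorFlow counterpart today.
type ElasticPolicySketch struct {
	MinReplicas *int32 `json:"minReplicas,omitempty"`
	MaxReplicas *int32 `json:"maxReplicas,omitempty"`
}

// isElastic is the check suggested above: the job is elastic when the bounds
// describe a real range rather than a fixed size.
func isElastic(p *ElasticPolicySketch) bool {
	return p != nil && p.MinReplicas != nil && p.MaxReplicas != nil &&
		*p.MinReplicas < *p.MaxReplicas
}

func main() {
	lo, hi := int32(1), int32(4)
	fmt.Println(isElastic(&ElasticPolicySketch{MinReplicas: &lo, MaxReplicas: &hi})) // true
}
```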

zw0610 changed the title from "tfjob status and worker replica scaling not match when EnableDynamicWorker set true" to "tfjob status not match when EnableDynamicWorker set true" on Oct 28, 2021

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the lifecycle/stale label on Mar 2, 2022
stale bot closed this as completed on Apr 17, 2022