TFJob status does not match when EnableDynamicWorker is set to true #1452
Comments
Could we reach a consensus that when …
/cc @kubeflow/wg-training-leads Scaling down by targeting a specific pod may come as an enhanced feature in the next stage.
Should we keep such a field here? I think we can check whether the job is elastic from its min and max replicas.
I would suggest keeping this field for now, until TensorFlow supports elasticity for both. Meanwhile, min and max replicas do not apply here. For a generic operator, min and max replicas do not make sense, as they do not specify a state users expect the system to reach. (In PyTorchJob, these two fields are translated into environment variables, which does not apply here.)
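To illustrate the translation mentioned in the parenthetical, here is a rough Go sketch; the `ElasticPolicy` struct, its field names, and the environment-variable format are hypothetical stand-ins for illustration only, not the operator's actual API:

```go
package main

import "fmt"

// ElasticPolicy is a hypothetical stand-in for PyTorchJob-style elastic
// settings: the lower and upper bounds the job may scale between.
type ElasticPolicy struct {
	MinReplicas int32
	MaxReplicas int32
}

// nnodesEnv renders the bounds in the "min:max" form an elastic launcher
// typically consumes for its node-count range.
func nnodesEnv(p ElasticPolicy) string {
	return fmt.Sprintf("%d:%d", p.MinReplicas, p.MaxReplicas)
}

func main() {
	p := ElasticPolicy{MinReplicas: 2, MaxReplicas: 4}
	// For TFJob there is no launcher-side consumer of such a variable,
	// which is the point made above: the two fields would carry no meaning.
	fmt.Println("NNODES=" + nnodesEnv(p))
}
```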
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
In the proposal Support ClusterSpec Propagation Feature, the following behaviors are expected. However, when testing a tfjob with `EnableDynamicWorker` set true (controller: v1.3.0):

- the tfjob status turns to `Failed`, leading to the termination of other pods;
- while scaling up (increasing worker replicas) succeeds, as a new worker pod is created, scaling down fails (in my perspective), as no worker pod is deleted or evicted.
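For reference, a minimal Go sketch of how the scale-up/scale-down test above can be driven, assuming the `kubeflow.org/v1` TFJob CRD is installed and using client-go's dynamic client; the job name `mnist-dynamic` and namespace `default` are placeholders:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig; assumes it points at the cluster running
	// the training operator.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// TFJob CRD served by the training operator.
	tfjobs := schema.GroupVersionResource{
		Group: "kubeflow.org", Version: "v1", Resource: "tfjobs",
	}

	// Merge-patch the Worker replica count to a new target (scale up or
	// down); with EnableDynamicWorker the controller is expected to
	// reconcile pods toward this count instead of restarting the job.
	patch := []byte(`{"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2}}}}`)
	_, err = client.Resource(tfjobs).Namespace("default").Patch(
		context.TODO(), "mnist-dynamic", types.MergePatchType, patch,
		metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("patched Worker replicas to 2")
}
```

In the report above, patching the replica count upward results in a new worker pod, while patching it downward leaves all existing worker pods in place.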