fix https://github.com/kubeflow/training-operator/issues/1704 #1705
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
@@ -133,7 +133,8 @@ func (jc *MPIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
 	}

 	if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
-		logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String())
+		logger.Error(err, "MPIJob failed validation")
It would be nice to add req.NamespacedName.String() in the log as well.
I think adding req.NamespacedName.String() is redundant and strange as the value of a key/value pair.
As the following shows, the first log line comes from logger.Info(err.Error(), "MPIJob failed validation", req.NamespacedName.String()) and the second from logger.Error(err, "MPIJob failed validation").
By the way, this also belongs to the log format problem that I think we should optimize:
1.6717681860476494e+09 INFO PyTorchReplicaType is Master2 but must be one of [Master Worker] {"pytorchjob": "default/pytorch-test-validate", "PyTorchJob failed validation": "default/pytorch-test-validate"}
1.6717681860476797e+09 ERROR PyTorchJob failed validation {"pytorchjob": "default/pytorch-test-validate", "error": "PyTorchReplicaType is Master2 but must be one of [Master Worker]"}
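For context, a minimal self-contained sketch (not code from this PR) of why those two calls render so differently, assuming controller-runtime's zap-backed logr.Logger; the job name and error text are copied from the log lines above:

```go
package main

import (
	"errors"

	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	name := "default/pytorch-test-validate"
	// Mimic a reconciler logger that already carries the job key/value.
	logger := zap.New().WithValues("pytorchjob", name)

	err := errors.New("PyTorchReplicaType is Master2 but must be one of [Master Worker]")

	// Info(msg, keysAndValues...): the error text becomes the message, and the
	// intended message gets paired with the namespaced name as an extra
	// key/value, producing the odd `"PyTorchJob failed validation": "default/..."`
	// entry in the first line above.
	logger.Info(err.Error(), "PyTorchJob failed validation", name)

	// Error(err, msg, keysAndValues...): the message stays the message and the
	// error is attached under the "error" key, as in the second line above.
	logger.Error(err, "PyTorchJob failed validation")
}
```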
SGTM
@HeGaoYuan Can you sign the CLA?
@johnugeorge What is your suggestion about the event reason constants? I found the event reason constants are a little "messy". I am sorry, I am a code clean freak.
One problem that I see is: if we return an error after a validation failure, the job will still be reconciled continuously even though it is not recoverable. Should we mark the job as failed? /cc @gaocegege
/cc @tenzen-y
Yes, I also noticed this problem. Marking the job as failed relates to the "state transition table" problem I mentioned. As I said, the "state transition table" is not yet clear, so we should be careful about adding new state transitions. Is continuous reconciliation common and not a big problem? Can we decide it later, when we conclude the "state transition table"?
Can we create an issue to track this? A validation failure is a non-recoverable error, and I don't see any value in wasting resources on continuous reconciliation. We may track it in a different PR. Others, thoughts? /cc @kubeflow/wg-training-leads @kubeflow/common-team
@johnugeorge Does that mean that, in the following validation step, the training-operator should mark the job as failed? (training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go, lines 155 to 157 at 69813fb)
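To make the question concrete, here is a rough sketch (not code that exists in this PR) of what marking the job as failed on a validation error could look like, assuming kubeflow/common's JobStatus/JobCondition types; the helper name and the "JobFailedValidation" reason are illustrative:

```go
package validation

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"
)

// markFailedOnValidationError appends a terminal Failed condition instead of
// bubbling the validation error up, so the reconciler can return a nil error
// and controller-runtime will not requeue a job that can never become valid.
func markFailedOnValidationError(status *commonv1.JobStatus, validationErr error) {
	now := metav1.Now()
	status.Conditions = append(status.Conditions, commonv1.JobCondition{
		Type:               commonv1.JobFailed,
		Status:             corev1.ConditionTrue,
		Reason:             "JobFailedValidation", // illustrative reason string
		Message:            validationErr.Error(),
		LastUpdateTime:     now,
		LastTransitionTime: now,
	})
}
```

The caller would then persist the status via the status sub-resource and return ctrl.Result{}, nil so the invalid job is not requeued.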
Some inconsistencies happened because operators from multiple repos were merged into the training operator a couple of releases ago. We can use
I see. I agree with using
If we don't return an error for a ValidationError, reconciliation won't happen again. Is there a better solution?
Another possible solution is not
Yeah, I referred to that earlier. We should do that, as this error is non-recoverable anyway.
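In other words, a sketch of the non-requeue option inside Reconcile (assuming the reconciler exposes an event recorder as jc.Recorder; the event reason string is illustrative):

```go
// Inside Reconcile, after the MPIJob has been fetched (sketch, not the PR diff):
if err = kubeflowv1.ValidateV1MpiJobSpec(&mpijob.Spec); err != nil {
	logger.Error(err, "MPIJob failed validation")
	jc.Recorder.Eventf(mpijob, corev1.EventTypeWarning, "MPIJobFailedValidation",
		"MPIJob %s failed validation: %v", req.NamespacedName.String(), err)
	// Returning a nil error means controller-runtime will not requeue the
	// request; `return ctrl.Result{}, err` would keep retrying with backoff
	// even though the spec can never become valid on its own.
	return ctrl.Result{}, nil
}
```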
If you recommend to use
@johnugeorge Another, better option: if a validation error occurs, add a special annotation to the target CRD (e.g. TFJob) and then run training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go, lines 206 to 211 at 69813fb.
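A rough sketch of that annotation-based idea (the annotation key is purely illustrative and does not exist in the codebase, and it assumes the reconciler embeds a controller-runtime client):

```go
// Illustrative annotation key, not defined anywhere in the repo.
const failedValidationAnnotation = "kubeflow.org/job-failed-validation"

// On validation failure: record the problem on the object itself.
if err := kubeflowv1.ValidateV1TFJobSpec(&tfjob.Spec); err != nil {
	if tfjob.Annotations == nil {
		tfjob.Annotations = map[string]string{}
	}
	tfjob.Annotations[failedValidationAnnotation] = err.Error()
	if updateErr := jc.Update(ctx, tfjob); updateErr != nil {
		return ctrl.Result{}, updateErr
	}
	return ctrl.Result{}, nil
}

// On later reconciles: jobs already marked invalid can be skipped (or cleaned up) early.
if _, invalid := tfjob.Annotations[failedValidationAnnotation]; invalid {
	return ctrl.Result{}, nil
}
```

It costs an extra write per invalid job and needs a rule for clearing the annotation once the spec is fixed, which is probably what makes it feel complex.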
A little complex 😂.
I got it.
@HeGaoYuan Can you update it?
@johnugeorge Yes, I can update it. But then how about this? Should I keep
I would recommend updating the PR to use
In the next release, we can discuss and implement the state change in #1711. /cc @tenzen-y What do you think?
@HeGaoYuan We are creating a release tomorrow. Can you update this PR and rebase?
Sorry for the late response; I missed the notification. That makes sense. It would be better to discuss that after the next release, since we should handle the behavior of Job conditions carefully. So it would be better to change only the error reason.
@johnugeorge Would you like to take over this PR before we cut the new release? Or should we postpone this improvement until after the next release?
@johnugeorge @tenzen-y
@HeGaoYuan Thanks for the updates!
/lgtm
/assign @johnugeorge
@tenzen-y Have you noticed that tests are really flaky now?
Is it E2E?
There are e2e failures. Also, the Publish Images workflows take a longer time.
Thanks @HeGaoYuan /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: HeGaoYuan, johnugeorge
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@johnugeorge I don't think these changes caused these flaky tests.
As far as I can see, training jobs sometimes fail... We might need to improve the sample training code.
I guess building jobs have taken longer since #1692.
Which issue(s) this PR fixes: Fixes #1704
Checklist:
And I found the event reason constants are a little "messy", so I used a string literal, but I am waiting to rebase my code.
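For reference, a sketch of the two styles being weighed for the event reason; both the literal and the constant name below are illustrative, not identifiers that exist in the repo:

```go
// Option A: a string literal at the call site, as this PR currently does.
jc.Recorder.Event(mpijob, corev1.EventTypeWarning, "MPIJobFailedValidation", err.Error())

// Option B: a single shared constant (for example in a common util package),
// so every controller reports validation failures under the same reason.
const JobFailedValidationReason = "JobFailedValidation"

jc.Recorder.Event(mpijob, corev1.EventTypeWarning, JobFailedValidationReason, err.Error())
```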