Mark InProgress backup/restore as failed upon requeuing #7863
Conversation
Force-pushed from 45111eb to 91b2a54
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main    #7863      +/-   ##
==========================================
- Coverage   58.80%   58.78%   -0.02%
==========================================
  Files         345      345
  Lines       28759    28777      +18
==========================================
+ Hits        16911    16917       +6
- Misses      10420    10431      +11
- Partials     1428     1429       +1

☔ View full report in Codecov by Sentry.
Force-pushed from 97b85e8 to c6aa114
Signed-off-by: Tiger Kaovilai <[email protected]>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <[email protected]>
Just need to add unit tests and will mark ready for review.
// return the error so the status can be re-processed; it's currently still not completed or failed
return ctrl.Result{}, err
This is where we patch completion status on restore. We can store completion in memory so that subsequent requeues patch status to complete rather than to fail.
In memory or on disk? If in-memory, we just need to be careful about thread safety (to handle future cases where we have multiple reconcilers).
Also, we have the regular controller and the finalizer controller, so there are two different places/patches to deal with, i.e. which CR and which reconciler/controller.
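A minimal sketch of the in-memory idea discussed here, assuming a hypothetical completionTracker type that is not part of the Velero codebase; the names are illustrative only, and a mutex keeps it safe if multiple reconciler workers are added later:

```go
package controller

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// completionTracker is a hypothetical, mutex-guarded record of CRs whose work
// finished but whose final status patch has not yet succeeded. It is safe for
// use by multiple reconciler workers.
type completionTracker struct {
	mu        sync.Mutex
	completed map[types.NamespacedName]struct{}
}

func newCompletionTracker() *completionTracker {
	return &completionTracker{completed: map[types.NamespacedName]struct{}{}}
}

// MarkCompleted records that the backup/restore finished successfully, so a
// later requeue can patch the status to Completed instead of Failed.
func (t *completionTracker) MarkCompleted(key types.NamespacedName) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.completed[key] = struct{}{}
}

// IsCompleted reports whether the CR was already completed in this process.
func (t *completionTracker) IsCompleted(key types.NamespacedName) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.completed[key]
	return ok
}
```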
Finalizer controllers won't mark backup/restore as completed, right? This in-memory state is limited to the Completed phase.
Actually, the finalizer controller is the only one that marks it completed. The backup/restore controller moves it from InProgress to Finalizing. Both of these transitions need the requeue-to-fail-or-update handling, though.
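For reference, a condensed illustration of the two transitions being discussed; the function names are hypothetical and only meant to show which controller owns which hop, while the phase constants are the actual Velero API constants:

```go
package controller

import (
	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// The regular backup controller owns the first transition:
// InProgress -> Finalizing.
func backupControllerTransition(phase velerov1api.BackupPhase) velerov1api.BackupPhase {
	if phase == velerov1api.BackupPhaseInProgress {
		return velerov1api.BackupPhaseFinalizing
	}
	return phase
}

// The backup finalizer controller owns the second transition:
// Finalizing -> Completed. Both status patches can fail if the API server is
// unavailable, so both need the requeue-to-fail-or-update handling.
func finalizerControllerTransition(phase velerov1api.BackupPhase) velerov1api.BackupPhase {
	if phase == velerov1api.BackupPhaseFinalizing {
		return velerov1api.BackupPhaseCompleted
	}
	return phase
}
```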
ack Thanks for the catch.
@@ -307,7 +320,10 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
	log.Info("Updating backup's final status")
	if err := kubeutil.PatchResource(original, request.Backup, b.kbClient); err != nil {
		log.WithError(err).Error("error updating backup's final status")
		// return the error so the status can be re-processed; it's currently still not completed or failed
I recall that the CR won't be re-enqueued when returning an error from the reconciler; could you confirm this?
We had confirmed that returning an error causes the reconciler to requeue. This is equivalent to returning with requeue: true, but with more information.
See: https://pkg.go.dev/sigs.k8s.io/[email protected]/pkg/reconcile#Reconciler
// If the returned error is non-nil, the Result is ignored and the request will be
// requeued using exponential backoff. The only exception is if the error is a
// TerminalError in which case no requeuing happens.
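A minimal sketch contrasting the return patterns the quoted godoc describes, against the controller-runtime API; the errGaveUp sentinel and the function name are illustrative only:

```go
package controller

import (
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// errGaveUp is a hypothetical sentinel error used only for this example.
var errGaveUp = errors.New("gave up after retries")

// exampleReturns contrasts the reconciler return patterns discussed above.
func exampleReturns(patchErr error) (ctrl.Result, error) {
	switch {
	case patchErr == nil:
		// Success: no requeue.
		return ctrl.Result{}, nil
	case errors.Is(patchErr, errGaveUp):
		// TerminalError: recorded as an error but never requeued.
		return ctrl.Result{}, reconcile.TerminalError(patchErr)
	default:
		// A non-nil error requeues with exponential backoff and the Result is
		// ignored. ctrl.Result{Requeue: true} with a nil error also requeues,
		// but without surfacing why in logs and metrics.
		return ctrl.Result{}, patchErr
	}
}
```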
@@ -230,6 +230,20 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
	switch original.Status.Phase {
	case "", velerov1api.BackupPhaseNew:
		// only process new backups
	case velerov1api.BackupPhaseInProgress:
I suggest we don't add this change. It assumes the InProgress CR won't be re-enqueued, and that constraint may hinder future development:
- If we want to add a periodic enqueue mechanism, this assumption will break.
- If we want to support a feature like Cancel, this assumption will break.
- When we support parallel backups, this assumption will break.
- This mechanism only works for backup/restore CRs in the Velero server pod; for CRs handled by the node-agent, the assumption does not hold, since the node-agent is a DaemonSet.

Generally, this change is too delicate, and unless it's necessary we should avoid adding it. Considering the problem we are trying to solve, I think the retry mechanism is enough as a short-term fix; for the long term, we should develop a mechanism that works with a backup window, and the current change won't be the ultimate fix.
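For comparison, a rough sketch of the retry approach referred to above, wrapping the final status patch with client-go's retry helper; patchFinalStatus is a stand-in for the actual patch call, and the backoff values are illustrative:

```go
package controller

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// patchWithRetry retries the final status patch on errors that look
// transient, instead of relying on a reconcile requeue.
func patchWithRetry(patchFinalStatus func() error) error {
	backoff := wait.Backoff{Steps: 5, Duration: time.Second, Factor: 2.0, Jitter: 0.1}
	return retry.OnError(backoff, func(err error) bool {
		// Retry only when the API server failure looks temporary.
		return apierrors.IsServerTimeout(err) ||
			apierrors.IsTimeout(err) ||
			apierrors.IsTooManyRequests(err) ||
			apierrors.IsInternalError(err)
	}, patchFinalStatus)
}
```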
"When we support parallel backups, this assumption will break"
Actually, this shouldn't break it. With parallel backups, InProgress backups already being handled by controller threads won't be reconciled again.
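For context on that claim: controller-runtime's workqueue never hands the same namespace/name key to two workers at once, which is what the comment relies on. Raising concurrency would look roughly like the hypothetical setup helper below (the function name is illustrative; the builder and Options types are controller-runtime's):

```go
package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	velerov1api "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
)

// setupWithConcurrency registers a backup reconciler with multiple workers.
// Even then, a given Backup key is only processed by one worker at a time.
func setupWithConcurrency(mgr ctrl.Manager, r reconcile.Reconciler, workers int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&velerov1api.Backup{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: workers}).
		Complete(r)
}
```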
The retry is a good option for temporary APIServer outages (which was what originally motivated this). However, we still have a somewhat significant bug here if the APIServer is gone for an extended period of time. We end up with backups and restores (and yes, also DU/DD CRs) in an invalid state, listed as "InProgress" but never touched again as long as Velero stays up. The only workaround is to force-restart the Velero pod, which can cause ongoing operations to fail.
Yes, I agree we still have limitations if we only support the retry mechanism. But as listed above, the current change is not an ultimate or universal solution for the controllers.
If this change worked well in every case, we could include it. However, I think it adds a new path to the reconciler's state machine, which is not a good design pattern and will hinder future development. Specifically, whenever we add a new feature or change, we have to account for this new path, e.g., it may break when we support cancel, parallel backups, etc.
I don't think we need to go into the details of cancel or parallel backups because we don't have them for now. But generally speaking, with this change we must keep one more rule in mind: an InProgress CR cannot be reconciled again, otherwise it will be marked as failed. That limits our choices when we develop new features.
With the latest changes in https://github.com/openshift/velero/pull/324/files I believe we have addressed the ability to cancel through the use of backupTracker/restoreTracker.
Closing, as we prefer the simpler retry approach, which would not be as complicated for future reconciler work like requeue.
Signed-off-by: Tiger Kaovilai [email protected]
Thank you for contributing to Velero!
Does your change fix a particular issue?
Fixes #7207
Please indicate you've done the following:
/kind changelog-not-required as a comment on this pull request.