Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing #315

Merged

Conversation

kaovilai
Copy link
Member

@kaovilai kaovilai commented Jun 11, 2024

Signed-off-by: Tiger Kaovilai [email protected]

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #(issue)

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

Signed-off-by: Tiger Kaovilai <[email protected]>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <[email protected]>
@kaovilai kaovilai changed the title Mark InProgress backup/restore as failed upon requeuing oadp-1.3: Mark InProgress backup/restore as failed upon requeuing Jun 11, 2024
@kaovilai
Copy link
Member Author

kaovilai commented Jun 11, 2024

from prior feedback need to do finalizer (operations in 1.3) controller also

@kaovilai
Copy link
Member Author

from prior feedback need to do finalizer (operations in 1.3) controller also

For restore operations controller we're covered by this returning error:

return ctrl.Result{}, errors.Wrap(err, "error updating Restore")

backup operations controller:

return ctrl.Result{}, errors.Wrap(err, "error updating Backup")

@sseago
Copy link

sseago commented Jun 11, 2024

Yes, I think we're fine leaving the backup/restore_operations_controller returns alone, since those already requeue. I don't think we have restore finalizer in 1.3, so we just need backup, restore, and backup finalizer.

@kaovilai
Copy link
Member Author

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

Copy link

@sseago sseago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor comments on error messages and the change of variable name. The bigger issue is we need the change for the backup finalizer controller as well (no restore finalizer controller in Velero 1.12, so that's not a concern here).

log.Debug("Backup has in progress status from prior reconcile, marking it as failed")
failedCopy := original.DeepCopy()
failedCopy.Status.Phase = velerov1api.BackupPhaseFailed
failedCopy.Status.FailureReason = "Backup from previous reconcile still in progress"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to suggest an APIServer failure here.
"Backup from previous reconcile still in progress. The API Server may have been down."

@@ -249,7 +263,6 @@ func (b *backupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr
request.Status.Phase = velerov1api.BackupPhaseInProgress
request.Status.StartTimestamp = &metav1.Time{Time: b.clock.Now()}
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To minimize change from upstream since this is in our fork, lets not include whitespace changes like this.

log.Debug("Restore has in progress status from prior reconcile, marking it as failed")
failedCopy := original.DeepCopy()
failedCopy.Status.Phase = api.RestorePhaseFailed
failedCopy.Status.FailureReason = "Restore from previous reconcile still in progress"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to suggest an APIServer failure here.
"Restore from previous reconcile still in progress. The API Server may have been down."

@@ -162,8 +162,8 @@ func (r *restoreReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
// the controller.
log := r.logger.WithField("Restore", req.NamespacedName.String())

restore := &api.Restore{}
err := r.kbClient.Get(ctx, client.ObjectKey{Namespace: req.Namespace, Name: req.Name}, restore)
original := &api.Restore{}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the variable name change necessary here? This makes the diff larger and increases the possibility of rebase conflicts, since we're carrying this commit in our fork.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this so restore_controller and backup_controller has the same var name pattern that's all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary I agree, just make the logic more pastable across both.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaovilai I figured that's why you did that. I reverted that part here in a later commit since we're carrying the commit in our fork right now, and it removed a lot of lines from the diff, making rebase conflicts less likely. Does that seem reasonable to you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. That's reasonable

@sseago
Copy link

sseago commented Jun 11, 2024

If patch fails on finalizer controller we wanna retry patch as fail? It's kinda difficult there since it's currently using defer func.

I would have to break the patch call out of defer to return reconciler err on patch fail. Is that ok with you?

Hmm. I think we need to do this in some way. If we can't do it with defer, then we'll need to eliminate the defer call and include this with all return statements.

sseago added 3 commits June 12, 2024 11:52
Signed-off-by: Scott Seago <[email protected]>
Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.
@weshayutin weshayutin changed the title oadp-1.3: Mark InProgress backup/restore as failed upon requeuing oadp-1.3: OADP-4265 Mark InProgress backup/restore as failed upon requeuing Jun 12, 2024
@kaovilai
Copy link
Member Author

lgtm

@weshayutin
Copy link

ok.. testing update:

w/o the patch, the backup stayed in progress. While updating the dpa the velero server was restarted.

backup: westest-vsphere-apidown-1

status:
  completionTimestamp: "2024-06-17T17:25:58Z"
  expiration: "2024-07-17T17:12:21Z"
  failureReason: found a backup with status "InProgress" during the server starting,
    mark it as "Failed"

1.3.0 have to change the csv to get the test image on:

oc get csv oadp-operator.v1.3.0 -o yaml | grep sseago
                  value: quay.io/sseago/velero:1.3-requeue

Once patched, I initiated a second backup and then took down the api server for roughly 1 minute:

backup: westest-vsphere-apidown-2

status:
  expiration: "2024-07-17T17:38:01Z"
  failureReason: Backup from previous reconcile still in progress. The API Server
    may have been down.
  formatVersion: 1.1.0
  phase: Failed
  startTimestamp: "2024-06-17T17:38:01Z"
  version: 1

Copy link

@weshayutin weshayutin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2024
Copy link

@sseago sseago left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing to remove my prior "changes requested", but don't count it as an ack, since my own changes are in here as well.

@mateusoliveira43

This comment was marked as resolved.

@kaovilai
Copy link
Member Author

Just noting that this may never upstream based on comments at vmware-tanzu#7863 (comment)

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 17, 2024
Copy link

openshift-ci bot commented Jun 17, 2024

New changes are detected. LGTM label has been removed.

@kaovilai kaovilai force-pushed the requeue&fail-oadp-1.3 branch from 07bab34 to dd74100 Compare June 17, 2024 21:32
Makefile.prow Outdated Show resolved Hide resolved
@kaovilai kaovilai force-pushed the requeue&fail-oadp-1.3 branch from dd74100 to a51ef63 Compare June 17, 2024 23:24
@kaovilai
Copy link
Member Author

retest after #320

@kaovilai
Copy link
Member Author

/retest

@weshayutin weshayutin added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Jun 18, 2024
Copy link

openshift-ci bot commented Jun 18, 2024

@kaovilai: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link

openshift-ci bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: kaovilai, sseago, weshayutin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 4b5cf07 into openshift:oadp-1.3 Jun 18, 2024
3 checks passed
@kaovilai
Copy link
Member Author

follow on bugfix: #324

openshift-merge-bot bot pushed a commit that referenced this pull request Jul 30, 2024
…330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <[email protected]>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <[email protected]>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <[email protected]>

* error message update

Signed-off-by: Scott Seago <[email protected]>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <[email protected]>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <[email protected]>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <[email protected]>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* `make update`

Signed-off-by: Tiger Kaovilai <[email protected]>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>
sseago added a commit to sseago/velero that referenced this pull request Aug 1, 2024
…penshift#330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (openshift#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <[email protected]>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <[email protected]>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <[email protected]>

* error message update

Signed-off-by: Scott Seago <[email protected]>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (openshift#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <[email protected]>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <[email protected]>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <[email protected]>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* `make update`

Signed-off-by: Tiger Kaovilai <[email protected]>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>
sseago added a commit to sseago/velero that referenced this pull request Aug 22, 2024
…penshift#330)

* oadp-1.4: OADP-3227: Mark InProgress backup/restore as failed upon requeuing (openshift#315)

* Mark InProgress backup/restore as failed upon requeuing

Signed-off-by: Tiger Kaovilai <[email protected]>

remove uuid, return err to requeue instead of requeue: true

Signed-off-by: Tiger Kaovilai <[email protected]>

* cleanup to minimize diff from upstream

Signed-off-by: Scott Seago <[email protected]>

* error message update

Signed-off-by: Scott Seago <[email protected]>

* requeue on finalize status update.

Unlike the InProgress transition, there's no need to fail here,
since the Finalize steps can be repeated.

* Only run patch once for all backup finalizer return scenarios

Signed-off-by: Scott Seago <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>

* oadp-1.4: OADP-3227: Reconcile To Fail: Add backup/restore trackers (openshift#324)

* OADP-4265: Reconcile To Fail: Add backup/restore trackers

Signed-off-by: Tiger Kaovilai <[email protected]>

* Apply suggestions from code review: backupTracker

* Address restoreTracker feedback

Signed-off-by: Tiger Kaovilai <[email protected]>

* s/delete from/add to/ in the comment

* unit test fix

Signed-off-by: Tiger Kaovilai <[email protected]>

* backup_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* restore_controller unit test

Signed-off-by: Tiger Kaovilai <[email protected]>

* `make update`

Signed-off-by: Tiger Kaovilai <[email protected]>

* mock patch to fail failure due to connection refused

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>

* regenerate mocks

Signed-off-by: Tiger Kaovilai <[email protected]>

---------

Signed-off-by: Tiger Kaovilai <[email protected]>
Signed-off-by: Scott Seago <[email protected]>
Co-authored-by: Scott Seago <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants