
Fail the backup with restic-annotated volumes if the restic server is not up and running #4874

Closed
lintongj opened this issue Apr 28, 2022 · 4 comments · Fixed by #5319

Comments

@lintongj
Contributor

Describe the problem/challenge you have

If a backup includes restic-annotated volumes while the restic server is not up and running, the backup will hang for up to 4 hours by default.

The original issue was reported by a user of velero-plugin-for-vsphere, but there is nothing the vSphere plugin can do about it, as the behavior is general to Velero running on any cloud provider. We need help from the Velero core team.

Expectation: Velero can fail the backup fast when it includes restic-annotated volumes and the restic server is not up and running.

cc: @xing-yang @cormachogan

Describe the solution you'd like

N/A

Anything else you would like to add:

N/A

Environment:

  • Velero version (use velero version): v1.5.1
  • Kubernetes version (use kubectl version): unknown
  • Kubernetes installer & version: unknown
  • Cloud provider or hardware configuration: vSphere
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@ywk253100
Contributor

@lintongj I'm wondering why the backup contains restic-annotated volumes while the restic server isn't running. Is the restic server not installed, or did it get errors when starting? I just want to confirm whether this is caused by incorrect configuration.

@lintongj
Contributor Author

I'm wondering why the backup contains restic-annotated volumes while the restic server isn't running. Is the restic server not installed, or did it get errors when starting?

That could be a user error if the user is not very familiar with Velero.

In the case I reported, the restic server is not installed. But I believe that if the restic server is not installed well, i.e., it got errors when starting, it might end up with the same user experience.

Overall, I am wondering if Velero can do anything to improve the user experience in such a case.

@ywk253100
Contributor

Currently, Velero creates two separate CRs, PodVolumeBackup and PodVolumeRestore, to do the backup/restore with Restic, and watches the status changes of those CRs to decide the next step. So I don't think there is an easy way for Velero to improve in this case.
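
For readers unfamiliar with that flow, here is a minimal sketch of the CR shape and phase model being described, using simplified, hypothetical Go types rather than Velero's actual API package:

```go
// Package podvolume holds hypothetical, stripped-down stand-ins for the CRs
// mentioned above; the real definitions live in Velero's API packages and
// carry many more fields.
package podvolume

// PodVolumeBackupPhase is the set of states the restic-side controller is
// expected to move a PodVolumeBackup CR through.
type PodVolumeBackupPhase string

const (
	PhaseNew        PodVolumeBackupPhase = "New"
	PhaseInProgress PodVolumeBackupPhase = "InProgress"
	PhaseCompleted  PodVolumeBackupPhase = "Completed"
	PhaseFailed     PodVolumeBackupPhase = "Failed"
)

// PodVolumeBackup: the Velero server creates this CR and then only watches it;
// the status is supposed to be updated by the controller running in the restic
// daemonset pod on the same node as the backed-up pod.
type PodVolumeBackup struct {
	Name   string
	Node   string               // node where the backed-up pod runs
	Volume string               // volume annotated for restic backup
	Phase  PodVolumeBackupPhase // written only by the daemonset controller
}
```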

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 15, 2022

[Cause]
For PodVolumeBackup, Velero creates the PodVolumeBackup CR and then waits on the CR's status for the PodVolumeBackup controller to handle it.
When the controller handles the CR, it first sets the CR's status from New to InProgress; on completion, it sets the status to Completed or Failed.
The PodVolumeBackup controller runs inside the Restic server (daemonset). If the Restic server is not installed, nothing handles the CR, so Velero always waits until the backup's timeout, 4 hours by default.
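
To make the hang concrete, here is a minimal sketch of a wait-on-phase loop of the kind described above, using client-go's wait helper; getPhase and the CR name are hypothetical stand-ins, not Velero's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getPhase is a hypothetical stand-in for reading the PodVolumeBackup CR's
// status.phase from the API server.
func getPhase(name string) (string, error) {
	// With no restic daemonset controller running, the phase never changes.
	return "New", nil
}

// waitForPodVolumeBackup polls the CR's phase until it reaches a terminal
// state or the timeout expires; in the scenario described in this issue the
// timeout is 4 hours by default.
func waitForPodVolumeBackup(name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		phase, err := getPhase(name)
		if err != nil {
			return false, err
		}
		switch phase {
		case "Completed":
			return true, nil
		case "Failed":
			return false, errors.New("pod volume backup failed")
		default:
			// Still "New" or "InProgress": if no controller ever picks the CR
			// up, we just keep polling until the timeout.
			return false, nil
		}
	})
}

func main() {
	// Short timeout for demonstration; Velero's default here is 4 hours.
	if err := waitForPodVolumeBackup("example-pvb", 30*time.Second); err != nil {
		fmt.Println("backup did not complete:", err)
	}
}
```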

[Solution]
The Restic daemonset pods have a label, "name"="restic"; we will use this label to do a pre-flight check:

  • In the existing logic, a PodVolume backup is always associated with a pod, and the PodVolume backup for that pod must run on the same node.
  • We can therefore check whether there is a restic daemonset pod on that node.
  • If the restic daemonset pod is not present, the restic backup has no way to succeed, so we fail it immediately for that pod, and the overall backup status ends up as PartiallyFailed (see the sketch after this list).
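
A minimal sketch of that pre-flight check, assuming a client-go clientset and the "name"="restic" label mentioned above; the function name and namespace parameter are illustrative, not Velero's actual implementation:

```go
// Package preflight sketches the proposed check for a restic daemonset pod on
// a given node.
package preflight

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resticPodRunsOnNode checks whether a restic daemonset pod (selected by the
// "name=restic" label) is scheduled on the given node. If not, the
// PodVolumeBackup for a pod on that node can be failed immediately instead of
// waiting for the timeout.
func resticPodRunsOnNode(ctx context.Context, client kubernetes.Interface, veleroNamespace, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods(veleroNamespace).List(ctx, metav1.ListOptions{
		LabelSelector: "name=restic",
	})
	if err != nil {
		return false, fmt.Errorf("listing restic daemonset pods: %w", err)
	}
	for _, p := range pods.Items {
		if p.Spec.NodeName == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```

In practice such a check might also verify that the daemonset pod is in the Running phase, but existence on the node is the condition the proposal above describes.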

[More Info]

  • This solution solves the problem in cases where the Restic daemonset pod is missing on a related node, for example:
    • Restic is not installed, as in the current issue
    • The Restic daemonset pod is not scheduled on the node because of a node taint
  • PodVolume restore has the same problem; we will fix it in the same way.
  • This solution only checks the existence of restic daemonset pods; it doesn't check the health of the controllers running inside the pods. If they are not healthy, we still face the same problem (the backup/restore waits for 4 hours), but that is not in the scope of this fix.
