
Fail the backup with restic-annotated volumes if the restic server is not up and running #4874

Closed
lintongj opened this issue Apr 28, 2022 · 4 comments · Fixed by #5319

Comments

@lintongj
Contributor

Describe the problem/challenge you have

If a backup includes restic-annotated volumes while the restic server is not up and running, the backup will hang for up to 4 hours by default.

The original issue was reported by a user of velero-plugin-for-vsphere, but there is nothing the vSphere plugin can do about it, as the behavior is general to Velero running on any cloud provider. We need help from the Velero core team.

Expectation: Velero can fail the backup fast when it includes restic-annotated volumes and the restic server is not up and running.

cc: @xing-yang @cormachogan

Describe the solution you'd like

N/A

Anything else you would like to add:

N/A

Environment:

  • Velero version (use velero version): v1.5.1
  • Kubernetes version (use kubectl version): unknown
  • Kubernetes installer & version: unknown
  • Cloud provider or hardware configuration: vSphere
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "The project would be better with this feature added"
  • 👎 for "This feature will not enhance the project in a meaningful way"
@ywk253100
Contributor

@lintongj I'm wondering why the backup contains restic-annotated volumes while the restic server isn't running. Is the restic server not installed, or did it get errors when starting? I just want to confirm whether this is caused by incorrect configuration.

@lintongj
Contributor Author

I'm wondering why the backup contains restic-annotated volumes while the restic server isn't running. Is the restic server not installed, or did it get errors when starting?

That could be a user error if the user is not very familiar with Velero.

In the case I reported, the restic server is not installed. But I believe that if the restic server is not installed well, i.e., it got errors when starting, it might end up with the same user experience.

Overall, I am wondering if Velero can do anything to improve the user experience in such a case.

@ywk253100
Contributor

Currently, Velero creates two separate CRs, PodVolumeBackup and PodVolumeRestore, to do the backup/restore with Restic, and watches the status changes of those CRs to decide the next step. So I don't think there is an easy way for Velero to improve in this case.
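
For readers unfamiliar with that flow, here is a minimal sketch of the CR shape and phase model being described, using simplified, hypothetical Go types rather than Velero's actual API package:

```go
// Package podvolume holds hypothetical, stripped-down stand-ins for the CRs
// mentioned above; the real definitions live in Velero's API packages and
// carry many more fields.
package podvolume

// PodVolumeBackupPhase is the set of states the restic-side controller is
// expected to move a PodVolumeBackup CR through.
type PodVolumeBackupPhase string

const (
	PhaseNew        PodVolumeBackupPhase = "New"
	PhaseInProgress PodVolumeBackupPhase = "InProgress"
	PhaseCompleted  PodVolumeBackupPhase = "Completed"
	PhaseFailed     PodVolumeBackupPhase = "Failed"
)

// PodVolumeBackup: the Velero server creates this CR and then only watches it;
// the status is supposed to be updated by the controller running in the restic
// daemonset pod on the same node as the backed-up pod.
type PodVolumeBackup struct {
	Name   string
	Node   string               // node where the backed-up pod runs
	Volume string               // volume annotated for restic backup
	Phase  PodVolumeBackupPhase // written only by the daemonset controller
}
```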

@Lyndon-Li
Contributor

Lyndon-Li commented Jun 15, 2022

[Cause]
For PodVolumeBackup, Velero creates the PodVolumeBackup CR and then waits on the CR's status for the PodVolumeBackup controller to handle it.
When the controller handles the CR, it first sets the CR's status from New to InProgress; on completion, it sets the status to Completed or Failed.
The PodVolumeBackup controller runs inside the Restic server (daemonset). If the Restic server is not installed, nothing handles the CR, so Velero always waits until the backup's timeout, 4 hours by default.
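
To make the hang concrete, here is a minimal sketch of a wait-on-phase loop of the kind described above, using client-go's wait helper; getPhase and the CR name are hypothetical stand-ins, not Velero's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// getPhase is a hypothetical stand-in for reading the PodVolumeBackup CR's
// status.phase from the API server.
func getPhase(name string) (string, error) {
	// With no restic daemonset controller running, the phase never changes.
	return "New", nil
}

// waitForPodVolumeBackup polls the CR's phase until it reaches a terminal
// state or the timeout expires; in the scenario described in this issue the
// timeout is 4 hours by default.
func waitForPodVolumeBackup(name string, timeout time.Duration) error {
	return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
		phase, err := getPhase(name)
		if err != nil {
			return false, err
		}
		switch phase {
		case "Completed":
			return true, nil
		case "Failed":
			return false, errors.New("pod volume backup failed")
		default:
			// Still "New" or "InProgress": if no controller ever picks the CR
			// up, we just keep polling until the timeout.
			return false, nil
		}
	})
}

func main() {
	// Short timeout for demonstration; Velero's default here is 4 hours.
	if err := waitForPodVolumeBackup("example-pvb", 30*time.Second); err != nil {
		fmt.Println("backup did not complete:", err)
	}
}
```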

[Solution]
The Restic daemonset pods have a label, "name"="restic"; we will use this label to do a pre-flight check:

  • In the existing logic, a PodVolume backup is always associated with a pod, and the PodVolume backup for that pod must run on the same node.
  • We can therefore check whether there is a restic daemonset pod on that node.
  • If the restic daemonset pod is not present, the restic backup has no way to succeed, so we fail it immediately for that pod, and the overall backup status ends up as PartiallyFailed (see the sketch after this list).
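
A minimal sketch of that pre-flight check, assuming a client-go clientset and the "name"="restic" label mentioned above; the function name and namespace parameter are illustrative, not Velero's actual implementation:

```go
// Package preflight sketches the proposed check for a restic daemonset pod on
// a given node.
package preflight

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resticPodRunsOnNode checks whether a restic daemonset pod (selected by the
// "name=restic" label) is scheduled on the given node. If not, the
// PodVolumeBackup for a pod on that node can be failed immediately instead of
// waiting for the timeout.
func resticPodRunsOnNode(ctx context.Context, client kubernetes.Interface, veleroNamespace, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods(veleroNamespace).List(ctx, metav1.ListOptions{
		LabelSelector: "name=restic",
	})
	if err != nil {
		return false, fmt.Errorf("listing restic daemonset pods: %w", err)
	}
	for _, p := range pods.Items {
		if p.Spec.NodeName == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```

In practice such a check might also verify that the daemonset pod is in the Running phase, but existence on the node is the condition the proposal above describes.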

[More Info]

  • This solution solves the problem in cases where the Restic daemonset pod is missing on a related node, for example:
    • Restic is not installed, as in the current issue
    • The Restic daemonset pod is not scheduled on the node because of a node taint
  • PodVolume restore has the same problem; we will fix it in the same way.
  • This solution only checks the existence of restic daemonset pods; it doesn't check the health of the controllers running inside the pods. If they are not healthy, we still face the same problem (the backup/restore waits for 4 hours), but that is not in the scope of this fix.
