What steps did you take and what happened:
No different from a normal backup and restore process using Restic with the --default-volumes-to-restic flag turned on.
Began backup using velero backup create backup-test-3 --storage-location gcp --default-volumes-to-restic --include-namespaces database-portal-env-dev -n velero-system.
Waited for backup to complete and verified that volumes were being backed up correctly.
a. Would like a way to exclude volumes globally, because there's no reason to back up all the istio-envoy volumes on every pod. This is unrelated to this issue, of course.
Backup completed and I deleted the namespace using kubectl delete namespace database-portal-env-dev.
Began restore using velero create restore restore-test-6 --from-backup backup-test-1 --include-namespaces database-portal-env-dev -n velero-system.
Waited for the restore to complete, but it seemed to be stuck in "InProgress".
Inspected the PodVolumeRestore resources, and all seemed to be without status.
Inspected the restic-wait init container logs and found many informational messages saying the restore flag file was not found.
What did you expect to happen:
While I expect some Not found: /restores/istio-envoy/.velero/18ff2bf4-eef8-4f46-b821-9da2a0d623da messages to appear during the restore process, the restore instead times out for some unknown reason. No volumes are restored, and the restored pods never start because they're all held by the restic-wait container.
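For context, here's my rough mental model of what the restic-wait init container is doing while it holds the pod (a sketch only, not Velero's actual helper source; the volume name and restore UID are the ones from the log message above):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForRestore polls for the "done" marker file that Velero writes once a
// volume's data has been restored. Until that file exists, the init container
// keeps waiting, so the pod's real containers never start.
func waitForRestore(volume, restoreUID string) {
	marker := filepath.Join("/restores", volume, ".velero", restoreUID)
	for {
		if _, err := os.Stat(marker); err == nil {
			fmt.Printf("found %s, volume %s restored\n", marker, volume)
			return
		}
		// This is the kind of informational message showing up in our logs.
		fmt.Printf("Not found: %s\n", marker)
		time.Sleep(time.Second)
	}
}

func main() {
	waitForRestore("istio-envoy", "18ff2bf4-eef8-4f46-b821-9da2a0d623da")
}
```

If the marker files are never written because the controller never starts the restore, this loop simply spins until the restore times out, which matches what we're seeing.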
The output of the following commands will help us better understand what's going on:
kubectl logs deployment/velero -n velero
I'd like to avoid providing this for fear of revealing sensitive information about what's running in our cluster. I can provide specifics if needed. There was nothing with level > info.
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml link
velero backup logs <backupname> link
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml link
velero restore logs <restorename>
Restore is still in progress.
My theory
I believe this issue reveals itself in clusters subject to webhook mutations that inject InitContainers into pod definitions when Velero creates the pod during a restore. The code Velero uses to detect the start of the restic-wait container explicitly checks for it at InitContainer index 0. If a mutating webhook inserts a container before that point, the check fails and the restore process never begins.
In the case of our cluster, Istio injection with CNI enabled inserts an istio-validation init container before the restic-wait container, so it holds up all of our restores. istio code where init container is added
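To make the failure mode concrete, here's a minimal sketch of the index-0 assumption as I understand it (illustrative only, not the actual Velero source; the container names are what I see on our pods after injection):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const resticInitContainerName = "restic-wait"

// strictCheck mirrors a "restic-wait must be the first init container" assumption.
func strictCheck(pod *corev1.Pod) bool {
	return len(pod.Spec.InitContainers) > 0 &&
		pod.Spec.InitContainers[0].Name == resticInitContainerName
}

func main() {
	// What our pods look like after Istio CNI injection: istio-validation
	// has been inserted ahead of the restic-wait container Velero added.
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			InitContainers: []corev1.Container{
				{Name: "istio-validation"},
				{Name: resticInitContainerName},
			},
		},
	}
	fmt.Println(strictCheck(pod)) // false, so the PodVolumeRestore never progresses
}
```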
Proposed fix
I propose that we relax the requirements in the pod volume restore controller to allow InitContainers before the restic-wait container, on the understanding that any preceding InitContainers are placed there at the user's own risk. We would allow this but print a warning message to the logs informing the user that this ordering could lead to failed restores.
I'm open to any and all opinions and options, though. In the meantime I'm going to put up a pull request to do what I describe: check for the restic-wait container at any InitContainer position instead of expecting it to be at index 0.
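As a rough illustration of what that PR would do (my own naming, not the controller's real code), the lookup becomes a scan over all init containers, with a warning when restic-wait isn't first:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const resticInitContainerName = "restic-wait"

// resticInitContainerIndex returns the position of the restic-wait init
// container, or -1 if it is missing entirely.
func resticInitContainerIndex(pod *corev1.Pod) int {
	for i, c := range pod.Spec.InitContainers {
		if c.Name == resticInitContainerName {
			return i
		}
	}
	return -1
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			InitContainers: []corev1.Container{
				{Name: "istio-validation"}, // injected by a mutating webhook
				{Name: resticInitContainerName},
			},
		},
	}

	switch idx := resticInitContainerIndex(pod); {
	case idx < 0:
		fmt.Println("restic-wait init container not found; skipping pod")
	case idx > 0:
		fmt.Printf("warning: restic-wait is at init container index %d; containers running before it may delay or break the restore\n", idx)
	default:
		fmt.Println("restic-wait is the first init container")
	}
}
```

The restore would then proceed whenever restic-wait is present at any index, with the warning covering the "preceding containers at your own risk" caveat.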
Environment:
Kubernetes installer & version:
We're using Anthos on-premise and the gkectl client that comes with it.
Cloud provider or hardware configuration:
This is running in a VMware virtualized environment.
OS (e.g. from /etc/os-release):
Nodes are on Ubuntu server.
Vote on this issue!
This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.
👍 for "I would like to see this bug fixed as soon as possible"
👎 for "There are more important bugs to focus on right now"