
Restoration fails in clusters with injected InitContainers because of order. #3008

Closed
doughepi opened this issue Oct 14, 2020 · 4 comments · Fixed by #3011

doughepi commented Oct 14, 2020

What steps did you take and what happened:
No different from a normal backup and restore process using Restic with the --default-volumes-to-restic flag turned on.

  1. Began backup using velero backup create backup-test-3 --storage-location gcp --default-volumes-to-restic --include-namespaces database-portal-env-dev -n velero-system.
  2. Waited for backup to complete and verified that volumes were being backed up correctly.
    a. I would like a way to exclude volumes globally, because there's no reason to back up the istio-envoy volume on every pod. This is unrelated to this issue, of course.
  3. The backup completed and I deleted the namespace using kubectl delete namespace database-portal-env-dev.
  4. Attempted a restore using velero create restore restore-test-6 --from-backup backup-test-1 --include-namespaces database-portal-env-dev -n velero-system.
  5. Waited for the restore to complete, but it seemed to be stuck in "InProgress".
  6. Inspected the PodVolumeRestore resources, and all seemed to be without status.
  7. Inspected the restic-wait init container logs and found many informational messages saying that the restore flag file was not found.

What did you expect to happen:
While I expect that some Not found: /restores/istio-envoy/.velero/18ff2bf4-eef8-4f46-b821-9da2a0d623da messages would appear during the restore process, the restore instead times out for some unknown reason. No volumes are restored, and the restored pods never start because they are all held up by the restic-wait init container.

Init Containers:
  istio-validation:
    Container ID:  docker://36eb8ab83d4ea0ce1b5e1a238ecf07bdb26c9d8602868a46788f66c48086ccf0
    Image:         docker.io/istio/proxyv2:1.7.2
    Image ID:      docker-pullable://istio/proxyv2@sha256:26b91ecc8be9cd171acbee0f8b99e99f5b1a0fb3290fe54ad3705830c10a8ff5
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 13 Oct 2020 22:17:47 -0500
      Finished:     Tue, 13 Oct 2020 22:17:48 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     10m
      memory:  10Mi
    Environment:
      DNS_AGENT:  
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from database-portal-env-dev-service-account-token-m7vqs (ro)
  restic-wait:
    Container ID:  docker://ed177704313657571c398816f85d89a835cb48a50f8a5f1d11a7a07757244a3f
    Image:         velero/velero-restic-restore-helper:v1.5.1
    Image ID:      docker-pullable://velero/velero-restic-restore-helper@sha256:a448e8ec2764742282ea5fa254660cfb164e84fe846fd26d9496c413b27ca528
    Port:          <none>
    Host Port:     <none>
    Command:
      /velero-restic-restore-helper
    Args:
      18ff2bf4-eef8-4f46-b821-9da2a0d623da
    State:          Running
      Started:      Tue, 13 Oct 2020 22:17:48 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:
      POD_NAMESPACE:  database-portal-env-dev (v1:metadata.namespace)
      POD_NAME:       blog-deployment-69bd55b7fb-gg6zg (v1:metadata.name)
    Mounts:
      /restores/blog-volume from blog-volume (rw)
      /restores/istio-envoy from istio-envoy (rw)
      /restores/istio-podinfo from istio-podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from database-portal-env-dev-service-account-token-m7vqs (ro)
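
For context on the "Not found" messages above: the restic-wait helper takes the restore UID as its single argument (see Args) and appears to poll each volume mounted under /restores for a marker file named after that UID, exiting only once every volume has one. A minimal sketch of such a wait loop, matching the paths seen in the logs; this is illustrative only, not the actual velero-restic-restore-helper source:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForRestores polls every volume mounted under baseDir (e.g. /restores)
// for a marker file named after the restore UID. Because this runs as an
// init container, the pod's app containers cannot start until it exits,
// which is why an unfinished restore holds the whole pod.
// NOTE: illustrative sketch, not the real helper code.
func waitForRestores(baseDir, restoreUID string, poll time.Duration) error {
	for {
		volumes, err := os.ReadDir(baseDir)
		if err != nil {
			return err
		}

		done := true
		for _, vol := range volumes {
			marker := filepath.Join(baseDir, vol.Name(), ".velero", restoreUID)
			if _, err := os.Stat(marker); os.IsNotExist(err) {
				fmt.Printf("Not found: %s\n", marker) // the messages seen in the logs
				done = false
			}
		}
		if done {
			return nil
		}
		time.Sleep(poll)
	}
}

func main() {
	// The restore UID is passed as the single argument (see Args above).
	if err := waitForRestores("/restores", os.Args[1], time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```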

The output of the following commands will help us better understand what's going on:

  • kubectl logs deployment/velero -n velero
    I'd like to avoid providing this for fear of revealing sensitive information about what's running in our cluster. I can provide specifics if needed. There was nothing with level > info.

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
    link

  • velero backup logs <backupname>
    link

  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
    link

  • velero restore logs <restorename>
    Restore is still in progress.

My theory
I believe this issue shows up in clusters where mutating webhooks inject InitContainers into pod definitions as Velero creates the pods during a restore. The code that Velero uses to detect the start of the restic-wait container explicitly checks for the restic-wait container at InitContainer index 0. If a mutating webhook inserts a container ahead of that position, the check fails and the restore process never begins.

In the case of our cluster, Istio injection with CNI enabled places an istio-validation init container before the restic-wait container, so it holds up all of our restores. (See the Istio code where the init container is added.)
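
To make the failure mode concrete, the check described above behaves roughly like the sketch below. The function name and details are illustrative, not copied from the Velero source; the point is that only index 0 is ever inspected, so an injected istio-validation container makes the check fail indefinitely.

```go
package restore

import (
	corev1 "k8s.io/api/core/v1"
)

// hasResticWaitStarted mimics the index-0 check described above
// (illustrative only, not the actual Velero code).
func hasResticWaitStarted(pod *corev1.Pod) bool {
	// Only the first init container is considered, so a webhook-injected
	// container such as istio-validation at index 0 makes this return
	// false forever, and the pod's PodVolumeRestores are never processed.
	if len(pod.Spec.InitContainers) == 0 || pod.Spec.InitContainers[0].Name != "restic-wait" {
		return false
	}
	return len(pod.Status.InitContainerStatuses) > 0 &&
		pod.Status.InitContainerStatuses[0].State.Running != nil
}
```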

Proposed fix
I propose that we relax the requirements in the pod volume restore controller to allow InitContainers before the restic-wait container, with the understanding that any InitContainers placed ahead of it run at the user's own risk. When that happens, we should also print a warning to the logs to inform the user that this ordering could lead to failed restores.

I'm open to any and all opinions and options, though. In the meantime I'm going to put up a pull request that does what I describe: check for the restic-wait container at any InitContainer position instead of expecting it to be at index 0.
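
Roughly, the change I have in mind looks like the sketch below (illustrative names, not the exact code in the PR): find restic-wait at whatever position it ends up in, log a warning if something was injected ahead of it, and check the container status at the matching index.

```go
package restore

import (
	"log"

	corev1 "k8s.io/api/core/v1"
)

// resticWaitIndex returns the position of the restic-wait init container,
// or -1 if it is missing (illustrative sketch of the proposed change).
func resticWaitIndex(pod *corev1.Pod) int {
	for i, c := range pod.Spec.InitContainers {
		if c.Name == "restic-wait" {
			return i
		}
	}
	return -1
}

// hasResticWaitStartedRelaxed accepts restic-wait at any position, but warns
// when other init containers were injected ahead of it, since those run
// before the volume data has been restored.
func hasResticWaitStartedRelaxed(pod *corev1.Pod) bool {
	idx := resticWaitIndex(pod)
	if idx < 0 {
		return false
	}
	if idx > 0 {
		log.Printf("restic-wait is not the first init container in pod %s/%s; "+
			"init containers ahead of it run before volumes are restored",
			pod.Namespace, pod.Name)
	}
	return len(pod.Status.InitContainerStatuses) > idx &&
		pod.Status.InitContainerStatuses[idx].State.Running != nil
}
```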

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.5.1
        Git commit: 87d86a45a6ca66c6c942c7c7f08352e26809426c
Server:
        Version: v1.5.1
  • Velero features (use velero client config get features):
Not set
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:47:41Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-gke.6", GitCommit:"f97d54712d01b555515656671a2cf171b4f47bb8", GitTreeState:"clean", BuildDate:"2020-03-25T20:11:05Z", GoVersion:"go1.13.8b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:
    We're using Anthos on-premises and the gkectl client that comes with it.

  • Cloud provider or hardware configuration:
    This is running in a VMware virtualized environment.

  • OS (e.g. from /etc/os-release):
    Nodes are running Ubuntu Server.

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@doughepi

Adding @skriss as the original author of the linked code, and because you're probably more aware of the implications of my proposed fix.

doughepi commented Oct 16, 2020

Side note: while my pull request #3011 fixes this issue, it reveals an additional issue, #2997.

@carlisia

This issue seems to have enough info for investigation + a PR open for review. Issues with PRs will always move faster, so expect some input soon.

@carlisia

@skriss since you're no longer on the project, feel free to ignore. We'll ping you if needed.
