
Restoration fails in clusters with injected InitContainers because of order. #3008

Closed
doughepi opened this issue Oct 14, 2020 · 4 comments · Fixed by #3011

doughepi commented Oct 14, 2020

What steps did you take and what happened:
No different from a normal backup and restore process using Restic with the --default-volumes-to-restic flag turned on.

  1. Began backup using velero backup create backup-test-3 --storage-location gcp --default-volumes-to-restic --include-namespaces database-portal-env-dev -n velero-system.
  2. Waited for backup to complete and verified that volumes were being backed up correctly.
    a. I would like a way to exclude volumes globally, because there's no reason to back up the istio-envoy volume on every pod. This is unrelated to this issue, of course.
  3. The backup completed and I deleted the namespace using kubectl delete namespace database-portal-env-dev.
  4. Attempted a restore using velero create restore restore-test-6 --from-backup backup-test-1 --include-namespaces database-portal-env-dev -n velero-system.
  5. Waited for the restore to complete, but it seemed to be stuck in "InProgress".
  6. Inspected the PodVolumeRestore resources, and all seemed to be without status.
  7. Inspected the restic-wait init container logs and found many informational messages saying that the restore flag file was not found.

What did you expect to happen:
While I expect that some Not found: /restores/istio-envoy/.velero/18ff2bf4-eef8-4f46-b821-9da2a0d623da messages would appear during the restore process, the restore instead times out for some unknown reason. No volumes are restored, and the restored pods never start because they are all held up by the restic-wait init container.

Init Containers:
  istio-validation:
    Container ID:  docker://36eb8ab83d4ea0ce1b5e1a238ecf07bdb26c9d8602868a46788f66c48086ccf0
    Image:         docker.io/istio/proxyv2:1.7.2
    Image ID:      docker-pullable://istio/proxyv2@sha256:26b91ecc8be9cd171acbee0f8b99e99f5b1a0fb3290fe54ad3705830c10a8ff5
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 13 Oct 2020 22:17:47 -0500
      Finished:     Tue, 13 Oct 2020 22:17:48 -0500
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     10m
      memory:  10Mi
    Environment:
      DNS_AGENT:  
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from database-portal-env-dev-service-account-token-m7vqs (ro)
  restic-wait:
    Container ID:  docker://ed177704313657571c398816f85d89a835cb48a50f8a5f1d11a7a07757244a3f
    Image:         velero/velero-restic-restore-helper:v1.5.1
    Image ID:      docker-pullable://velero/velero-restic-restore-helper@sha256:a448e8ec2764742282ea5fa254660cfb164e84fe846fd26d9496c413b27ca528
    Port:          <none>
    Host Port:     <none>
    Command:
      /velero-restic-restore-helper
    Args:
      18ff2bf4-eef8-4f46-b821-9da2a0d623da
    State:          Running
      Started:      Tue, 13 Oct 2020 22:17:48 -0500
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  128Mi
    Environment:
      POD_NAMESPACE:  database-portal-env-dev (v1:metadata.namespace)
      POD_NAME:       blog-deployment-69bd55b7fb-gg6zg (v1:metadata.name)
    Mounts:
      /restores/blog-volume from blog-volume (rw)
      /restores/istio-envoy from istio-envoy (rw)
      /restores/istio-podinfo from istio-podinfo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from database-portal-env-dev-service-account-token-m7vqs (ro)
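
For context on the "Not found" messages above: the restic-wait helper takes the restore UID as its single argument (see Args) and appears to poll each volume mounted under /restores for a marker file named after that UID, exiting only once every volume has one. A minimal sketch of such a wait loop, matching the paths seen in the logs; this is illustrative only, not the actual velero-restic-restore-helper source:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// waitForRestores polls every volume mounted under baseDir (e.g. /restores)
// for a marker file named after the restore UID. Because this runs as an
// init container, the pod's app containers cannot start until it exits,
// which is why an unfinished restore holds the whole pod.
// NOTE: illustrative sketch, not the real helper code.
func waitForRestores(baseDir, restoreUID string, poll time.Duration) error {
	for {
		volumes, err := os.ReadDir(baseDir)
		if err != nil {
			return err
		}

		done := true
		for _, vol := range volumes {
			marker := filepath.Join(baseDir, vol.Name(), ".velero", restoreUID)
			if _, err := os.Stat(marker); os.IsNotExist(err) {
				fmt.Printf("Not found: %s\n", marker) // the messages seen in the logs
				done = false
			}
		}
		if done {
			return nil
		}
		time.Sleep(poll)
	}
}

func main() {
	// The restore UID is passed as the single argument (see Args above).
	if err := waitForRestores("/restores", os.Args[1], time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```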

The output of the following commands will help us better understand what's going on:

  • kubectl logs deployment/velero -n velero
    I'd like to avoid providing this for fear of revealing sensitive information about what's running in our cluster. I can provide specifics if needed. There was nothing with level > info.

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
    link

  • velero backup logs <backupname>
    link

  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
    link

  • velero restore logs <restorename>
    Restore is still in progress.

My theory
I believe this issue shows up in clusters where mutating webhooks inject InitContainers into pod definitions as Velero creates the pods during a restore. The code that Velero uses to detect the start of the restic-wait container explicitly checks for the restic-wait container at InitContainer index 0. If a mutating webhook inserts a container ahead of that position, the check fails and the restore process never begins.

In the case of our cluster, Istio injection with CNI enabled places an istio-validation init container before the restic-wait container, so it holds up all of our restores. (See the Istio code where the init container is added.)
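
To make the failure mode concrete, the check described above behaves roughly like the sketch below. The function name and details are illustrative, not copied from the Velero source; the point is that only index 0 is ever inspected, so an injected istio-validation container makes the check fail indefinitely.

```go
package restore

import (
	corev1 "k8s.io/api/core/v1"
)

// hasResticWaitStarted mimics the index-0 check described above
// (illustrative only, not the actual Velero code).
func hasResticWaitStarted(pod *corev1.Pod) bool {
	// Only the first init container is considered, so a webhook-injected
	// container such as istio-validation at index 0 makes this return
	// false forever, and the pod's PodVolumeRestores are never processed.
	if len(pod.Spec.InitContainers) == 0 || pod.Spec.InitContainers[0].Name != "restic-wait" {
		return false
	}
	return len(pod.Status.InitContainerStatuses) > 0 &&
		pod.Status.InitContainerStatuses[0].State.Running != nil
}
```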

Proposed fix
I propose that we relax the requirements in the pod volume restore controller to allow InitContainers before the restic-wait container, with the understanding that any InitContainers placed ahead of it run at the user's own risk. When that happens, we should also print a warning to the logs to inform the user that this ordering could lead to failed restores.

I'm open to any and all opinions and options, though. In the meantime I'm going to put up a pull request that does what I describe: check for the restic-wait container at any InitContainer position instead of expecting it to be at index 0.
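
Roughly, the change I have in mind looks like the sketch below (illustrative names, not the exact code in the PR): find restic-wait at whatever position it ends up in, log a warning if something was injected ahead of it, and check the container status at the matching index.

```go
package restore

import (
	"log"

	corev1 "k8s.io/api/core/v1"
)

// resticWaitIndex returns the position of the restic-wait init container,
// or -1 if it is missing (illustrative sketch of the proposed change).
func resticWaitIndex(pod *corev1.Pod) int {
	for i, c := range pod.Spec.InitContainers {
		if c.Name == "restic-wait" {
			return i
		}
	}
	return -1
}

// hasResticWaitStartedRelaxed accepts restic-wait at any position, but warns
// when other init containers were injected ahead of it, since those run
// before the volume data has been restored.
func hasResticWaitStartedRelaxed(pod *corev1.Pod) bool {
	idx := resticWaitIndex(pod)
	if idx < 0 {
		return false
	}
	if idx > 0 {
		log.Printf("restic-wait is not the first init container in pod %s/%s; "+
			"init containers ahead of it run before volumes are restored",
			pod.Namespace, pod.Name)
	}
	return len(pod.Status.InitContainerStatuses) > idx &&
		pod.Status.InitContainerStatuses[idx].State.Running != nil
}
```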

Environment:

  • Velero version (use velero version):
Client:
        Version: v1.5.1
        Git commit: 87d86a45a6ca66c6c942c7c7f08352e26809426c
Server:
        Version: v1.5.1
  • Velero features (use velero client config get features):
Not set
  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-26T03:47:41Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.8-gke.6", GitCommit:"f97d54712d01b555515656671a2cf171b4f47bb8", GitTreeState:"clean", BuildDate:"2020-03-25T20:11:05Z", GoVersion:"go1.13.8b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version:
    We're using Anthos on-premises and the gkectl client that comes with it.

  • Cloud provider or hardware configuration:
    This is running in a VMware virtualized environment.

  • OS (e.g. from /etc/os-release):
    Nodes are running Ubuntu Server.

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@doughepi

Adding @skriss as the original author of the linked code, and because you're probably more aware of the implications of my proposed fix.

doughepi commented Oct 16, 2020

Side note: while my pull request #3011 fixes this issue, it reveals an additional issue, #2997.

@carlisia

This issue seems to have enough info for investigation + a PR open for review. Issues with PRs will always move faster, so expect some input soon.

@carlisia

@skriss since you're no longer on the project, feel free to ignore. We'll ping you if needed.
