
capi-controller-manager crashes after a velero backup/restore #4105

Closed
yastij opened this issue Jan 22, 2021 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@yastij
Member

yastij commented Jan 22, 2021

What steps did you take and what happened:

After doing a velero backup/restore of a cluster, the capi-controller-manager crashes with the following logs:

k logs -n capi-system capi-controller-manager-86f86fb9df-j8zdq manager
I1222 08:01:26.006431       1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"="127.0.0.1:8080"
I1222 08:01:26.016420       1 main.go:181] setup "msg"="starting manager"  "version"=""
I1222 08:01:26.019379       1 internal.go:356] controller-runtime/manager "msg"="starting metrics server"  "path"="/metrics"
I1222 08:01:26.019675       1 leaderelection.go:242] attempting to acquire leader lease  capi-system/controller-leader-election-capi...
I1222 08:01:43.257226       1 leaderelection.go:252] successfully acquired lease capi-system/controller-leader-election-capi
I1222 08:01:43.295187       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.300667       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.300776       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.301038       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"unhealthyConditions":null},"status":{}}}
I1222 08:01:43.301726       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machine" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.302082       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterSelector":{}},"status":{}}}
I1222 08:01:43.302834       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.303042       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourcesetbinding" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{}}}
I1222 08:01:43.311870       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machine" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.502237       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourcesetbinding" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.503332       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="cluster"
I1222 08:01:43.503504       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machine"
I1222 08:01:43.503704       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.503857       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="cluster"
I1222 08:01:43.504082       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.504140       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I1222 08:01:43.504648       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.504726       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.504793       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.504879       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machineset"
I1222 08:01:43.590813       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.590870       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.591000       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.591028       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machinedeployment"
I1222 08:01:43.591155       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machinedeployment" "worker count"=10
I1222 08:01:43.591232       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.591292       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.591333       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machinehealthcheck"
I1222 08:01:43.591358       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machinehealthcheck" "worker count"=10
I1222 08:01:43.651666       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="clusterresourcesetbinding"
I1222 08:01:43.651775       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="cluster" "worker count"=10
I1222 08:01:43.652655       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I1222 08:01:43.667694       1 machinehealthcheck_controller.go:108] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="mgmtrestore" "namespace"="tkg-system"
I1222 08:01:43.671472       1 machinehealthcheck_controller.go:108] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="restworker" "namespace"="default"
I1222 08:01:43.952431       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="cluster" "worker count"=10
I1222 08:01:43.960046       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machine" "worker count"=10
I1222 08:01:44.118595       1 tracker.go:55] controllers/Cluster "msg"="Adding watcher on external object" "cluster"="mgmtrestore" "namespace"="tkg-system" "GroupVersionKind"="infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster"
I1222 08:01:44.118706       1 controller.go:132] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha3","kind":"VSphereCluster"}}
I1222 08:01:44.228848       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="clusterresourcesetbinding" "worker count"=10
E1222 08:01:44.232128       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 608 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x163e040, 0x26dfda0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x163e040, 0x26dfda0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api/exp/addons/controllers.(*ClusterResourceSetBindingReconciler).Reconcile(0xc000392060, 0xc0007a2299, 0x7, 0xc0007a2280, 0xa, 0xc0008bc1c8, 0x1a4b9c0, 0xc0008bc1b0, 0xc000971cd8)
	/workspace/exp/addons/controllers/clusterresourcesetbinding_controller.go:83 +0x499
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00014a580, 0x1694e60, 0xc0008b2e00, 0xc0007aab00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00014a580, 0x0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00014a580)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000b25250)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b25250, 0x3b9aca00, 0x0, 0x1, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000b25250, 0x3b9aca00, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0x468
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x14cf8b9]

goroutine 608 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x105
panic(0x163e040, 0x26dfda0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api/exp/addons/controllers.(*ClusterResourceSetBindingReconciler).Reconcile(0xc000392060, 0xc0007a2299, 0x7, 0xc0007a2280, 0xa, 0xc0008bc1c8, 0x1a4b9c0, 0xc0008bc1b0, 0xc000971cd8)
	/workspace/exp/addons/controllers/clusterresourcesetbinding_controller.go:83 +0x499
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00014a580, 0x1694e60, 0xc0008b2e00, 0xc0007aab00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00014a580, 0x0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00014a580)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000b25250)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b25250, 0x3b9aca00, 0x0, 0x1, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000b25250, 0x3b9aca00, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0x468

This happens here, at exp/addons/controllers/clusterresourcesetbinding_controller.go:83 (see the stack trace above):

cluster, err := util.GetOwnerCluster(ctx, r.Client, binding.ObjectMeta)
if err != nil && !apierrors.IsNotFound(err) {
	return ctrl.Result{}, err
}
// If the owner cluster is already deleted or in deletion process, delete its ClusterResourceSetBinding
if apierrors.IsNotFound(err) || !cluster.DeletionTimestamp.IsZero() {

where GetOwnerCluster returns nil, nil when no Cluster ownerRef is found, leaving cluster nil so that the cluster.DeletionTimestamp check above dereferences a nil pointer.
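
For illustration, a minimal sketch of what a more tolerant reconciler could look like (this builds on the snippet above and is only a suggestion, not the project's actual fix; the requeue interval is arbitrary and the time package would need to be imported):

cluster, err := util.GetOwnerCluster(ctx, r.Client, binding.ObjectMeta)
if err != nil && !apierrors.IsNotFound(err) {
	return ctrl.Result{}, err
}
// GetOwnerCluster returns (nil, nil) when the binding carries no Cluster ownerRef,
// which is exactly the state velero leaves things in mid-restore.
if err == nil && cluster == nil {
	// Requeue instead of dereferencing a nil cluster below; the owner Cluster
	// should appear once the restore completes.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// If the owner cluster is already deleted or in deletion process, delete its ClusterResourceSetBinding
if apierrors.IsNotFound(err) || !cluster.DeletionTimestamp.IsZero() {

Requeueing rather than erroring keeps the existing IsNotFound branch intact while avoiding the nil pointer dereference during a restore.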

What did you expect to happen:

Anything else you would like to add:

Environment:

  • Cluster-api version: commit a74685ee93cd453a435e4f999526c606948bbf73 (this is valid for the main branch too)
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2021
@Lirt

Lirt commented Feb 2, 2021

Just a note for anybody taking this: Velero cannot restore the .Status field (vmware-tanzu/velero#1687).

@fabriziopandini
Member

My understanding of the issue is that:

  • By default, velero restores ClusterResourceSetBindings before Clusters.
  • This creates an unexpected condition (a ClusterResourceSetBinding not owned by any Cluster), and if a reconcile happens while the restore is still in progress, the ClusterResourceSetBinding controller panics.

A first workaround for the issue is to instruct velero to restore Clusters before any other CAP* object.

A longer-term solution is to make the ClusterResourceSetBinding controller tolerant of this condition, assuming that it is temporary and everything will be fixed by velero as soon as the restore completes.

@yastij @nrb please chime in; most specifically, I have some doubts about the assumption in the last sentence.

@vincepri
Member

vincepri commented Feb 3, 2021

To add to what @fabriziopandini mentioned above, all .Status fields should be recreatable on first reconciliation.

@vincepri
Member

vincepri commented Feb 4, 2021

/milestone Next
/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Feb 4, 2021
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 4, 2021
@nrb
Contributor

nrb commented Feb 4, 2021

@fabriziopandini Your understanding matches mine. My expectation would be that the ClusterResourceSetBinding controller would at least retry looking for a Cluster ownerRef before failing completely, and that the failure wouldn't be a panic.

The source of the issue on the Velero side is that Velero has no knowledge of operator object graphs beyond some built-in Kubernetes resources. Instead, Velero restores API groups in alphabetical order, which does not always match the object graph. We do have an argument on the Velero server, --restore-resource-priorities, that puts selected API groups at the top of the restore list and can resolve the alphabetical-ordering issue.

We're looking at re-architecting the logic to allow for object graphs, but that likely will be many releases out.

To @vincepri's point, yes, Velero never restores the .Status fields, expecting that operators/controllers can reconcile them.

@vincepri
Member

vincepri commented Feb 4, 2021

@nrb Do you want us to keep this issue open, or track it on Velero's side?

@vincepri
Member

vincepri commented Feb 9, 2021

Closing this for now, given that it seems something that should be solved from Velero side. Feel free to reopen if necessary.

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing this for now, given that it seems something that should be solved from Velero side. Feel free to reopen if necessary.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nrb
Contributor

nrb commented Feb 9, 2021

@vincepri Are users of CAPI expected to only use CAPI tools to manipulate clusters? Because this would be an issue with kubectl apply, too. I think a panic on an empty field is still a CAPI controller bug, especially when submitting Kube resources to the API server is not limited to a single client.

Velero can work around it and restore in the proper order, but this is not something we're going to be able to do for every operator/controller. We're not going to be able to keep track of operators as they change, either.
