
capi-controller-manager crashes after a velero backup/restore #4105

Closed
yastij opened this issue Jan 22, 2021 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@yastij
Member

yastij commented Jan 22, 2021

What steps did you take and what happened:

After doing a velero backup/restore of a cluster, the capi-controller-manager crashes with the following logs:

k logs -n capi-system capi-controller-manager-86f86fb9df-j8zdq manager
I1222 08:01:26.006431       1 listener.go:44] controller-runtime/metrics "msg"="metrics server is starting to listen"  "addr"="127.0.0.1:8080"
I1222 08:01:26.016420       1 main.go:181] setup "msg"="starting manager"  "version"=""
I1222 08:01:26.019379       1 internal.go:356] controller-runtime/manager "msg"="starting metrics server"  "path"="/metrics"
I1222 08:01:26.019675       1 leaderelection.go:242] attempting to acquire leader lease  capi-system/controller-leader-election-capi...
I1222 08:01:43.257226       1 leaderelection.go:252] successfully acquired lease capi-system/controller-leader-election-capi
I1222 08:01:43.295187       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.300667       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.300776       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.301038       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"unhealthyConditions":null},"status":{}}}
I1222 08:01:43.301726       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machine" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.302082       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterSelector":{}},"status":{}}}
I1222 08:01:43.302834       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.303042       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourcesetbinding" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{}}}
I1222 08:01:43.311870       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machine" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.502237       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourcesetbinding" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.503332       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="cluster"
I1222 08:01:43.503504       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machine"
I1222 08:01:43.503704       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.503857       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="cluster"
I1222 08:01:43.504082       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.504140       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I1222 08:01:43.504648       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.504726       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.504793       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machineset" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.504879       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machineset"
I1222 08:01:43.590813       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.590870       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","selector":{},"template":{"metadata":{},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}}}},"status":{}}}
I1222 08:01:43.591000       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinedeployment" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.591028       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machinedeployment"
I1222 08:01:43.591155       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machinedeployment" "worker count"=10
I1222 08:01:43.591232       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"clusterName":"","bootstrap":{},"infrastructureRef":{}},"status":{"bootstrapReady":false,"infrastructureReady":false}}}
I1222 08:01:43.591292       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="machinehealthcheck" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"controlPlaneEndpoint":{"host":"","port":0}},"status":{"infrastructureReady":false,"controlPlaneInitialized":false}}}
I1222 08:01:43.591333       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="machinehealthcheck"
I1222 08:01:43.591358       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machinehealthcheck" "worker count"=10
I1222 08:01:43.651666       1 controller.go:162] controller-runtime/controller "msg"="Starting Controller" "controller"="clusterresourcesetbinding"
I1222 08:01:43.651775       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="cluster" "worker count"=10
I1222 08:01:43.652655       1 controller.go:155] controller-runtime/controller "msg"="Starting EventSource" "controller"="clusterresourceset" "source"={"Type":{"metadata":{"creationTimestamp":null}}}
I1222 08:01:43.667694       1 machinehealthcheck_controller.go:108] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="mgmtrestore" "namespace"="tkg-system"
I1222 08:01:43.671472       1 machinehealthcheck_controller.go:108] controllers/MachineHealthCheck "msg"="Reconciling" "machinehealthcheck"="restworker" "namespace"="default"
I1222 08:01:43.952431       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="cluster" "worker count"=10
I1222 08:01:43.960046       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="machine" "worker count"=10
I1222 08:01:44.118595       1 tracker.go:55] controllers/Cluster "msg"="Adding watcher on external object" "cluster"="mgmtrestore" "namespace"="tkg-system" "GroupVersionKind"="infrastructure.cluster.x-k8s.io/v1alpha3, Kind=VSphereCluster"
I1222 08:01:44.118706       1 controller.go:132] controller-runtime/controller "msg"="Starting EventSource" "controller"="cluster" "source"={"Type":{"apiVersion":"infrastructure.cluster.x-k8s.io/v1alpha3","kind":"VSphereCluster"}}
I1222 08:01:44.228848       1 controller.go:189] controller-runtime/controller "msg"="Starting workers" "controller"="clusterresourcesetbinding" "worker count"=10
E1222 08:01:44.232128       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 608 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x163e040, 0x26dfda0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:48 +0x82
panic(0x163e040, 0x26dfda0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api/exp/addons/controllers.(*ClusterResourceSetBindingReconciler).Reconcile(0xc000392060, 0xc0007a2299, 0x7, 0xc0007a2280, 0xa, 0xc0008bc1c8, 0x1a4b9c0, 0xc0008bc1b0, 0xc000971cd8)
	/workspace/exp/addons/controllers/clusterresourcesetbinding_controller.go:83 +0x499
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00014a580, 0x1694e60, 0xc0008b2e00, 0xc0007aab00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00014a580, 0x0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00014a580)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000b25250)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b25250, 0x3b9aca00, 0x0, 0x1, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000b25250, 0x3b9aca00, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0x468
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x14cf8b9]

goroutine 608 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:55 +0x105
panic(0x163e040, 0x26dfda0)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
sigs.k8s.io/cluster-api/exp/addons/controllers.(*ClusterResourceSetBindingReconciler).Reconcile(0xc000392060, 0xc0007a2299, 0x7, 0xc0007a2280, 0xa, 0xc0008bc1c8, 0x1a4b9c0, 0xc0008bc1b0, 0xc000971cd8)
	/workspace/exp/addons/controllers/clusterresourcesetbinding_controller.go:83 +0x499
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00014a580, 0x1694e60, 0xc0008b2e00, 0xc0007aab00)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:255 +0x162
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00014a580, 0x0)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0xcb
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00014a580)
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:210 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000b25250)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000b25250, 0x3b9aca00, 0x0, 0x1, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc000b25250, 0x3b9aca00, 0xc0001320c0)
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0x468

This happens here, at exp/addons/controllers/clusterresourcesetbinding_controller.go:83 (see the stack trace above):

cluster, err := util.GetOwnerCluster(ctx, r.Client, binding.ObjectMeta)
if err != nil && !apierrors.IsNotFound(err) {
	return ctrl.Result{}, err
}
// If the owner cluster is already deleted or in deletion process, delete its ClusterResourceSetBinding
if apierrors.IsNotFound(err) || !cluster.DeletionTimestamp.IsZero() {

where GetOwnerCluster returns nil, nil when no Cluster ownerRef is found, leaving cluster nil so that the cluster.DeletionTimestamp check above dereferences a nil pointer.
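
For illustration, a minimal sketch of what a more tolerant reconciler could look like (this builds on the snippet above and is only a suggestion, not the project's actual fix; the requeue interval is arbitrary and the time package would need to be imported):

cluster, err := util.GetOwnerCluster(ctx, r.Client, binding.ObjectMeta)
if err != nil && !apierrors.IsNotFound(err) {
	return ctrl.Result{}, err
}
// GetOwnerCluster returns (nil, nil) when the binding carries no Cluster ownerRef,
// which is exactly the state velero leaves things in mid-restore.
if err == nil && cluster == nil {
	// Requeue instead of dereferencing a nil cluster below; the owner Cluster
	// should appear once the restore completes.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// If the owner cluster is already deleted or in deletion process, delete its ClusterResourceSetBinding
if apierrors.IsNotFound(err) || !cluster.DeletionTimestamp.IsZero() {

Requeueing rather than erroring keeps the existing IsNotFound branch intact while avoiding the nil pointer dereference during a restore.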

What did you expect to happen:

Anything else you would like to add:

Environment:

  • Cluster-api version: commit a74685ee93cd453a435e4f999526c606948bbf73 (this is valid for the main branch too)
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2021
@Lirt

Lirt commented Feb 2, 2021

Just a note for anybody taking this: Velero cannot restore the .Status field (vmware-tanzu/velero#1687).

@fabriziopandini
Member

My understanding of the issue is that:

  • By default, velero restores ClusterResourceSetBindings before Clusters.
  • This creates an unexpected condition (a ClusterResourceSetBinding not owned by any Cluster), and if a reconcile happens while the restore is still in progress, the ClusterResourceSetBinding controller panics.

A first workaround for the issue is to instruct velero to restore Clusters before any other CAP* object.

A longer-term solution is to make the ClusterResourceSetBinding controller tolerant of this condition, assuming that it is temporary and everything will be fixed by velero as soon as the restore completes.

@yastij @nrb please chime in; most specifically, I have some doubts about the assumption in the last sentence.

@vincepri
Member

vincepri commented Feb 3, 2021

To add to what @fabriziopandini mentioned above, all .Status fields should be recreatable on first reconciliation.

@vincepri
Member

vincepri commented Feb 4, 2021

/milestone Next
/priority awaiting-more-evidence

@k8s-ci-robot k8s-ci-robot added this to the Next milestone Feb 4, 2021
@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Feb 4, 2021
@nrb
Contributor

nrb commented Feb 4, 2021

@fabriziopandini Your understanding matches mine. My expectation would be that the ClusterResourceSetBinding controller would at least retry looking for a Cluster ownerRef before failing completely, and that the failure wouldn't be a panic.

The source of the issue on the Velero side is that Velero has no knowledge of operator object graphs beyond some built-in Kubernetes resources. Instead, Velero restores API groups in alphabetical order, which does not always match the object graph. We do have an argument on the Velero server, --restore-resource-priorities, that puts selected API groups at the top of the restore list and can resolve the alphabetical-ordering issue.

We're looking at re-architecting the logic to allow for object graphs, but that likely will be many releases out.

To @vincepri's point, yes, Velero never restores the .Status fields, expecting that operators/controllers can reconcile them.

@vincepri
Member

vincepri commented Feb 4, 2021

@nrb Do you want us to keep this issue open, or track it on Velero's side?

@vincepri
Member

vincepri commented Feb 9, 2021

Closing this for now, given that it seems something that should be solved from Velero side. Feel free to reopen if necessary.

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Closing this for now, given that it seems something that should be solved from Velero side. Feel free to reopen if necessary.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nrb
Contributor

nrb commented Feb 9, 2021

@vincepri Are users of CAPI expected to only use CAPI tools to manipulate clusters? Because this would be an issue with kubectl apply, too. I think a panic on an empty field is still a CAPI controller bug, especially when submitting Kube resources to the API server is not limited to a single client.

Velero can work around it and restore in the proper order, but this is not something we're going to be able to do for every operator/controller. We're not going to be able to keep track of operators as they change, either.
