calico-kube-controllers pod stuck in not Ready for 13 min #3751

Status: Closed · hakman opened this issue Jul 6, 2020 · 22 comments

hakman (Contributor) commented Jul 6, 2020

In a Kubernetes cluster created with Kops, replacing the master node(s) puts the calico-kube-controllers pod in a not Ready state.
It recovers on its own after about 13 minutes, which is quite slow.
Deleting the pod creates a new one that becomes ready instantly.
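For anyone hitting this before a proper fix lands, the manual workaround is a one-liner; this sketch assumes the standard manifest label k8s-app=calico-kube-controllers:

$ kubectl -n kube-system delete pod -l k8s-app=calico-kube-controllers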

Expected Behavior

calico-kube-controllers should recover much faster than 13 min.

Current Behavior

calico-kube-controllers waits 13 min to recover.

Possible Solution

The simplest generic fix would be to add a liveness probe that automatically restarts the pod.
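As a rough sketch of that idea, a probe like the one below could be patched onto the deployment. It assumes the image ships /usr/bin/check-status with a -l (liveness) flag, as later Calico manifests do, so verify the exact command and thresholds against your Calico version:

$ kubectl -n kube-system patch deployment calico-kube-controllers --patch '
spec:
  template:
    spec:
      containers:
      - name: calico-kube-controllers
        livenessProbe:
          exec:
            command: ["/usr/bin/check-status", "-l"]
          initialDelaySeconds: 10
          periodSeconds: 10
          failureThreshold: 6'

With periodSeconds: 10 and failureThreshold: 6, the kubelet would restart the container after roughly a minute of failed checks instead of waiting 13 minutes for the stale connection to time out.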

Steps to Reproduce (for bugs)

  1. Create a simple Kubernetes cluster using Kops v1.17.1, with --networking=calico.
     The getting-started guide provides the detailed steps: https://kops.sigs.k8s.io/getting_started/aws/.
  2. Build the cluster:
     $ kops update cluster --yes
  3. Validate the cluster:
     $ kops validate cluster --wait 15m
  4. Replace the master node:
     $ kops rolling-update cluster --yes --cloudonly --instance-group master-a --force
  5. Wait for a new master to be created and check the status of the calico-kube-controllers pod:
     $ kubectl logs -f -n kube-system calico-kube-controllers-76bd59c54c-57j6r
2020-07-06 04:38:41.397 [INFO][1] main.go 88: Loaded configuration from environment config=&config.Config{LogLevel:"info", ReconcilerPeriod:"5m", CompactionPeriod:"10m", EnabledControllers:"node", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", HealthEnabled:true, SyncNodeLabels:true, DatastoreType:"kubernetes"}
W0706 04:38:41.398065       1 client_config.go:541] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2020-07-06 04:38:41.398 [INFO][1] main.go 109: Ensuring Calico datastore is initialized
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 89: Start called
2020-07-06 04:38:41.409 [INFO][1] main.go 183: Starting status report routine
2020-07-06 04:38:41.409 [INFO][1] main.go 368: Starting controller ControllerType="Node"
2020-07-06 04:38:41.409 [INFO][1] node_controller.go 130: Starting Node controller
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2020-07-06 04:38:41.409 [INFO][1] node_syncer.go 39: Node controller syncer status updated: wait-for-ready
2020-07-06 04:38:41.409 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2020-07-06 04:38:41.416 [INFO][1] watchercache.go 291: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2020-07-06 04:38:41.416 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2020-07-06 04:38:41.417 [INFO][1] node_syncer.go 39: Node controller syncer status updated: resync
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2020-07-06 04:38:41.417 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2020-07-06 04:38:41.417 [INFO][1] node_syncer.go 39: Node controller syncer status updated: in-sync
2020-07-06 04:38:41.509 [INFO][1] node_controller.go 143: Node controller is now running
2020-07-06 04:38:41.509 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 04:38:41.541 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2020-07-06 04:39:31.537 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 04:39:31.580 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-4-56-126.eu-west-1.compute.internal) with error: nodes "ip-10-4-56-126.eu-west-1.compute.internal" not found
2020-07-06 04:39:31.581 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-4-56-126.eu-west-1.compute.internal" k8sNode=""
2020-07-06 04:39:31.586 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-4-56-126.eu-west-1.compute.internal" k8sNode=""
2020-07-06 04:39:31.586 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.cbee49f2bf9d3ae7c4633561ccab65a8a2390d1e47b39b8b1dc572e47e6261ea'
2020-07-06 04:39:31.603 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.603 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.29600955ba32b51269bf6e9db403a3abb4d854e0f57e8958e038a82f7021a596'
2020-07-06 04:39:31.618 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.618 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.afe9203ddad2fecf4a883fb3a778f8ebd2a9174ba6c0bc291914435ec6c0054d'
2020-07-06 04:39:31.634 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.128/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.634 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-4-56-126.eu-west-1.compute.internal'
2020-07-06 04:39:31.649 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.649 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.6330af27335563bada4a42b904340abde31da0fdb8b619339e295e3102ef1ddc'
2020-07-06 04:39:31.665 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.665 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.b7d1145f4e32b7fac254f5a38b332ff1304addb7928240109f9473d4bac7e9e1'
2020-07-06 04:39:31.680 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.109.77.64/26 host="ip-10-4-56-126.eu-west-1.compute.internal"
2020-07-06 04:39:31.736 [INFO][1] ipam.go 190: Node and IPAM data is in sync
2020-07-06 09:28:18.489 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:28:18.489 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:28:50.489 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:10.489 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:10.489 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
2020-07-06 09:29:42.490 [ERROR][1] main.go 234: Failed to reach apiserver error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
[... the same client.go 255, main.go 203, and main.go 234 "context deadline exceeded" errors repeat every 30-50 seconds until 09:45:18 ...]
W0706 09:45:31.822278       1 reflector.go:299] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:96: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host") has prevented the request from succeeding
2020-07-06 09:45:31.822 [ERROR][1] client.go 255: Error getting cluster information config ClusterInformation="default" error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host
2020-07-06 09:45:31.822 [ERROR][1] main.go 203: Failed to verify datastore error=Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: read tcp 100.106.28.199:53668->100.64.0.1:443: read: no route to host
2020-07-06 09:45:32.826 [INFO][1] ipam.go 45: Synchronizing IPAM data
2020-07-06 09:45:32.844 [INFO][1] ipam.go 281: Calico Node referenced in IPAM data does not exist error=resource does not exist: Node(ip-10-4-54-58.eu-west-1.compute.internal) with error: nodes "ip-10-4-54-58.eu-west-1.compute.internal" not found
2020-07-06 09:45:32.844 [INFO][1] ipam.go 137: Checking node calicoNode="ip-10-4-54-58.eu-west-1.compute.internal" k8sNode=""
2020-07-06 09:45:32.849 [INFO][1] ipam.go 177: Cleaning up IPAM resources for deleted node calicoNode="ip-10-4-54-58.eu-west-1.compute.internal" k8sNode=""
2020-07-06 09:45:32.849 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'ipip-tunnel-addr-ip-10-4-54-58.eu-west-1.compute.internal'
2020-07-06 09:45:32.864 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.864 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.bf8da93299c06bedff227fc91f4ef3e6193776c90f49fdd67ff23c0cbc8b582b'
2020-07-06 09:45:32.879 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.879 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.c6e7449430362a2b07b5a86fcb65302610b647e5f74e8770d50108d60bc2aa33'
2020-07-06 09:45:32.897 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.897 [INFO][1] ipam.go 1166: Releasing all IPs with handle 'k8s-pod-network.dc2c3a527c56bdd6cdd40436d679faa72ed066c3ad2e2f0c1dc4fa712b88d4c9'
2020-07-06 09:45:32.912 [INFO][1] ipam.go 1480: Node doesn't exist, no need to release affinity cidr=100.106.118.0/26 host="ip-10-4-54-58.eu-west-1.compute.internal"
2020-07-06 09:45:32.944 [INFO][1] ipam.go 190: Node and IPAM data is in sync
^C

$ kubectl describe pod calico-kube-controllers-76bd59c54c-57j6r -n kube-system | grep Events: -A 10
Events:
  Type     Reason     Age                 From                                               Message
  ----     ------     ----                ----                                               -------
  Warning  Unhealthy  37m                 kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error reaching apiserver: taking a long time to check apiserver; Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded
  Warning  Unhealthy  30m (x24 over 41m)  kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded with http status code: 0
  Warning  Unhealthy  25m (x48 over 41m)  kubelet, ip-10-4-57-46.eu-west-1.compute.internal  Readiness probe failed: Error verifying datastore: Get https://100.64.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver

Context

Kops validates the cluster based on the status of the kube-system pods. This issue prevents the cluster from being upgraded without manual intervention, and it also slows the upgrade down.

Your Environment

  • Calico version: 3.13.4
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes 1.17.8
  • Operating System and version: Ubuntu 20.04
  • Link to your project (optional): https://github.com/kubernetes/kops
hakman (Contributor, Author) commented Jul 6, 2020

CC: @lwr20

fasaxc (Member) commented Jul 6, 2020

@hakman do you happen to use a sticky service for the API server?

hakman (Contributor, Author) commented Jul 6, 2020

This is how the service looks on that cluster @fasaxc:

% kubectl describe service kubernetes

Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                100.64.0.1
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         10.4.39.68:443
Session Affinity:  None
Events:            <none>

fasaxc (Member) commented Jul 6, 2020

> 10.4.39.68:443

Only one endpoint? Shouldn't it have one per control-plane node?

hakman (Contributor, Author) commented Jul 6, 2020

This is a cluster with a single master.
The same thing happens in a cluster with 3 masters; in that case, all 3 would be in the list.
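For reference, the endpoint count is easy to check with stock kubectl; with 3 masters, the kubernetes service should list three addresses:

$ kubectl -n default get endpoints kubernetes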

fasaxc (Member) commented Jul 9, 2020

I think another user has identified the root cause here: https://github.com/projectcalico/libcalico-go/issues/1267

hakman (Contributor, Author) commented Jul 9, 2020

Nice! Thanks for the update.

caseydavenport (Member) commented

I'm going to close this since we're tracking the root cause in https://github.com/projectcalico/libcalico-go/issues/1267

alok87 commented Jan 10, 2022

This link does not work: https://github.com/projectcalico/libcalico-go/issues/1267

How should we go about fixing this, @hakman?
This happened to us whenever our master got replaced.

MattLangsenkamp commented

I am also not able to see the link.

lwr20 (Member) commented Feb 22, 2022

> I'm going to close this since we're tracking the root cause in https://github.com/projectcalico/libcalico-go/issues/1267

@caseydavenport since the move to monorepo, what's the new link to this issue?

lwr20 (Member) commented Feb 22, 2022

FWIW, this PR claims to fix https://github.com/projectcalico/libcalico-go/issues/1267:
projectcalico/libcalico-go#1356, so it should be in recent versions of Calico.

caseydavenport (Member) commented

Yep, the underlying issue was fixed in Calico v3.18 and should be fixed in subsequent releases as well.

I can't seem to find the original GH issue link since it was migrated, but that was the fix.

RsheikhAii3 commented

> This is a cluster with a single master. Happens similarly in a cluster with 3 masters. In that case, all 3 would be in the list.

> Yep, the underlying issue was fixed in Calico v3.18 and should be fixed in subsequent releases as well.
>
> I can't seem to find the original GH issue link since it was migrated, but that was the fix.

I have searched in vain for that issue link, since https://github.com/projectcalico/libcalico-go/issues/1267 moved to https://github.com/projectcalico/calico/issues (there is no issue 1267 there). In desperation, I'm reaching out to see if you remember what the fix was ...

hakman (Contributor, Author) commented Jun 29, 2022

kubernetes/client-go#374 was the actual root cause, fixed by kubernetes/kubernetes#95981.
On the Calico side the fix was to update kubernetes/client-go to v1.20.0+.
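So, to tell whether a given cluster already runs a fixed release (v3.18 or later), one way is to check the controller image tag; a minimal check with stock kubectl:

$ kubectl -n kube-system get deployment calico-kube-controllers \
    -o jsonpath='{.spec.template.spec.containers[0].image}'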

RsheikhAii3 commented

> kubernetes/client-go#374 was the actual root cause, fixed by kubernetes/kubernetes#95981. On the Calico side the fix was to update kubernetes/client-go to v1.20.0+.

Thank you for your reply. I debated posting this; I'm a bit embarrassed since I am new to k8s and unsure whether this would waste your time. I am experiencing the following:

... [ERROR][1] client.go 272: Error getting cluster information config ClusterInformation="default" error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded
2022-06-29 01:11:48.762 [FATAL][1] main.go 124: Failed to initialize Calico datastore error=Get "https://10.96.0.1:443/apis/crd.projectcalico.org/v1/clusterinformations/default": context deadline exceeded

1 master and 1 worker, Ubuntu on AWS via kubeadm
Client Version: v1.23.1
Server Version: v1.23.1
GoVersion: go1.17.5 (client & master)
Calico version: 3.23.1
calico-kube-controllers is running on the worker node

I upgraded from 1.22.1-00 to 1.23.1-00 and experienced this right after (I have since stopped the instances twice), and I have been researching a probable fix for the last few days, without any success.

kube-system calico-kube-controllers-685b65ddf9-pnqwp 0/1 CrashLoopBackOff 13 (4m47s ago) 48m
I have tried to edit the timeout values on the readiness and liveness probes from 1 to 60, to no avail.

I am sure you won't have time, but is there a forum you could point me to for further research?

Thank you in advance.

fasaxc (Member) commented Jun 29, 2022

@RsheikhAii3 I don't think your problem is related to this issue, but I'm sure we'll be able to help you on our Slack: https://slack.projectcalico.org/

RsheikhAii3 commented
@fasaxc Much appreciated, sir; you are indeed correct, and I will pursue it on the Slack channel. Just for documentation purposes: on further searching, the issue in the calico-node logs was "address already in use, could not bind", on both the master and the worker.

Thank you to all of you. I appreciate your time and knowledge in guiding newbies.

joeybdub commented
Any update on the issue?

hakman (Contributor, Author) commented Oct 13, 2022

@joeybdub As mentioned before, this IS fixed. If you are seeing a similar issue, it's just something that looks similar, nothing more.
It would be best to create a new issue or try via Slack, as you may get help faster. There are some really cool and helpful people there. 😉

joeybdub commented
Thanks @hakman, there is already an issue for the problem I'm experiencing: Azure/AKS#2745

hakman (Contributor, Author) commented Oct 13, 2022

@joeybdub The AKS issue seems unrelated. Your best bet is still Slack, where there may be someone more familiar with AKS who can help. Good luck!
