
Cluster is not created normally due to a delay in HAProxy configuration in CAPV v0.6.2 #870

Closed
moonek opened this issue Mar 25, 2020 · 5 comments

moonek commented Mar 25, 2020

/kind bug

What steps did you take and what happened:

In most attempts, only the HAProxy and control plane nodes are created, not the worker nodes.

NAME                                                                     AGE
haproxyloadbalancer.infrastructure.cluster.x-k8s.io/vsphere-quickstart   24m

NAME                                                               PROVIDERID                                       PHASE
machine.cluster.x-k8s.io/vsphere-quickstart-gbr84                  vsphere://42191e50-0a4f-1dd0-b03c-a23d14ef90f5   Provisioning
machine.cluster.x-k8s.io/vsphere-quickstart-md-0-76675f574-5n9k6                                                    Pending

Tracing shows that communication with the controlPlaneEndpoint (HAProxy) timed out during kubeadm init (timeout: 4m0s).

After some time, communication with the controlPlaneEndpoint succeeds, but CoreDNS and kube-proxy are never installed because kubeadm init has already failed.

This bug does not occur in CAPV v0.6.1; it only appears in v0.6.2.
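
To see when the HAProxy configuration actually catches up, a simple check from the management cluster is to inspect the HAProxyLoadBalancer object and its events (a rough sketch, not part of the original reproduction; the resource name is taken from the output above):

# management cluster: look for late status/backend updates on the load balancer object
kubectl describe haproxyloadbalancer.infrastructure.cluster.x-k8s.io/vsphere-quickstart
kubectl get events --sort-by=.metadata.creationTimestamp | grep -i haproxy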

What did you expect to happen:
kubeadm init should succeed.

Anything else you would like to add:

This is the control plane node log where kubeadm init failed.

Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: W0325 07:56:03.136945    1793 validation.go:28] Cannot validate kube-proxy config - no validator is available
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: W0325 07:56:03.136956    1793 validation.go:28] Cannot validate kubelet config - no validator is available
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [init] Using Kubernetes version: v1.17.3
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [preflight] Running pre-flight checks
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [preflight] Pulling images required for setting up a Kubernetes cluster
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [preflight] This might take a minute or two, depending on the speed of your internet connection
Mar 25 07:56:03 vsphere-quickstart-gbr84 cloud-init: [preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
Mar 25 07:56:04 vsphere-quickstart-gbr84 cloud-init: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
Mar 25 07:56:04 vsphere-quickstart-gbr84 cloud-init: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
Mar 25 07:56:04 vsphere-quickstart-gbr84 cloud-init: [kubelet-start] Starting the kubelet
Mar 25 07:56:04 vsphere-quickstart-gbr84 cloud-init: [certs] Using certificateDir folder "/etc/kubernetes/pki"
Mar 25 07:56:04 vsphere-quickstart-gbr84 cloud-init: [certs] Using existing ca certificate authority
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "apiserver" certificate and key
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] apiserver serving cert is signed for DNS names [vsphere-quickstart-gbr84 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.60.31.227 10.60.31.238]
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "apiserver-kubelet-client" certificate and key
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] Using existing front-proxy-ca certificate authority
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "front-proxy-client" certificate and key
Mar 25 07:56:05 vsphere-quickstart-gbr84 cloud-init: [certs] Using existing etcd/ca certificate authority
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "etcd/server" certificate and key
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] etcd/server serving cert is signed for DNS names [vsphere-quickstart-gbr84 localhost] and IPs [10.60.31.227 127.0.0.1 ::1]
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "etcd/peer" certificate and key
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] etcd/peer serving cert is signed for DNS names [vsphere-quickstart-gbr84 localhost] and IPs [10.60.31.227 127.0.0.1 ::1]
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "etcd/healthcheck-client" certificate and key
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] Generating "apiserver-etcd-client" certificate and key
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [certs] Using the existing "sa" key
Mar 25 07:53:07 vsphere-quickstart-gbr84 cloud-init: [kubeconfig] Using kubeconfig folder "/etc/kubernetes"
Mar 25 07:53:08 vsphere-quickstart-gbr84 cloud-init: [kubeconfig] Writing "admin.conf" kubeconfig file
Mar 25 07:53:08 vsphere-quickstart-gbr84 cloud-init: [kubeconfig] Writing "kubelet.conf" kubeconfig file
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [kubeconfig] Writing "controller-manager.conf" kubeconfig file
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [kubeconfig] Writing "scheduler.conf" kubeconfig file
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [control-plane] Using manifest folder "/etc/kubernetes/manifests"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [control-plane] Creating static Pod manifest for "kube-apiserver"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [control-plane] Creating static Pod manifest for "kube-controller-manager"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: W0325 07:53:09.267906    1793 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [control-plane] Creating static Pod manifest for "kube-scheduler"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: W0325 07:53:09.269338    1793 manifests.go:214] the default kube-apiserver authorization-mode is "Node,RBAC"; using "Node,RBAC"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
Mar 25 07:53:09 vsphere-quickstart-gbr84 cloud-init: [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
Mar 25 07:53:49 vsphere-quickstart-gbr84 cloud-init: [kubelet-check] Initial timeout of 40s passed.
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: Unfortunately, an error has occurred:
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: timed out waiting for the condition
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: This error is likely caused by:
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - The kubelet is not running
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - 'systemctl status kubelet'
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - 'journalctl -xeu kubelet'
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: Additionally, a control plane component may have crashed or exited when started by the container runtime.
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: Here is one example how you may list all Kubernetes containers running in docker:
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - 'docker ps -a | grep kube | grep -v pause'
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: Once you have found the failing container, you can inspect its logs with:
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: - 'docker logs CONTAINERID'
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: To see the stack trace of this error execute with --v=5 or higher
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: 2020-03-25 07:57:09,287 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: 2020-03-25 07:57:09,289 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: 2020-03-25 07:57:09,290 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
Mar 25 07:57:09 vsphere-quickstart-gbr84 cloud-init: Cloud-init v. 18.5 finished at Wed, 25 Mar 2020 07:57:09 +0000. Datasource DataSourceVMwareGuestInfo.  Up 250.98 seconds

Communication failures from the kubelet to the controlPlaneEndpoint keep being logged even after kubeadm init has finished.

Mar 25 07:57:57 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:57.927013   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://10.60.31.238:6443/api/v1/nodes?fieldSelector=metadata.name%3Dvsphere-quickstart-gbr84&limit=500&resourceVersion=0: EOF
Mar 25 07:57:57 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:57.927888   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get https://10.60.31.238:6443/api/v1/services?limit=500&resourceVersion=0: EOF
Mar 25 07:57:57 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:57.929063   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://10.60.31.238:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dvsphere-quickstart-gbr84&limit=500&resourceVersion=0: EOF
Mar 25 07:57:58 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:58.928522   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://10.60.31.238:6443/api/v1/nodes?fieldSelector=metadata.name%3Dvsphere-quickstart-gbr84&limit=500&resourceVersion=0: EOF
Mar 25 07:57:58 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:58.928814   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/kubelet.go:449: Failed to list *v1.Service: Get https://10.60.31.238:6443/api/v1/services?limit=500&resourceVersion=0: EOF
Mar 25 07:57:58 vsphere-quickstart-gbr84 kubelet[11900]: E0325 07:57:58.930204   11900 reflector.go:153] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://10.60.31.238:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dvsphere-quickstart-gbr84&limit=500&resourceVersion=0: EOF

I have confirmed that communication to localhost:6443 (bypassing the controlPlaneEndpoint) works normally, and that after some time communication through the controlPlaneEndpoint also starts to succeed.
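
For reference, that comparison can be reproduced with two probes from the control plane node (a sketch; the endpoint IP 10.60.31.238 comes from the kubelet log above, and /healthz is assumed to be served anonymously by the apiserver):

# direct to the local apiserver: answers immediately
curl -k https://127.0.0.1:6443/healthz

# through the HAProxy controlPlaneEndpoint: fails (EOF/timeout) until the backend is configured
curl -k --connect-timeout 5 https://10.60.31.238:6443/healthz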

However, the CNI plugin cannot be installed because kube-proxy is missing, so the worker nodes are never created.
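
As a possible manual workaround (not something I tried for this report; the config path is whatever the kubeadm bootstrap data wrote, shown here only as a placeholder), the missing addons could in principle be installed by re-running the kubeadm addon phase once the endpoint responds, and then applying the CNI manifest:

# on the control plane node, after the controlPlaneEndpoint becomes reachable
kubeadm init phase addon all --config <path-to-kubeadm-config.yaml>

# then install the CNI plugin as usual
kubectl --kubeconfig /etc/kubernetes/admin.conf apply -f <cni-manifest.yaml>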

Environment:

  • Cluster-api-provider-vsphere version: v0.6.2
  • Kubernetes version (kubectl version): v1.17.3
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)
k8s-ci-robot added the kind/bug label on Mar 25, 2020

yastij commented Mar 25, 2020

@moonek - these issues were fixed in master and we also have kubernetes-sigs/cluster-api#2763 in CAPI that fixes some join issues.

We'll likely cut a CAPV release today.


moonek commented Mar 25, 2020

@yastij Okay. I will test it as soon as it is released.


yastij commented Mar 31, 2020

@moonek - can you retry with v0.6.3?


moonek commented Apr 1, 2020

@yastij yes, I tested it several times today (with CAPI v0.3.3 and CAPV v0.6.3). It works very well. Good!


moonek commented Apr 1, 2020

I confirmed that this problem has been fixed in CAPV v0.6.3.

moonek closed this as completed on Apr 1, 2020