
kubeadm join control-plane node times out (etcd timeout) #1712

Closed
chrischdi opened this issue Aug 7, 2019 · 29 comments
Labels: area/etcd, kind/bug, priority/awaiting-more-evidence

Comments

@chrischdi
Member

chrischdi commented Aug 7, 2019

What keywords did you search in kubeadm issues before filing this one?

etcd join timeout
kubeadm join timeout

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version): kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T09:20:51Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version): v1.15.2

  • Cloud provider or hardware configuration: Openstack

  • OS (e.g. from /etc/os-release): Container Linux by CoreOS 2135.5.0 (Rhyolite)

  • Kernel (e.g. uname -a): Linux os1pi019-kube-master01 4.19.50-coreos-r1 #1 SMP Mon Jul 1 19:07:03 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz GenuineIntel GNU/Linux

  • Others:

What happened?

kubeadm join was invoked and failed.
The etcd container only started up 7 seconds after kubeadm timed out and exited with failure.
See the following logs (these include kubeadm logs and timestamps for pod-manifest starts):

09:30:27 kubeadm service starts
09:30:27 kubeadm[2025]: [preflight] Reading configuration from the cluster...                                                                                                                                                                                                                                                                                   
09:30:27 kubeadm[2025]: [preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'                                                                                                                                                                                                                            
09:30:27 kubeadm[2025]: [control-plane] Using manifest folder "/etc/kubernetes/manifests"                                                                                                                                                                                                                                                                       
09:30:27 kubeadm[2025]: [control-plane] Creating static Pod manifest for "kube-apiserver"                                                                                                                                                                                                                                                                       
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-policy" to "kube-apiserver"                                                                                                                                                                                                                                                          
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "policy-controller" to "kube-apiserver"                                                                                                                                                                                                                                                     
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-log" to "kube-apiserver"                                                                                                                                                                                                                                                             
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "scheduler-policy" to "kube-scheduler"                                                                                                                                                                                                                                                      
09:30:27 kubeadm[2025]: [control-plane] Creating static Pod manifest for "kube-controller-manager"                                                                                                                                                                                                                                                              
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-policy" to "kube-apiserver"                                                                                                                                                                                                                                                          
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "policy-controller" to "kube-apiserver"                                                                                                                                                                                                                                                     
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-log" to "kube-apiserver"                                                                                                                                                                                                                                                             
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "scheduler-policy" to "kube-scheduler"                                                                                                                                                                                                                                                      
09:30:27 kubeadm[2025]: [control-plane] Creating static Pod manifest for "kube-scheduler"                                                                                                                                                                                                                                                                       
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-policy" to "kube-apiserver"                                                                                                                                                                                                                                                          
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "policy-controller" to "kube-apiserver"                                                                                                                                                                                                                                                     
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "audit-log" to "kube-apiserver"                                                                                                                                                                                                                                                             
09:30:27 kubeadm[2025]: [controlplane] Adding extra host path mount "scheduler-policy" to "kube-scheduler"                                                                                                                                                                                                                                                      
09:30:27 kubeadm[2025]: [check-etcd] Checking that the etcd cluster is healthy                                                                                                                                                                                                                                                                                  
09:30:27 kubeadm[2025]: [kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.15" ConfigMap in the kube-system namespace                                                                                                                                                                                                         
09:30:27 kubeadm[2025]: [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"                                                                                                                                                                                                                                                    
09:30:27 kubeadm[2025]: [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"                                                                                                                                                                                                                                
09:30:27 kubeadm[2025]: [kubelet-start] Activating the kubelet service                                                                                                                                                                                                                                                                                          
09:30:27 kubeadm[2025]: [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...                                                                                                                                                                                                                                                                 
09:30:29 kubeadm[2025]: [etcd] Announced new etcd member joining to the existing etcd cluster                                                                                                                                                                                                                                                                   
09:30:29 kubeadm[2025]: [etcd] Wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"                                                                                                                                                                                                                                       
09:30:29 kubeadm[2025]: [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s                                                                                                                                                                                                                                                     
09:30:38 etcd pause shim create
09:30:30 kube-scheduler pause shim create
09:30:30 kube-controller-manager pause shim create
09:30:34 kube-scheduler shim create
09:30:35 kube-scheduler first logs
09:30:36 kube-apiserver pause shim create
09:31:07 kubeadm[2025]: [kubelet-check] Initial timeout of 40s passed.                                                                                                                                                                                                                                                                                          
09:31:25 kube-controller-manager shim create
09:31:25 kube-controller-manager first logs
09:31:43 kube-apiserver shim create
09:31:44 kubeadm[2025]: error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available                                                                                                                                                                                     
09:31:44 systemd[1]: kubeadm.service: Main process exited, code=exited, status=1/FAILURE                                                                                                                                                                                                                                                                        
09:31:44 kube-apiserver first logs
09:31:51 etcd shim create
09:31:52.081609 etcd first logs

The timeout we hit here is this one, which uses hardcoded values (8 retries × 5 seconds = 40s).
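For reference, the health check that kubeadm keeps retrying can also be run by hand to see how long etcd actually takes to report healthy. This is only a rough sketch, assuming the default kubeadm PKI paths under /etc/kubernetes/pki/etcd and an etcdctl binary available on the node:

# Query etcd endpoint health directly, using the client certificates kubeadm
# generates by default (paths are an assumption; adjust for your setup).
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# List the current members to confirm the new member was actually announced.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member list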

What you expected to happen?

The etcd member joins the existing etcd cluster and kubeadm succeeds.

How to reproduce it (as minimally and precisely as possible)?

Hard to say.
Try lots of kubeadm joins of control-plane nodes.

Anything else we need to know?

In kubeadm init there is a similar-looking parameter called TimeoutForControlPlane, which defaults to 4 minutes and is used here to wait for the API server.

This looks similar to me because both the problem described here and the code in the kubeadm init phase wait for a specific pod started by the kubelet via a static pod manifest.

I see three options:

  • increase the hardcoded values
  • use the same parameter as already used during init (TimeoutForControlPlane), which would mean no change to the kubeadm specs (see the config sketch after this list)
  • add an additional parameter to the kubeadm spec
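For illustration, the second option would amount to honouring the field users can already set today. This is a sketch only: the duration value is an arbitrary example, and at the moment this field only governs how long kubeadm init waits for the API server, not the etcd join wait.

# Hypothetical usage sketch: timeoutForControlPlane already exists in v1beta2
# under the apiServer section of ClusterConfiguration.
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  timeoutForControlPlane: 8m0s
EOF
kubeadm init --config kubeadm-config.yaml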
@neolit123 added the area/etcd, kind/bug, and priority/important-longterm labels on Aug 7, 2019
@neolit123 added this to the v1.16 milestone on Aug 7, 2019
@timothysc
Member

@chrischdi Several questions

  • Do you have the docker logs detailing what happened?
  • Did the etcd container start?
  • What version of docker are you running?
  • Does this happen consistently or intermittently?

@chrischdi
Member Author

chrischdi commented Aug 7, 2019

@chrischdi Several questions

  • Do you have the docker logs detailing what happened?

Not anymore, but we are running some builds every night and I will catch the logs on the next occurrence.

  • Did the etcd container start?

Yes, the etcd container was running from Docker's perspective. The Kubernetes cluster has already been deleted, so I don't know its exact state. I will also try to get more information on the next occurrence.

  • What version of docker are you running?

We've got the CoreOS built-in Docker version, which is 18.06.3.

  • Does this happen consistently or intermittently?

I currently cannot reproduce it reliably.

I hope to have it occur again so I can get all the details and more information.

@timothysc added the priority/awaiting-more-evidence label and removed the priority/important-longterm label on Aug 7, 2019
@neolit123
Member

@chrischdi are you joining concurrently btw?

@ereslibre
Contributor

/assign

@chrischdi
Member Author

@chrischdi are you joining concurrently btw?

No, only one control-plane node or worker node at a time, i.e. sequentially.

@neolit123 changed the title from "kubeadm join control-plane node times out" to "kubeadm join control-plane node times out (etcd timeout)" on Aug 7, 2019
@neolit123 added the priority/important-soon label and removed the priority/awaiting-more-evidence label on Aug 7, 2019
@neolit123
Member

neolit123 commented Aug 7, 2019

the same report here:
kubernetes/website#15637

i changed the priority and we possibly need to increase the timeout and backport to 1.15.

@neolit123 reopened this on Aug 7, 2019
@chrischdi
Member Author

the same report here:
kubernetes/website#15637

i changed the priority and we possibly need to increase the timeout and backport to 1.15.

Let me know if I can help on this :-)

@sunvk

sunvk commented Aug 8, 2019

Any ETA for this? I am currently blocked with my multi master setup.

@neolit123
Member

neolit123 commented Aug 20, 2019

@chrischdi @sunvk are you reproducing this consistently?
i tried today running inside VMs and i couldn't.

also our CI is consistently green and we are not seeing the same timeouts.
(consistently, minus some other aspects)

09:31:44 kubeadm[2025]: error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available

40 seconds should be more than enough for the etcd cluster to report healthy endpoints.
what are you seeing with kubeadm ... --v=2?
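i.e. something along these lines (endpoint, token, and hashes are placeholders for your own values):

# Re-run the join with increased verbosity to see where it stalls.
kubeadm join LOAD_BALANCER_DNS:6443 \
  --control-plane \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --certificate-key <key> \
  --v=2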

@neolit123
Member

neolit123 commented Aug 20, 2019

in terms of making this user controllable we have a field in v1beta2 and v1beta1
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta2

called timeoutForControlPlane, but it's under the apiServer config.
it feels to me that etcd timeouts need a new field, except that we cannot add this field or backport it to v1beta1 and v1beta2, so it has to wait for v1beta3.

the alternative is to just increase the hardcoded timeouts, but this ticket needs more evidence that it's a consistent bug.

@sbueringer
Member

@chrischdi is currently on leave but he can provide some more details of our problems next Tuesday. AFAIK we ended up patching the hard-coded timeout because we couldn't get our nightly installs consistently green without patching it.

@neolit123
Member

please do.
my only explanation here would be slow hardware or networking and i would like to get the exact causes.

@chrischdi
Member Author

I've got some more data :-)

Maybe the timeout does not need to be increased in our case. We had problems with our load balancers becoming active and routing traffic to the still-offline API server, which caused the timeouts here.

I will need to retest with our improved load balancer setup (activating backends only after kubeadm init went through) to see if we are still hitting this issue.
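As a quick pre-join sanity check, something like the following can confirm the load balancer is actually routing to a healthy API server (the endpoint name is a placeholder):

# Probe the API server through the load balancer before running kubeadm join.
# -k skips TLS verification; /healthz is typically readable without credentials.
curl -k https://LOAD_BALANCER_DNS:6443/healthz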

@ereslibre
Contributor

We had problems with our load balancers becoming active and routing traffic to the still-offline API server, which caused the timeouts here.

Thank you for the feedback.

I will need to retest with our improved load balancer setup (activating backends only after kubeadm init went through) to see if we are still hitting this issue.

+1

@neolit123 added the priority/awaiting-more-evidence label and removed the priority/important-soon label on Aug 29, 2019
@neolit123
Member

adding back "awaiting more evidence"

@neolit123 modified the milestones: v1.16 → v1.17 on Sep 10, 2019
@chrischdi
Member Author

As of now I'm not able to reproduce the problem anymore in our deployment pipelines using upstream kubeadm v1.15.3 and v1.16.0.
@neolit123 I propose we close this one?

@ereslibre
Contributor

Let's close this issue and reopen if we see it bubbling up again. Thank you for your feedback @chrischdi.

/close

@k8s-ci-robot
Contributor

@ereslibre: Closing this issue.

In response to this:

Let's close this issue and reopen if we see it bubbling up again. Thank you for your feedback @chrischdi.

/close


@christian-2

I am facing the same issue when attempting to add a second master to a v1.15.2 cluster with kubeadm join .... An etcd snapshot taken on the first master is 19M in size. kubelet.service and the static pods (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) came up on the second master, and an etcd snapshot taken on the second master is also 19M in size. (So maybe the error indicates a non-event?)
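For anyone wanting to compare state the same way, a snapshot can be taken and inspected roughly like this (the cert paths assume the default kubeadm layout and are an assumption on my part):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  snapshot save /tmp/etcd-snapshot.db

# Report size, revision, and key count for comparison between the two masters.
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-snapshot.db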

@neolit123
Member

neolit123 commented Nov 4, 2019

hi, are you also getting:

error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available

?

@christian-2

hi, are you also getting:

error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available

?

@neolit123 Yes, exactly.

@neolit123
Member

could it be that the retry of ~12 seconds between joining the second and third etcd member is not enough in your case?

@christian-2

could it be that the retry of ~12 seconds between joining the second and third etcd member is not enough in your case?

There is currently no third etcd member in my setup. The issue already occurs when I try adding a second master node (with stacked control plane nodes) to an existing single-master cluster.

Where do the 12 seconds that you cite come from? Is this a configurable timeout that I could increase?

@neolit123
Member

neolit123 commented Nov 5, 2019

you can try building kubeadm from source:

cd kubernetes
git checkout v1.15.2

<apply patch>

make all WHAT=cmd/kubeadm

the timeout is here:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/etcd/etcd.go#L41-L48

but i don't think this will solve the problem. seems to me something else is at play.

do you have the option to try 1.16.2?

@christian-2

@neolit123 Thx for the exact pointer into source code.

Yes, the option exists. I anticipate upgrading the cluster to v1.16.2 and then adding a third master/etcd. (Other tasks first on my list, though.)

@rmja

rmja commented May 1, 2020

I believe I have found the cause of this.

When observing /etc/kubernetes/manifests/etcd.yaml on the backup master that is trying to join, you will see that it advertises on a different IP range than the primary master.

To avoid this, you must specify the advertise address manually when joining:

kubeadm join .. --control-plane --apiserver-advertise-address <ip>

Where <ip> is an address in the same subnet as the control plane.
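One quick way to verify what the new member is advertising (and compare it against the first master) is to look at the flags in the generated manifest, for example:

# Compare the advertised URLs on both control-plane nodes; they should live in
# a subnet the existing etcd members can actually reach.
grep -E 'advertise-client-urls|initial-advertise-peer-urls' /etc/kubernetes/manifests/etcd.yaml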

@chanhz

chanhz commented Dec 14, 2020

I believe I have found the cause of this.

When observing /etc/kubernetes/manifests/etcd.yaml on the backup master that is trying to join, you will see that it advertises on a different IP range than the primary master.

To avoid this, you must specify the advertise address manually when joining:

kubeadm join .. --control-plane --apiserver-advertise-address <ip>

Where <ip> is an address in the same subnet as the control plane.

It works! You saved my life! Thanks so much.

@APoniatowski

I can confirm this.

I created a cluster on CentOS 8 Stream (fully updated as of today, including k8s), and when I added worker nodes they were added almost instantly. But adding another master (via load balancer):

sudo kubeadm join {LOADBALANCER DNS}:6443 --token {TOKEN} --discovery-token-ca-cert-hash [HASH] --control-plane --certificate-key [HASH]

this took anywhere between one and two hours (I started it at 5pm CET, checked back at 9pm CET, and saw it was fine/up and running).

So adding another control-plane/master completely blocks the cluster off for a few hours.

@bmmpp

bmmpp commented May 23, 2022

I also encountered this problem on 1.24. How did you solve it? Thanks.
