kubeadm join control-plane node times out (etcd timeout) #1712
@chrischdi Several questions:
Not anymore, but we are running some builds every night and I will catch the logs on the next occurrence.
Yes, the etcd container was running from Docker's perspective. The Kubernetes cluster has already been deleted, so I don't know its exact state. I will also try to get more information on the next occurrence.
We've got the CoreOS built-in Docker version, which is
I hope to have it occur again so I can get all the details and more information.
@chrischdi are you joining concurrently, btw?
/assign
No, only one control-plane node or worker node at a time / sequentially.
The same report here. I changed the priority, and we possibly need to increase the timeout and backport to 1.15.
Let me know if I can help on this :-)
Any ETA for this? I am currently blocked with my multi-master setup.
@chrischdi @sunvk are you reproducing this consistently? Also, our CI is consistently green and we are not seeing the same timeouts. 40 seconds should be more than enough for the etcd cluster to report healthy endpoints.
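For anyone debugging this, a minimal sketch of checking etcd health by hand on a stacked control-plane node (not from this thread; it assumes a Docker runtime, as in this report, and the default kubeadm certificate paths under /etc/kubernetes/pki/etcd):

```sh
# Find the etcd container started by the kubelet (Docker runtime assumed).
ETCD_CONTAINER=$(docker ps -q --filter "name=k8s_etcd")

# Query endpoint health with the certificates kubeadm generates by default.
docker exec -e ETCDCTL_API=3 "$ETCD_CONTAINER" etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# List the members the cluster currently knows about.
docker exec -e ETCDCTL_API=3 "$ETCD_CONTAINER" etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
```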
In terms of making this user-controllable, we have a field for this in v1beta2 and v1beta1. The alternative is to just increase the hardcoded timeouts, but this ticket needs more evidence that it's a consistent bug.
@chrischdi is currently on leave, but he can provide some more details about our problems next Tuesday. AFAIK we ended up patching the hard-coded timeout because we couldn't get our nightly installs consistently green without it.
Please do.
I've got some more data :-) Maybe the timeout does not need to be increased in our case. We had problems with our load balancers: they were already routing traffic to the not-yet-online API server, which caused the timeouts here. I will need to retest with our improved load balancer setup (activating backends only after kubeadm init went through) to see if we are still hitting this issue.
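As a side note (not part of the original comment), a quick sanity check that the load balancer actually routes to a live API server before joining might look like this; <LB_ADDRESS> and <FIRST_MASTER_IP> are placeholders, and 6443 is the usual kubeadm API server port:

```sh
# Any HTTP response (even 401/403) proves the load balancer path works;
# a timeout or "connection refused" points at inactive backends.
curl -vk https://<LB_ADDRESS>:6443/healthz

# Compare against hitting the existing control-plane node directly.
curl -vk https://<FIRST_MASTER_IP>:6443/healthz
```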
Thank you for the feedback.
+1
Adding back "awaiting more evidence".
As of now I'm not able to reproduce the problem anymore in our deployment pipelines using upstream v1.15.3 kubeadm and v1.16.0 kubeadm.
Let's close this issue and reopen if we see it bubbling up again. Thank you for your feedback @chrischdi. /close
@ereslibre: Closing this issue.
I am facing the same issue when attempting to add a second master to a v1.15.2 cluster with
Hi, are you also getting the same error?
@neolit123 Yes, exactly.
Could it be that the retry of ~12 seconds between joining the second and third etcd member is not enough in your case?
There is currently no third etcd member in my setup. The issue already occurs when I try adding a second master node (with stacked control-plane nodes) to an existing single-master cluster. Where do the 12 seconds that you cite come from? Is this a configurable timeout that I could increase?
You can try building kubeadm from source; the timeout is here: but I don't think this will solve the problem. It seems to me something else is at play. Do you have the option to try 1.16.2?
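A rough sketch of what building a patched kubeadm could look like (the exact file holding the retry constants and the output path are assumptions, not confirmed in this thread):

```sh
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
git checkout v1.15.2   # or whichever release you are running

# Edit the etcd retry/timeout constants, e.g. in the etcd helpers under
# cmd/kubeadm/app/util/etcd/ (assumed location), then rebuild only kubeadm:
make WHAT=cmd/kubeadm

# The binary typically lands under _output/local/bin/<os>/<arch>/kubeadm.
```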
@neolit123 Thanks for the exact pointer into the source code. Yes, the option exists. I anticipate upgrading the cluster to v1.16.2 and then adding a third master/etcd. (Other tasks come first on my list, though.)
I believe that I have found the cause of this. To avoid it, you must specify the advertise address manually when joining.
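A sketch of what such a join command might look like (the flag values are placeholders, not the commenter's exact command):

```sh
# Pin the advertise address of the joining control-plane node instead of
# letting kubeadm auto-detect it from the default route.
kubeadm join <LB_ADDRESS>:6443 \
  --token <TOKEN> \
  --discovery-token-ca-cert-hash sha256:<HASH> \
  --control-plane \
  --certificate-key <CERT_KEY> \
  --apiserver-advertise-address <THIS_NODE_IP>
```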
It works! You saved my life! Thanks so much.
I can confirm this. I created a cluster on CentOS 8 Stream (fully updated as of today, including k8s), and when I added slave/worker nodes they were added quickly. But adding another master (via a load balancer) took anywhere between one and two hours (I started it at 5 PM CET, checked back at 9 PM CET, and saw it was up and running). So adding another control-plane/master completely blocks the cluster for a few hours.
I also encountered this problem on 1.24. How did you solve it? Thanks.
What keywords did you search in kubeadm issues before filing this one?
etcd join timeout
kubeadm join timeout
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use kubeadm version): &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.2", GitCommit:"f6278300bebbb750328ac16ee6dd3aa7d3549568", GitTreeState:"clean", BuildDate:"2019-08-05T09:20:51Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Kubernetes version (use kubectl version): v1.15.2
Cloud provider or hardware configuration: OpenStack
OS (e.g. from /etc/os-release): Container Linux by CoreOS 2135.5.0 (Rhyolite)
Kernel (e.g. uname -a): Linux os1pi019-kube-master01 4.19.50-coreos-r1 #1 SMP Mon Jul 1 19:07:03 -00 2019 x86_64 Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz GenuineIntel GNU/Linux
Others:
What happened?
kubeadm join was invoked and failed. The etcd container only started up 7 seconds after kubeadm timed out / exited with failure.
See the following logs (these include kubeadm logs and timestamps for pod-manifest starts):
The timeout we hit here is this one, which uses hardcoded values (8 retries x 5 seconds -> 40 seconds).
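One way to reproduce the timing observation above (assuming a Docker runtime, as in this report; these commands are not part of the original logs):

```sh
# When did the etcd container actually start?
docker ps --filter "name=k8s_etcd" --format '{{.Names}}\t{{.Status}}'
docker inspect --format '{{.State.StartedAt}}' $(docker ps -q --filter "name=k8s_etcd")

# Correlate with the kubelet's static-pod handling around the failed join.
journalctl -u kubelet --since "15 min ago" | grep -i etcd
```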
What you expected to happen?
The etcd member gets joined to the existing control-plane node and kubeadm succeeds.
How to reproduce it (as minimally and precisely as possible)?
Hard to say.
Try lots of kubeadm joins of control-plane nodes.
Anything else we need to know?
In kubeadm init there is a similar-looking parameter called TimeoutForControlPlane, which defaults to 4 minutes and is used here to wait for the API server. This looks similar to me because both the problem described here and the code at the kubeadm init phase wait for a specific pod, started by the kubelet via a pod manifest.
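For comparison, a minimal sketch of raising TimeoutForControlPlane via a kubeadm config file (the values are illustrative, not from this issue; note that this only affects the init-time wait for the API server, not the hardcoded etcd timeout discussed here):

```sh
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.15.2
controlPlaneEndpoint: "<LB_ADDRESS>:6443"
apiServer:
  timeoutForControlPlane: 8m0s
EOF

kubeadm init --config kubeadm-config.yaml --upload-certs
```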
I see three options; one of them is to reuse the existing parameter (TimeoutForControlPlane), which would result in no change to the kubeadm specs.