kubeadm join --control-plane to create HA setup killed the cluster #2275
hi,
just to double check, you meant
and
is that a "controlPlaneEndpoint"? if no, then the second controller will not work after the "upload-certs" command.
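For reference, a minimal sketch of a kubeadm config with a control plane endpoint set (the DNS name here is just the one from the join command quoted later in this issue, and kubeadm.k8s.io/v1beta2 is the config API version used by 1.19; the real cluster config may differ):
cat <<EOF > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.19.0
controlPlaneEndpoint: "api.example.com:6443"
EOF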
the original CP node should not break unless something happened with etcd. /triage support
Hey, thanks for replying.
Oh right, I meant that the initial cluster had all nodes on 1.18.6, then all nodes were upgraded to 1.18.9.* and finally to 1.19.0. The current deb package version is 1.19.0-00.
Yeah, I've been using the DNS name to talk to the cluster from my desktop client. It had both controlPlaneEndpoint and advertise-address set in the kubeadm-config, which I saved at least while doing the previous 1.19.0 upgrade. However, the DNS entry for controlPlaneEndpoint has not yet been pointed at the IP of the 2nd controller, as I wanted to leave that for the next step. To my understanding this setting doesn't affect etcd replication at all.
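For what it's worth, once the API server answers again, the configuration kubeadm actually stored (including controlPlaneEndpoint) can be read back from the standard kubeadm-config ConfigMap:
controller1$ kubectl -n kube-system get configmap kubeadm-config -o yaml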
I did not specify --control-plane-endpoint while doing the init phase upload-certs, if that makes a difference.
I was able to list containers and can see etcd (etcd:3.4.9-1) running, but I couldn't find anything about its logs.
Yeah, perhaps it's an etcd issue rather than something with how kubeadm works. If only I could see what etcd is logging.
While trying to connect to etcd with curl and etcdctl, there is no reply at all, even when the TCP connect is successful.
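For reference, a minimal sketch of how that connectivity check usually looks against kubeadm's stacked etcd (the endpoint and certificate paths are kubeadm's defaults; adjust if the PKI directory was customized):
controller1$ ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
    endpoint health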
After a few container deletes and an etcdctl snapshot restore, while also stopping the 2nd controller from trying to join, it seems I at least have the cluster back in a functional state with one controller.
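Roughly, that single-controller restore looks something like the sketch below (the snapshot file name and the IP are placeholders; the data-dir path and container handling are kubeadm/containerd defaults):
controller1$ systemctl stop kubelet
controller1$ crictl stop $(crictl ps -q --name etcd) && crictl rm $(crictl ps -aq --name etcd)
controller1$ mv /var/lib/etcd /var/lib/etcd.broken
controller1$ ETCDCTL_API=3 etcdctl snapshot restore /root/etcd-backup.db \
    --name controller1 \
    --initial-cluster controller1=https://<controller1-ip>:2380 \
    --initial-advertise-peer-urls https://<controller1-ip>:2380 \
    --data-dir /var/lib/etcd
controller1$ systemctl start kubelet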
in general, you should pass the same --config or flags you have passed to kubeadm init to its phases if you are calling them on demand; otherwise the phases could generate content that is different from what you want.
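for example, a sketch assuming the original init config was kept in a file called kubeadm-config.yaml:
controller1$ kubeadm init phase upload-certs --upload-certs --config kubeadm-config.yaml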
i don't have
etcd could have crashed; you could file the logs in a new issue in the kubernetes/kubernetes repository or the etcd repository if you have them and see e.g. panics.
that is good. all of our kubeadm CI uses the following:
so, this is a supported scenario, yet it's unclear what happened in your case. i'm going to close this support ticket, but please drop a message if you find out what happened.
@neolit123: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Alright, I figured it out eventually. The first issue was that I did not provide
The second issue, which probably only happened because I
So if there are any issues with etcd connectivity while joining a new controller, your cluster will go down, since etcd can't figure out who the leader is. To anyone else getting into this broken state, I did this to recover previously:
This makes sure etcd doesn't try to hook up with a 2nd peer, which would break quorum when that peer doesn't respond.
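One common way to achieve that, sketched here as an assumption and not necessarily exactly what was done in this case, is to temporarily start etcd with --force-new-cluster so it drops the registered 2nd peer and comes back as a single-member cluster:
controller1$ cp /etc/kubernetes/manifests/etcd.yaml /root/etcd.yaml.bak
# edit /etc/kubernetes/manifests/etcd.yaml and add "- --force-new-cluster" to the etcd command;
# kubelet recreates the static pod and etcd starts as a single-member cluster again.
# once it is healthy, remove the flag so the pod restarts without it.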
BUG REPORT

Versions

kubeadm version (use kubeadm version): v1.19.0

Environment:
- kubectl version: v1.19.0
- uname -a: 5.4.0-1015-raspi aarch64

What happened?
While following the high availability guide at https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/ to join another controller node for replication and high availability, the cluster instead (ironically) stopped working.
With the first controller node fully functional, nodes working and pods being scheduled as far as I could tell, these steps were taken on controller1 (working) and controller2 (to be joined into HA):
controller1$ kubeadm init phase upload-certs --upload-certs
controller1$ kubeadm token create
controller2$ kubeadm join --token <copied-from-controller1-output> --discovery-token-unsafe-skip-ca-verification --control-plane --certificate-key <copied-from-controller1-output> api.example.com:6443
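As an aside, a slightly less error-prone way to assemble that join command is to have kubeadm print it and then append the control-plane flags (a sketch with placeholder values):
controller1$ kubeadm token create --print-join-command
controller2$ <printed join command> --control-plane --certificate-key <key-from-upload-certs>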
Now the output on controller2 stopped with:
Going back to controller1, it could no longer connect to the API server; the cluster doesn't respond any more.
Restarting kubelet resulted in a looping log of:
node "controller1" not found
It seems to me the etcd data on controller1 somehow vanished or became corrupt after the attempted join by controller2. However, I'm not sure exactly how to check the etcd logs while it runs as a static pod on containerd instead of Docker.
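For reference, a sketch of how this usually works with containerd via crictl (the socket path is containerd's default and may already be configured in /etc/crictl.yaml, in which case the flag can be dropped):
controller1$ crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a --name etcd
controller1$ crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs <etcd-container-id>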
What you expected to happen?
I would never expect that the first controller might break while joining the second one.
How to reproduce it (as minimally and precisely as possible)?
The setup is as follows:
Anything else we need to know?
It's a small cluster; there have been no performance issues with the rPi controllers, nor any etcd corruption before this.