HA clusters don't reboot properly #1689
Comments
Still needs root causing, but multiple user reports. We should fix this. |
Yes, I hit the same issue today. Is there any workaround to fix it manually? I spent quite a long time setting up this test KIND environment and have used it for a while, so I don't want to recreate it. Is there any way to restore it? Another thing, which may or may not be related to this problem: yesterday I upgraded KIND from 0.7 to 0.8.1. My old nodes used to be |
I haven't looked into this issue yet. Regarding the node versions, please read the release notes about the changes, and see the usage and user guide for how to change it. |
This has never worked. 0.7 and down did not survive reboots for *any*
configuration. 0.8+ apparently doesn't survive reboots for "HA" clusters.
…On Thu, Jul 2, 2020, 18:21 Bill Wang ***@***.***> wrote:
yes, I hit the same issue today.
any workaround I can manually fix it? I spent a little bit long time to
set up the test KIND environment, I don't want to recreate it.
Any way to restore it back?
|
cc @aojea you were recently looking at the loadbalancer networking |
/assign |
It is more complicated than the load balancer: after the restart the control-plane nodes have different IPs and the cluster does not come up.
It seems we should use hostnames on the certificates to avoid this. |
We do where we can already. IIRC etcd won't use hostnames.
…On Thu, Aug 27, 2020, 01:01 Antonio Ojea ***@***.***> wrote:
it is more complicated than the load balancer, the control plane nodes has
different ips and the cluster does not come up
2020-08-27 07:58:58.054356 E | etcdserver: publish error: etcdserver: request timed out
2020-08-27 07:58:58.061687 W | rafthttp: health check for peer 6dd029603bf5e797 could not connect: x509: certificate is valid for 172.18.0.7, 127.0.0.1, ::1, not 172.18.0.5
2020-08-27 07:58:58.061717 W | rafthttp: health check for peer 6dd029603bf5e797 could not connect: x509: certificate is valid for 172.18.0.7, 127.0.0.1, ::1, not 172.18.0.5
2020-08-27 07:58:58.063416 W | rafthttp: health check for peer 2b4992c658e42934 could not connect: dial tcp 172.18.0.7:2380: connect: no route to host
2020-08-27 07:58:58.063454 W | rafthttp: health check for peer 2b4992c658e42934 could not connect: dial tcp 172.18.0.7:2380: connect: no route to host
seems we should use hostnames to sign certificates to avoid this
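To make the failure in the log excerpt easier to spot, here is a minimal sketch that scans etcd output for the x509 mismatch warning and reports which addresses each peer certificate was issued for versus the address actually dialed. The regular expression is written against the exact message format shown above and is an assumption about that format only; `find_san_mismatches` is a hypothetical helper, not part of kind or etcd.

```python
import re

# Matches the rafthttp x509 warning exactly as it appears in the excerpt above.
X509_RE = re.compile(
    r"health check for peer (?P<peer>[0-9a-f]+) could not connect: "
    r"x509: certificate is valid for (?P<sans>.+), not (?P<actual>\S+)"
)


def find_san_mismatches(log_text):
    """Return a set of (peer_id, cert_addresses, dialed_address) tuples."""
    mismatches = set()
    for line in log_text.splitlines():
        match = X509_RE.search(line)
        if match:
            sans = tuple(s.strip() for s in match.group("sans").split(","))
            mismatches.add((match.group("peer"), sans, match.group("actual")))
    return mismatches


# Two of the log lines quoted in this thread.
SAMPLE = """\
2020-08-27 07:58:58.061687 W | rafthttp: health check for peer 6dd029603bf5e797 could not connect: x509: certificate is valid for 172.18.0.7, 127.0.0.1, ::1, not 172.18.0.5
2020-08-27 07:58:58.063416 W | rafthttp: health check for peer 2b4992c658e42934 could not connect: dial tcp 172.18.0.7:2380: connect: no route to host
"""

mismatches = find_san_mismatches(SAMPLE)
for peer, sans, actual in sorted(mismatches):
    print(f"peer {peer}: cert covers {', '.join(sans)} but was reached at {actual}")
```

Here the certificate covers 172.18.0.7 while the peer is being reached at 172.18.0.5, which is consistent with the node addresses having changed across the restart.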
|
Same issue here. My machine environment: macOS High Sierra v10.13.6. kind environment: kind-control-plane. After docker reboot: output: Is there any workaround for this issue? BTW, when running with only one kind-control-plane the reboot passed successfully. |
There's no workaround. Rebooting HA (multiple control plane) clusters has never been supported and does not appear to be trivial to fix. #1689 (comment) |
Hi @BenTheElder, I guess this issue is caused by the nodes' IPs changing when docker restarts. One possible solution is to assign a fixed IP to each node. That requires 2 steps:
|
@RolandMa1986 thanks for the suggestion, but we discarded that idea before because we would need to implement an IPAM in KIND. We would also need to keep state for all the KIND clusters to handle reboots and avoid conflicts with new clusters or new containers created on the bridge. |
Thanks, @aojea |
it depends on the provider, currently KIND uses docker as default, that means CNM ... as you can see this is an area that will require a lot of effort to support, honestly, I don't see that we want to invest much on this ... Ben can correct me if I'm wrong |
I don't think that's a good approach. If we create non-standard IPAM this will create a headache for users vs their existing ability to configure docker today. Additionally, this approach still does not guarantee an address, and you have concurrency issues with clusters using a remote docker (where will you store and lock the IPAM data?), which otherwise works fine for users today. We can probably instead re-roll the etcd peer configuration and necessary certs on restart, but this is very low priority. The main reason to support clusters through reboot is long lived development clusters for users building applications, which should not be using "HA" clusters. Otherwise for testing / disposable clusters, this is a non-issue. |
see more here on why the "ipam" approach is not super tenable: #2045 (comment) |
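As a rough illustration of the concurrency concern raised above, the following sketch shows a naive "pick the first free address" allocator. The subnet `172.18.0.0/16` and the helper `next_free_ip` are assumptions made up for this example; nothing here reflects how kind or docker actually allocate addresses.

```python
import ipaddress


def next_free_ip(subnet, used):
    """Return the first host address in `subnet` that is not in `used`."""
    for host in ipaddress.ip_network(subnet).hosts():
        if str(host) not in used:
            return str(host)
    raise RuntimeError("subnet exhausted")


# Two cluster creations read the same snapshot of allocated addresses
# before either one records its choice (no shared lock, no shared store).
snapshot = {"172.18.0.1", "172.18.0.2"}
pick_a = next_free_ip("172.18.0.0/16", snapshot)
pick_b = next_free_ip("172.18.0.0/16", snapshot)

print(pick_a, pick_b)  # both callers chose the same address
```

Avoiding the collision would require the allocation state to be stored and locked somewhere both callers can reach, which is the part that has no obvious home when the docker daemon is remote.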
I faced a rather different issue after I restarted the machine: ReplicaSets were not creating pods when a pod was deleted, and Deployments were not creating ReplicaSets. |
Hi @BenTheElder, I know that using DNS names is the cleanest solution for issue #2045. However, I am using this script as a workaround to use static IPs for communication between the nodes. I have restarted my cluster several times and it has worked fine so far. |
Users may have multiple clusters, and that is hard to support. However, your script is great; I think it can also solve the problems of snapshotting HA clusters. |
That's a neat script! It's unfortunately not super workable as an approach to a built-in solution though. Users creating clusters concurrently in CI (and potentially with a "remote" daemon due to containerized CI) are very important to us and this approach is not safe there. |
Can't we extend the kind config to take an IP-to-node mapping as an optional parameter: only for those who know what they are doing? |
This is not without its own drawbacks.
Multi-node clusters are a necessity for testing Kubernetes itself (where we expect clusters to be disposable over the course of developing some change to Kubernetes). The case of:
Seems rather rare and I'm not sure it outweighs adding a broken partial solution that people will then depend on in the future even if we find some better design. I'm not saying we definitely couldn't do this, but I wouldn't jump to doing it today. |
k3d seems to have done something about this here: k3d-io/k3d#550 (comment) which links back to this issue in our repo.
This looks to me like a broken approach to identifying an available subnet (there is at minimum a race between acquiring / deleting the "fake" network and creating the real one with two clusters), but I'm also unclear as of yet if the IP range is used on a per-cluster network or IPs outside of another network's range are used on that network. It may be worth digging into the approach there more. |
This is where plugins help! In any case, the quick start or other docs must mention that multi-node clusters will not survive a restart. When I first faced the issue, I had a hard time debugging it. |
We document this sort of thing at https://kind.sigs.k8s.io/docs/user/known-issues/ which the quick start links to prominently, but it seems this issue hasn't made it there yet. Earlier versions did not support host restart at all, it wasn't in scope early in the project. |
Can I second this, I just spent several days building a multi-node cluster, then on the first reboot, effectively lost the lot. Not best pleased, especially when after researching my problem, finding this is a known issue. For the sake of the sanity of others, can someone please put a simple warning about this in the known issues section of the Kind documentation. |
FWIW:
We have a detailed contributing guide including how to contribute to the docs, the known issues page is written in markdown in this repo. No tools other than git / github / markdown text are required. |
@BenTheElder what do you mean by single vs multi-node clusters when referring to Kind? Do you mean multiple control-plane nodes or multiple worker nodes? I'd never need multiple control-plane nodes in Kind, but I sometimes do need multiple worker nodes to test tolerations, affinities, etc. And I would like clusters with multiple worker nodes to survive a reboot. |
Yes, affinities and tolerations are a case for using multi-node, and that issue is #2405 |
Please, for the most stupid of us, explain what you mean by multi-node. Is a cluster with multiple worker nodes and one control-plane node still multi-node?
Sorry did not see anything relevant there. Maybe it's just me. |
Multi-node has multiple nodes. This issue is specifically about problems with clusters that have multiple control-plane nodes ("HA"). Solving issues related to multi-node reboots in general will likely leave multi-control-plane specific issues (specifically, around the loadbalancer).
I typoed #2045 on mobile, but it is already linked and discussed in the previous comment #1689 (comment) |
I think #2775 has the general right idea of "just fixup the IP(s) on startup" but would need a bit more thought for HA clusters. In general "HA" / multi-control-plane-node kind clusters could use more thought. We've had little demand for them so far, Kubernetes has surprisingly little CI for this at the moment. |
Here's what I've been doing to get my HA kind cluster back up after reading this post. After rebooting the host, the cluster did not come up as expected. Interestingly, the external load balancer is not restarted automatically on reboot and requires a manual start - hmm! I then took note of the load balancer IP from the kube config file, and of the control-plane IPs from what etcd was advertising on each node. I then followed these steps to bring it all up again:
NOTE: Rebooting host or restarting docker after this manual IP assignment results in the nodes now remembering their IP addresses. Interestingly though, I occasionally need to restart some nodes to resolve crash loops. EDIT: Adding my kind config
|
Replying to myself: I believe the first time I did this I saved the docker network state - |
#2775 as mentioned in #1689 (comment) above is pretty recent and does involve patching the local component manifests' IP for the node's last => current IP. |
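The "last => current IP" idea can be illustrated with a small text rewrite. This is only a sketch of the general technique, not the code from #2775: `replace_ip` is a hypothetical helper, and the two-line sample text is made up for the example rather than taken from a real kind node.

```python
import re


def replace_ip(text, old_ip, new_ip):
    """Replace whole occurrences of old_ip with new_ip.

    The lookarounds keep longer addresses that merely contain old_ip
    (for example 172.18.0.50 when old_ip is 172.18.0.5) untouched.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(old_ip) + r"(?!\d)")
    return pattern.sub(lambda _match: new_ip, text)


# Made-up sample text: the node moved from 172.18.0.5 to 172.18.0.7,
# while a different node legitimately lives at 172.18.0.50.
sample = (
    "--advertise-address=172.18.0.5\n"
    "--initial-cluster=a=https://172.18.0.5:2380,b=https://172.18.0.50:2380"
)

patched = replace_ip(sample, "172.18.0.5", "172.18.0.7")
print(patched)
```

A plain `str.replace` would also rewrite the `172.18.0.50` entry, which is why the sketch anchors the match on both sides. For HA clusters the peers' addresses and certificates would presumably need the same treatment, which is the part the comments above describe as needing more thought.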
Thanks @BenTheElder - As I trawled more regarding this issue, I came across this post in the other closed issue that states the same thing. Apologies for flooding... |
Is there currently a solution to this problem? I tested using version v0.24.0 and found that it was still not resolved. Now I occasionally restore the kind cluster by restarting docker multiple times. |
Not readily, though FWIW I don't recommend HA control plane for development (or even multi-node) unless you have realllly specific testing needs (e.g. developing an HA control plane component). HA in general is not receiving a lot of attention, kind or otherwise. That may change, I've been talking to a couple contributors regarding testing new HA improvements in Kubernetes (like kubernetes/enhancements#4020). In general we spend most energy in kind on making clusters come up quickly and reliably, for disposable testing. |
first reported in #1685
tracking in an updated bug.
reproduce with:
+ restart docker.