After restarting Docker, the kind cluster couldn't connect #1685
we need to know more details, like what version you're using. |
it would also be helpful to know if this happens with a simple single-node cluster. |
Thank you for your quick reply. I use 0.8.1:
tsunomur@VM:~$ kind --version
kind version 0.8.1
When I created a simple cluster, it was not the same situation.
Create cluster and check health:
tsunomur@VM:~$ kind create cluster
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.18.2) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋
tsunomur@VM:~$ k run nginx --image nginx --restart=Never
pod/nginx created
tsunomur@VM:~$ k get po
NAME READY STATUS RESTARTS AGE
nginx 0/1 ContainerCreating 0 2s
tsunomur@VM:~$ k get po -w
NAME READY STATUS RESTARTS AGE
nginx 0/1 ContainerCreating 0 3s
nginx 1/1 Running 0 17s
^Ctsunomur@VM:~$ k cluster-info
Kubernetes master is running at https://127.0.0.1:38413
KubeDNS is running at https://127.0.0.1:38413/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
tsunomur@VM:~$ k get componentstatuses
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health":"true"}
tsunomur@VM:~$
tsunomur@VM:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
be19fb44893d kindest/node:v1.18.2 "/usr/local/bin/entr…" 2 minutes ago Up About a minute 127.0.0.1:38413->6443/tcp kind-control-plane
tsunomur@VM:~$
Restart docker and check health:
tsunomur@VM:~$ sudo systemctl stop docker
tsunomur@VM:~$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2020-06-23 17:54:16 UTC; 7s ago
Docs: https://docs.docker.com
Process: 31153 ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock (code=exited, status=0/SUCCESS)
Main PID: 31153 (code=exited, status=0/SUCCESS)
Jun 23 17:15:16 VM dockerd[31153]: time="2020-06-23T17:15:16.225369221Z" level=info msg="API listen on /var/run/docker.sock"
Jun 23 17:15:16 VM systemd[1]: Started Docker Application Container Engine.
Jun 23 17:16:43 VM dockerd[31153]: time="2020-06-23T17:16:43.583494164Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 17:54:02 VM systemd[1]: Stopping Docker Application Container Engine...
Jun 23 17:54:02 VM dockerd[31153]: time="2020-06-23T17:54:02.941309822Z" level=info msg="Processing signal 'terminated'"
Jun 23 17:54:12 VM dockerd[31153]: time="2020-06-23T17:54:12.957486326Z" level=info msg="Container be19fb44893d46e0e7800cd8af414b80fc5d4bccd0d050ce282a685dd93d3735 failed to exit within
Jun 23 17:54:15 VM dockerd[31153]: time="2020-06-23T17:54:15.089736440Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 23 17:54:16 VM dockerd[31153]: time="2020-06-23T17:54:16.254422134Z" level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=m
Jun 23 17:54:16 VM dockerd[31153]: time="2020-06-23T17:54:16.254886136Z" level=info msg="Daemon shutdown complete"
Jun 23 17:54:16 VM systemd[1]: Stopped Docker Application Container Engine.
tsunomur@VM:~$ sudo systemctl start docker
tsunomur@VM:~$ k cluster-info
Kubernetes master is running at https://127.0.0.1:38413
KubeDNS is running at https://127.0.0.1:38413/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
tsunomur@VM:~$ k get componentstatuses
NAME STATUS MESSAGE ERROR
controller-manager Healthy ok
scheduler Healthy ok
etcd-0 Healthy {"health":"true"}
tsunomur@VM:~$ k run nginx-after-restart --image nginx --restart=Never
pod/nginx-after-restart created
tsunomur@VM:~$ k get po
NAME READY STATUS RESTARTS AGE
nginx 0/1 Unknown 0 2m
nginx-after-restart   0/1     ContainerCreating   0          2s
But if only a Pod (not managed by a Deployment) ends up in an Error state, I will just recreate it. I won't use a multi-node cluster yet. |
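For a standalone pod like this (not managed by a Deployment), recovering after the restart is just a matter of deleting and re-running it; a minimal sketch using the same pod name as above:
kubectl delete pod nginx
kubectl run nginx --image nginx --restart=Never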
yeah, some errored pods are expected, not all things handle the IP switch well etc. The cluster not coming back up with multi-node is not. What happens if you use:
# a cluster with 1 control-plane node and 2 workers
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
It's possible we have a bug in the "HA" mode, it's not well tested or used for much currently. |
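As a side note, a config file like the one above is passed at cluster creation time; the file name here is just an example:
kind create cluster --config kind-config.yaml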
I tried only a control-plane cluster with multi worker, and then reboot dockerd, it's seem to good condition. Thank you. |
I think this issue should be re-opened. The problem occurs when more than 1 control-plane is used. I could reproduce it easily using this config (kind v0.8.1, docker 19.03.11-ce):
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
|
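A rough reproduction sequence for the report above, assuming the config is saved as two-control-planes.yaml (the file name is illustrative):
kind create cluster --config two-control-planes.yaml
kubectl cluster-info --context kind-kind    # works before the restart
sudo systemctl restart docker
kubectl cluster-info --context kind-kind    # fails after the restart in the reported scenario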
I don't think 2 control planes is a valid configuration in kubeadm @rolinh, only 3? I thought we validated this but we must not. That said, it does seem we have a bug here with multiple control planes. I'm going to interject a brief note: I highly recommend testing with a single-node cluster unless you have strong evidence that multi-node is relevant, doubly so for multiple control planes. |
@BenTheElder fwiw, the issue is the same with 3 control planes.
Would you mind expanding on this? Why is this a problem? I've been testing things with clusters of up to 50 nodes without issues so far, except upon docker service restart (or machine reboot). As a single control-plane is sufficient, I'll stick to this, but I do need to test things in multi-node clusters. |
50 nodes? Cool! That's actually the largest single kind cluster I've heard of so far :-) Many (most?) apps are unlikely to gain anything testing-wise from multiple nodes, but running multi-node kind clusters overcommits the hardware (each node reports having the full host resources) while adding more overhead. The "HA" mode is not actually HA due to etcd and due to running on top of one physical host ... it is somewhat useful for certain things where multiple api-servers matter. Similarly, multi-node is used for testing where multi-node rolling behavior matters (we test kubernetes itself with 1 control plane and 2 workers typically); outside of that it's just extra complexity and overhead. |
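To see the overcommit in practice, each node's advertised capacity can be listed with plain kubectl (no kind-specific flags); every kind node will report the full host CPU and memory:
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu,MEMORY:.status.capacity.memory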
I've tried to push it further just out of curiosity, but a 100-node cluster attempt brought my machine down to its knees with a ridiculous 2500+ load average at some point 😁 I work on Cilium (so I use kind with the Cilium CNI) and at the moment more specifically on Hubble Relay for cluster-wide observability, and being able to test things in a local multi-node cluster is just amazing. I used to have to run multiple VMs but that process is much heavier. We're also able to test things like cluster mesh with kind. We also recently introduced kind as part of our CI to run smoke tests. |
cool, that's definitely one of those apps that will benefit from multi-node :-) |
tracking the HA restart issue with a bug here #1689 |
Facing this issue on a kind cluster with one control-plane and 2 worker nodes. When I start the cluster everything works fine, but when I restart my machine the pods go into Pending state. Below are some of the outputs.
When I describe the nodes there are no events recorded.
The same goes for describing the pod.
All the pods in kube-system are ok.
There is also metallb deployed.
I have observed a few logs in the kube-scheduler pod that are suspicious and complain about connection timed out issues.
The kind config I am using:
And the kind version:
|
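For anyone comparing notes, the suspicious scheduler and controller-manager logs mentioned above can be pulled like this, assuming the default control-plane node name kind-control-plane:
kubectl -n kube-system logs kube-scheduler-kind-control-plane
kubectl -n kube-system logs kube-controller-manager-kind-control-plane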
your containers have changed their IP after the restart, the control plane now is 172.18.0.2
and before it was 172.18.0.3
docker assigns IPs randomly; either your containers restart with the same IPs that they had before, or your cluster will not work |
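To check which IP each node container ended up with after the restart, docker inspect can be used, assuming the default kind network name and node names:
docker inspect -f '{{.NetworkSettings.Networks.kind.IPAddress}}' kind-control-plane
docker inspect -f '{{.NetworkSettings.Networks.kind.IPAddress}}' kind-worker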
yes @aojea thanks, just noticed that. After a few restarts the control plane returned to its original 172.18.0.3 and now things are fine, but during the next restart it will again assign IPs randomly and it will fail. What can be done about this? I thought this issue only affected HA control-plane clusters. Is this outcome expected, or can we change a few configs? |
that is how IP assignment works in docker; making it more predictable from KIND would require overcomplicating the code and would cause compatibility problems ... the ideal solution would be for docker IP assignment to try to keep the same IPs after a restart |
Hey guys, what do you think of this?
Assumption: when linking a docker container to another, the linked container must already exist, and its name is resolvable via /etc/hosts:
docker run --rm -d --name first alpine:3.14 sleep inf
docker run --rm -d --name second --link=first alpine:3.14 sleep inf
Checking first:
docker exec -it first sh
/ # cat /etc/hosts
127.0.0.1 localhost
...
172.17.0.2 4834508c1103
Checking second:
docker exec -it second sh
/ # cat /etc/hosts
127.0.0.1 localhost
...
172.17.0.2 first 4834508c1103
172.17.0.3 4e7d766612d9
The second container gets a hosts entry for first. Trying to start second while first is not running:
docker run --rm -d --name second --link=first alpine:3.14 sleep inf
docker: Error response from daemon: could not get container for first: No such container: first.
What if we used this mechanism to order the start of the node containers? When creating the nodes, add something like:
control-plane containers link to their predecessors:
args = append(args, "--link=kind-control-plane", "--link=kind-control-plane2")
worker containers link to all control-plane containers:
args = append(args, "--link=kind-control-plane", "--link=kind-control-plane2", "--link=kind-control-plane3")
What do you think? Is it worth a shot? |
Docker --link is deprecated. |
I faced the same issue as #1685 (comment) in a multi-node cluster with a single control-plane node. I saw that the only issue is that kube-controller-manager and kube-scheduler cannot connect to kube-apiserver. Since enable_network_magic already does some magic there, I tried replacing the stale local node IP with the loopback address in the same place, and the cluster works after that. I created #2671 with the above change to see if it's an acceptable approach for fixing the restart case of a multi-node cluster with a single control-plane node. |
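A quick way to check whether a stale node IP is the culprit in this single-control-plane case is to look at the API server address baked into the kubeconfigs on the node (standard kubeadm paths, default node name assumed):
docker exec kind-control-plane grep server: /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf /etc/kubernetes/kubelet.conf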
I created a kind cluster with the following YAML:
And then restarted docker (same as rebooting the machine):
$ sudo systemctl stop docker
Result: kind-external-load-balancer disappears, and even if I forcibly rewrite the cluster URL to the control-plane's IP address, Pod deployments stay Pending forever.
Does kind not support restarting the machine?
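To see which kind containers survived the restart (including kind-external-load-balancer), listing them by name prefix helps:
docker ps -a --filter name=kind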
Ref: