
Restarting the host causes the control plane to stop working #2640

Closed
DonRichie opened this issue Feb 20, 2022 · 5 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@DonRichie

Hello,

Situation:

I have a kind cluster in a virtual machine. I deployed the Kubernetes dashboard into it via Helm.
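
For reference, the install was roughly the following (the repository URL and chart name are taken from the upstream kubernetes-dashboard Helm chart, so treat this as a sketch of my setup rather than the exact commands):

# Sketch: installing the dashboard via Helm into the "dashboard" namespace
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
helm repo update
helm install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --namespace dashboard --create-namespace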

Problem:

If I shut down the VM and restart it, the kind cluster's controllers seem to be stuck.
For example, if I delete all pods in the "dashboard" namespace, they are not recreated. I would expect the controller to recreate the desired state according to the deployments.
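
Concretely, what I do is roughly this (namespace name as in my setup):

# Delete everything in the namespace and watch whether it comes back
kubectl delete pods --all -n dashboard
kubectl get pods -n dashboard          # stays empty, nothing is recreated
kubectl get deployments -n dashboard   # desired replicas are set, but no new pods appear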

What I tried:

  • There are no events indicating a failure. The deployment also shows no activity. Everything seems to be stuck.
  • I tried to restart the control plane container with "docker restart kind-control-plane".
  • I opened a shell in the control plane container; the kube-controller-manager process appears to be running.
  • The Kubernetes API is reachable at all times while kind-control-plane is up; rough commands for these checks are sketched below.
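
Roughly the commands behind the checks above (the container name is the default for a kind control-plane node; details may differ in my setup):

# Restart the control plane container
docker restart kind-control-plane
# Open a shell inside it and check the controller manager
docker exec -it kind-control-plane bash
ps aux | grep kube-controller-manager   # the process is there
# The API answers as long as kind-control-plane is up
kubectl cluster-info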

Question:

Does anyone have an idea of how I can debug this problem, or what might be causing it?
Currently I completely reinstall the virtual machine every time to get the cluster working again.

Please let me know what additional information I can provide.

@DonRichie DonRichie added the kind/support label on Feb 20, 2022
@BenTheElder
Member

kind export logs contains a wealth of information for debugging. It's hard to say what broke, but restarts may sometimes break things, particularly with multi-node clusters: #2045
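
For example (the output directory is optional; without it, logs go to a temp dir, and --name is only needed for non-default cluster names):

kind export logs
kind export logs ./kind-logs --name <cluster-name>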

@DonRichie
Author

DonRichie commented Feb 24, 2022

It is happening right now. I needed to restart the server because VirtualBox showed an error.

The Pods are stuck and show this:

pod/openwhisk-alarmprovider-687b588f44-mzkbc     0/1     Init:0/1   0          3d3h
pod/openwhisk-apigateway-7479b6b55f-l8zlf        1/1     Running    1          3d3h
pod/openwhisk-controller-0                       0/1     Init:0/2   0          3d3h
pod/openwhisk-couchdb-7dfc856854-g942m           1/1     Running    1          3d3h
pod/openwhisk-invoker-0                          0/1     Init:0/1   0          3d3h
pod/openwhisk-kafka-0                            0/1     Init:0/1   1          3d3h
pod/openwhisk-kafkaprovider-6c69bd4788-djskk     0/1     Init:0/1   0          3d3h
pod/openwhisk-nginx-7dc79594cd-ws92c             0/1     Init:0/1   1          15h
pod/openwhisk-redis-5bb7b9c5d5-2nc82             1/1     Running    1          3d3h
pod/openwhisk-zookeeper-0                        1/1     Running    1          3d3h
pod/wskopenwhisk-invoker-00-1-prewarm-nodejs10   1/1     Running    1          3d3h
pod/wskopenwhisk-invoker-00-2-prewarm-nodejs10   1/1     Running    1          3d3h

Here is the output of "kind export logs":
kind_export_logs.tar.gz

As I read in the other issue you linked, this may have something to do with my cluster having two worker nodes.
I will reinstall the VM and switch to a cluster with one control-plane node and one worker node. I hope that works.
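
The config I plan to use would look roughly like this (the standard kind.x-k8s.io/v1alpha4 node layout; the file name is just an example), created with "kind create cluster --config kind-config.yaml":

# kind-config.yaml: one control-plane node and one worker node
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker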

I am really confused why such an error exists and why you write in the other issue that it has no priority.
People are trying out Kubernetes with kind and will often have more than one worker node (since that is necessary for trying out some features), and I am sure they restart their computers from time to time.
Every single one of them will wonder why the cluster broke and get frustrated.

I would like to democratically vote for a higher priority. In any case, thank you for your work.

@DonRichie
Author

Daily story of pain; you can ignore this:

1. Okay, I deleted the kind cluster. Let's apply my Ansible playbook again to create the cluster.
-> Oh, an error, no cluster created? But why?
TASK [kubernetes_kind : create kind cluster, using custom config] **************
skipping: [kubernetes-serverless]

Ansible simply skipped the cluster creation.

2. Okay, I found it out: in my playbook I assumed I only needed to create the cluster once.
Fixed it.

3. Cluster created again.
Luckily my playbook already installs the software via Helm.

But it doesn't create the token for the Kubernetes dashboard (yet). I need to do that manually.

4. Oh, I can't do anything; the .kube/config changed. I need to copy it from the virtual machine to my host machine (rough commands are sketched after this list).

5. Okay, the token is finally created and the dashboard works again.
Luckily I had noted down the steps to make the dashboard work.

6. Now I need to reproduce the changes I made to the software I actually want to work with:
reconfiguring nginx here, configuring OpenWhisk there, creating actions, ...

7. Awesome, I can work again.
Oh, two hours of my day are already gone and I basically did nothing?
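
For step 4, what this boils down to is roughly the following (cluster name is the kind default; "user@vm" is a placeholder for my VM's address):

# Inside the VM: regenerate the kubeconfig for the recreated cluster
kind get kubeconfig > ~/.kube/config
# On the host machine: copy it over
scp user@vm:~/.kube/config ~/.kube/config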

Luckily I can suspend the virtual machine overnight to prevent restarts; that might be my solution for now.
But this issue is a huge problem. Please fix it with high priority.

@BenTheElder
Member

BenTheElder commented Feb 24, 2022

I'm sorry, but I have very limited time to work on this right now. I review and approve PRs, triage bug reports, etc., but the Kubernetes project has pressing work elsewhere (e.g. we are exceeding our $3M/year GCP budget), and I have other obligations (e.g. writing peer feedback for performance reviews at work).
EDIT: also, as you can see, there are many other open issues.

This seems to be a duplicate of #2045, which has much more context on the situation.

I highly recommend using a single-node cluster as well, unless you have a strong, concrete need for multiple nodes. The nodes share the same host resources, and multi-node support is only implemented because it is required for testing some Kubernetes internals (see also: https://kind.sigs.k8s.io/docs/contributing/project-scope/). For developing Kubernetes applications, a single node is preferable and better supported.
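
For example, a plain create call already gives you a single control-plane node (which also schedules regular workloads), no config file needed:

kind create cluster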

If you'd like to help resolve multi-node reboots, please take a look at #2045

@DonRichie
Author

Okay, thank you. I will somehow manage to work with it.
I wish you much success with your responsibilities.
