
Restarting the host causes the control plane to stop working #2640

Closed
DonRichie opened this issue Feb 20, 2022 · 5 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@DonRichie

Hello,

Situation:

I have a kind cluster in a virtual machine. I deployed the Kubernetes dashboard into it via Helm.
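
For reference, the install was roughly the following (the repository URL and chart name are taken from the upstream kubernetes-dashboard Helm chart, so treat this as a sketch of my setup rather than the exact commands):

# Sketch: installing the dashboard via Helm into the "dashboard" namespace
helm repo add kubernetes-dashboard https://kubernetes.github.io/dashboard/
helm repo update
helm install kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard --namespace dashboard --create-namespace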

Problem:

If I shut down the VM and restart it, the kind cluster's controllers seem to be stuck.
For example, if I delete all pods in the "dashboard" namespace, they are not recreated. I would expect the controller to recreate the desired state according to the deployments.
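
Concretely, what I do is roughly this (namespace name as in my setup):

# Delete everything in the namespace and watch whether it comes back
kubectl delete pods --all -n dashboard
kubectl get pods -n dashboard          # stays empty, nothing is recreated
kubectl get deployments -n dashboard   # desired replicas are set, but no new pods appear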

What I tried:

  • There are no events indicating a failure. The deployment also shows no activity. Everything seems to be stuck.
  • I tried to restart the control plane container with "docker restart kind-control-plane".
  • I opened a shell in the control plane container; the kube-controller-manager process appears to be running.
  • The Kubernetes API is reachable at all times while kind-control-plane is up; rough commands for these checks are sketched below.
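
Roughly the commands behind the checks above (the container name is the default for a kind control-plane node; details may differ in my setup):

# Restart the control plane container
docker restart kind-control-plane
# Open a shell inside it and check the controller manager
docker exec -it kind-control-plane bash
ps aux | grep kube-controller-manager   # the process is there
# The API answers as long as kind-control-plane is up
kubectl cluster-info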

Question:

Does anyone have an idea of how I can debug this problem, or what might be causing it?
Currently I completely reinstall the virtual machine every time to get the cluster working again.

Please let me know what additional information I can provide.

@DonRichie DonRichie added the kind/support label on Feb 20, 2022
@BenTheElder
Member

kind export logs contains a wealth of information for debugging. It's hard to say what broke, but restarts may sometimes break things, particularly with multi-node clusters: #2045
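
For example (the output directory is optional; without it, logs go to a temp dir, and --name is only needed for non-default cluster names):

kind export logs
kind export logs ./kind-logs --name <cluster-name>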

@DonRichie
Author

DonRichie commented Feb 24, 2022

It is happening right now. I needed to restart the server because VirtualBox showed an error.

The Pods are stuck and show this:

pod/openwhisk-alarmprovider-687b588f44-mzkbc     0/1     Init:0/1   0          3d3h
pod/openwhisk-apigateway-7479b6b55f-l8zlf        1/1     Running    1          3d3h
pod/openwhisk-controller-0                       0/1     Init:0/2   0          3d3h
pod/openwhisk-couchdb-7dfc856854-g942m           1/1     Running    1          3d3h
pod/openwhisk-invoker-0                          0/1     Init:0/1   0          3d3h
pod/openwhisk-kafka-0                            0/1     Init:0/1   1          3d3h
pod/openwhisk-kafkaprovider-6c69bd4788-djskk     0/1     Init:0/1   0          3d3h
pod/openwhisk-nginx-7dc79594cd-ws92c             0/1     Init:0/1   1          15h
pod/openwhisk-redis-5bb7b9c5d5-2nc82             1/1     Running    1          3d3h
pod/openwhisk-zookeeper-0                        1/1     Running    1          3d3h
pod/wskopenwhisk-invoker-00-1-prewarm-nodejs10   1/1     Running    1          3d3h
pod/wskopenwhisk-invoker-00-2-prewarm-nodejs10   1/1     Running    1          3d3h

Here is the output of "kind export logs":
kind_export_logs.tar.gz

As I read in the other issue you linked, this may have something to do with my cluster having two worker nodes.
I will reinstall the VM and switch to a cluster with one control-plane node and one worker node. I hope that works.
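
The config I plan to use would look roughly like this (the standard kind.x-k8s.io/v1alpha4 node layout; the file name is just an example), created with "kind create cluster --config kind-config.yaml":

# kind-config.yaml: one control-plane node and one worker node
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker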

I am really confused why such an error exists and why you write in the other issue that it has no priority.
People are trying out Kubernetes with kind and will often have more than one worker node (since that is necessary for trying out some features), and I am sure they restart their computers from time to time.
Every single one of them will wonder why the cluster broke and get frustrated.

I would like to democratically vote for a higher priority. In any case, thank you for your work.

@DonRichie
Author

Daily story of pain; you can ignore this:

1. Okay, I deleted the kind cluster. Let's apply my Ansible playbook again to create the cluster.
-> Oh, an error, no cluster created? But why?
TASK [kubernetes_kind : create kind cluster, using custom config] **************
skipping: [kubernetes-serverless]

Ansible simply skipped the cluster creation.

2. Okay, I found it out: in my playbook I assumed I only needed to create the cluster once.
Fixed it.

3. Cluster created again.
Luckily my playbook already installs the software via Helm.

But it doesn't create the token for the Kubernetes dashboard (yet). I need to do that manually.

4. Oh, I can't do anything; the .kube/config changed. I need to copy it from the virtual machine to my host machine (rough commands are sketched after this list).

5. Okay, the token is finally created and the dashboard works again.
Luckily I had noted down the steps to make the dashboard work.

6. Now I need to reproduce the changes I made to the software I actually want to work with:
reconfiguring nginx here, configuring OpenWhisk there, creating actions, ...

7. Awesome, I can work again.
Oh, two hours of my day are already gone and I basically did nothing?
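
For step 4, what this boils down to is roughly the following (cluster name is the kind default; "user@vm" is a placeholder for my VM's address):

# Inside the VM: regenerate the kubeconfig for the recreated cluster
kind get kubeconfig > ~/.kube/config
# On the host machine: copy it over
scp user@vm:~/.kube/config ~/.kube/config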

Luckily I can suspend the virtual machine overnight to prevent restarts; that might be my solution for now.
But this issue is a huge problem. Please fix it with high priority.

@BenTheElder
Member

BenTheElder commented Feb 24, 2022

I'm sorry, but I have very limited time to work on this right now. I review and approve PRs, triage bug reports, etc., but the Kubernetes project has pressing work elsewhere (e.g. we are exceeding our $3M/year GCP budget), and I have other obligations (e.g. writing peer feedback for performance reviews at work).
EDIT: also, as you can see, there are many other open issues.

This seems to be a duplicate of #2045, which has much more context on the situation.

I highly recommend using a single-node cluster as well, unless you have a strong, concrete need for multiple nodes. The nodes share the same host resources, and multi-node support is only implemented because it is required for testing some Kubernetes internals (see also: https://kind.sigs.k8s.io/docs/contributing/project-scope/). For developing Kubernetes applications, a single node is preferable and better supported.
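
For example, a plain create call already gives you a single control-plane node (which also schedules regular workloads), no config file needed:

kind create cluster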

If you'd like to help resolve multi-node reboots, please take a look at #2045

@DonRichie
Author

Okay, thank you. I will somehow manage to work with it.
I wish you much success with your responsibilities.
