Multiple Kube-System pods not running with Unknown Status #6185
Comments
The kubectl output you provided doesn't really give enough info to figure out why the pods are not ready. I would probably check the k3s and containerd logs on the nodes hosting those pods; there may be some useful messages in there that suggest what to investigate next.
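For anyone else landing here, a minimal sketch of how those logs can be collected on an affected node, assuming a default k3s install; the containerd log path is the location the k3s-bundled containerd writes to, and the service is named k3s-agent on agent-only nodes:

```sh
# k3s service logs (use "k3s-agent" on agent-only nodes)
journalctl -u k3s --since "1 hour ago" --no-pager

# log file of the containerd bundled with k3s
tail -n 200 /var/lib/rancher/k3s/agent/containerd/containerd.log

# status and events for the stuck pods
kubectl -n kube-system get pods -o wide
kubectl -n kube-system describe pod <stuck-pod-name>
```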
The solution that I found to resolve the problem, at least for now, was to manually restart all of the deployments found in the kube-system namespace (one way to do that is sketched below).
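A hedged sketch of that manual restart, assuming plain kubectl access against the cluster; restarting every deployment in a namespace with rollout restart is standard kubectl, but the exact set of deployments will vary per cluster:

```sh
# list the deployments in kube-system, then restart all of them
kubectl -n kube-system get deployments
kubectl -n kube-system rollout restart deployment
```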
Were you able to collect any logs that might shed some light on why they were in that state to begin with?
I ran into this today. journald had a bunch of these logs:
Redeploying the kube-system stuff and manually removing the pods that were stuck in Terminating eventually got things going again here. Ref. containerd/containerd#4016 (comment)
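A sketch of the "manually removing" part, assuming the usual force-delete approach; the pod name is a placeholder:

```sh
# find pods stuck in Terminating or Unknown
kubectl -n kube-system get pods | grep -E 'Terminating|Unknown'

# force-remove a stuck pod (skips graceful deletion, so use with care)
kubectl -n kube-system delete pod <stuck-pod-name> --grace-period=0 --force
```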
@vwbusguy what version of K3s are you on? That's a pretty old issue.
v1.25.4+k3s1
Any idea what happened on your host to trigger that? Sounds like something corrupted or truncated the CNI state files on disk.
@brandond would repeated kernel panics/segfaults on the hypervisor, requiring hard reboots, sound like a plausible root cause? I just ended up with this - one pod from the affected node is permanently stuck.
However, I am still not able to get rid of the pod - the container ID does not appear in
Yes - a recently botched SELinux update with CentOS Stream 9. The policy failed to compile and much of the k3s filesystem stopped being writable as SELinux started blocking everything. SELinux was disabled at grub, and then I had to work through these errors to recover it. I've been running it with SELinux enabled for a year on CS9 and this is the first time something like this has happened. I'm not sure whether it's k3s or CS9 related, but in any case, the cleanup/recovery could have been done more elegantly from the k3s side of things, since the lower-level calls worked just fine.
One of the nodes of my v1.24.3+k3s1 cluster was spamming these messages in the journal for a couple of months until I was able to finally clean this up.
For other folks who might encounter this, you can make the journal spam stop by following these steps:
This worked to stop the journal spam for me! All of the mentioned files were two months old from an... episode that my cluster and I had back then. The machine had been rebooted a few times since then, and
I'm having the same problem after some restarts of k3s/containerd, and doing what @chrisguidry described in #6185 (comment) seems to work (although I had to explicitly remove the old pod via kubectl/crictl).
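For the crictl part of that cleanup, a hedged sketch assuming the crictl bundled with k3s is used (invoked as `k3s crictl ...`); the sandbox ID is a placeholder:

```sh
# list pod sandboxes and containers as containerd sees them
k3s crictl pods
k3s crictl ps -a

# stop and remove a stale pod sandbox by its ID
k3s crictl stopp <pod-sandbox-id>
k3s crictl rmp <pod-sandbox-id>
```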
I am also facing this issue when I restart the node and call
Just providing more info to see if we could find the issue. The k3s version is v1.24.6.
Don't do that? You can't reason about the state of the cluster before all the components, including the kubelets and controller-manager, are up and ready.
@harrison-tin Is this behavior reproducible for you? So when you restart the node and call the k3s API before the cluster is ready, will the pods always (and forever, until a manual fix) be stuck in the Unknown state?
@ChrisIgel the behavior is always reproducible (always stuck in the Unknown state).
Okay, I've tried multiple different things but the behavior you describe is not reproducible for me, so it seems we're having two different issues :(
Copy paste:

k3s ctr container rm $(find /var/lib/cni/flannel/ -size 0 | xargs -n1 basename)
find /var/lib/cni/flannel/ -size 0 -delete
find /var/lib/cni/results/ -size 0 -delete   # cached result from #6185 (comment)
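The quoted snippet stops there; as an assumption on top of it, restarting the service afterwards so fresh CNI state gets written seems like a reasonable follow-up:

```sh
# restart after the cleanup (the service is "k3s-agent" on agent-only nodes)
systemctl restart k3s
```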
Seeing this as well. Happens every now and then on a fleet of hundreds of k3s clusters.
For me this happened after a hard poweroff as well. It would be nice if it could recover on its own. It sounds like deleting corrupted files that are only there for caching should be safe.
Yep, indeed. We are now rolling out external mitigations that delete the files on boot (we're seeing this quite a lot now, like 5-10 a week, due to having many clusters in locations with unstable power). Given the failure mode of zero-byte files, it's probably a missing fsync somewhere in the K3s code (by default, ext4 journals metadata only, not data). Given the ephemeral nature of the file data, K3s should indeed be able to detect this and recover. We are planning to look into the code, reproduce it, fix it, and send a PR, but have not had the cycles yet.
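A hedged sketch of such a boot-time mitigation, using only the paths already quoted in this thread; in practice it would be wired up as something like a systemd oneshot unit ordered before the k3s service, and it is an illustration rather than the exact mitigation described above:

```sh
#!/bin/sh
# Remove zero-byte CNI state files left behind by an unclean shutdown, so that
# flannel/containerd regenerate them instead of failing on truncated files.
find /var/lib/cni/flannel/ /var/lib/cni/results/ -type f -size 0 -delete 2>/dev/null || true
```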
Note that the affected code is packaged and distributed by this project, but IS NOT part of our codebase. You'll want to look at the upstream projects (containerd, flannel, and so on) to solve these issues.
Also seeing this on
Apparently this is fixed in Flannel CNI plugin release 1.1.2.
@tuupola this should be addressed by
Same issue with latest k3s. I don't think any of the nodes were shut down.
@ad-zsolt-imre please open a new issue if you are having problems. Fill out the issue template and provide steps to reproduce if possible.
Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1
Node(s) CPU architecture, OS, and Version:
Five RPi 4s running headless 64-bit Raspbian, each with the following information:
Linux 5.15.56-v8+ #1575 SMP PREEMPT Fri Jul 22 20:31:26 BST 2022 aarch64 GNU/Linux
Cluster Configuration:
3 Nodes configured as control plane, 2 Nodes as Worker Nodes
Describe the bug:
The pods coredns-b96499967-ktgtc, local-path-provisioner-7b7dc8d6f5-5cfds, metrics-server-668d979685-9szb9, traefik-7cd4fcff68-gfmhm, and svclb-traefik-aa9f6b38-j27sw are at status Unknown, with 0/1 ready. This means that the cluster DNS service does not work and therefore that pods are not able to resolve internal or external names.
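A quick way to confirm the DNS symptom described above (a hedged check; the busybox image tag and the dns-test pod name are arbitrary choices):

```sh
# kube-system pods stuck in Unknown
kubectl -n kube-system get pods

# in-cluster DNS lookup; this fails while CoreDNS is down
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default
```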
Steps To Reproduce:
Expected behavior:
The important pods should be running, with a known status. Additionally, DNS should work, which means that, among other things, headless services should work and pods should be able to resolve hostnames inside and outside the cluster.
Actual behavior:
The DNS pods are stuck in an Unknown state, pods are not able to resolve hostnames inside or outside the cluster, and headless services do not work.
Additional context / logs:
Description of Relevant Pods: