Multiple Kube-System pods not running with Unknown Status #6185
Comments
The kubectl output you provided doesn't really give enough info to figure out why the pods are not ready. I would probably check the k3s and containerd logs on the nodes hosting those pods; there may be some useful messages in there that suggest what to investigate next.
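For anyone else landing here, a minimal sketch of how those logs can be collected on an affected node, assuming a default k3s install; the containerd log path is the location the k3s-bundled containerd writes to, and the service is named k3s-agent on agent-only nodes:

```sh
# k3s service logs (use "k3s-agent" on agent-only nodes)
journalctl -u k3s --since "1 hour ago" --no-pager

# log file of the containerd bundled with k3s
tail -n 200 /var/lib/rancher/k3s/agent/containerd/containerd.log

# status and events for the stuck pods
kubectl -n kube-system get pods -o wide
kubectl -n kube-system describe pod <stuck-pod-name>
```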
The solution that I found to resolve the problem, at least for now, was to manually restart all of the deployments found in the kube-system namespace (one way to do that is sketched below).
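A hedged sketch of that manual restart, assuming plain kubectl access against the cluster; restarting every deployment in a namespace with rollout restart is standard kubectl, but the exact set of deployments will vary per cluster:

```sh
# list the deployments in kube-system, then restart all of them
kubectl -n kube-system get deployments
kubectl -n kube-system rollout restart deployment
```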
Were you able to collect any logs that might shed some light on why they were in that state to begin with?
I ran into this today. journald had a bunch of these logs:
Redeploying the kube-system stuff and manually removing the pods that were stuck in Terminating eventually got things going again here. Ref. containerd/containerd#4016 (comment)
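A sketch of the "manually removing" part, assuming the usual force-delete approach; the pod name is a placeholder:

```sh
# find pods stuck in Terminating or Unknown
kubectl -n kube-system get pods | grep -E 'Terminating|Unknown'

# force-remove a stuck pod (skips graceful deletion, so use with care)
kubectl -n kube-system delete pod <stuck-pod-name> --grace-period=0 --force
```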
@vwbusguy what version of K3s are you on? That's a pretty old issue.
v1.25.4+k3s1
Any idea what happened on your host to trigger that? Sounds like something corrupted or truncated the CNI state files on disk.
@brandond would repeated kernel panics/segfaults on the hypervisor, requiring hard reboots, sound like a plausible root cause? I just ended up with this - one pod from the affected node is permanently stuck.
However, I am still not able to get rid of the pod - the container ID does not appear in
Yes - a recently botched SELinux update with CentOS Stream 9. The policy failed to compile and much of the k3s filesystem stopped being writable as SELinux started blocking everything. SELinux was disabled at grub, and then I had to work through these errors to recover it. I've been running it with SELinux enabled for a year on CS9 and this is the first time something like this has happened. I'm not sure whether it's k3s or CS9 related, but in any case, the cleanup/recovery could have been done more elegantly from the k3s side of things, since the lower-level calls worked just fine.
One of the nodes of my v1.24.3+k3s1 cluster was spamming these messages in the journal for a couple of months until I was able to finally clean this up.
For other folks who might encounter this, you can make the journal spam stop by following these steps:
This worked to stop the journal spam for me! All of the mentioned files were two months old from an... episode that my cluster and I had back then. The machine had been rebooted a few times since then, and
I'm having the same problem after some restarts of k3s/containerd, and doing what @chrisguidry described in #6185 (comment) seems to work (although I had to explicitly remove the old pod via kubectl/crictl).
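For the crictl part of that cleanup, a hedged sketch assuming the crictl bundled with k3s is used (invoked as `k3s crictl ...`); the sandbox ID is a placeholder:

```sh
# list pod sandboxes and containers as containerd sees them
k3s crictl pods
k3s crictl ps -a

# stop and remove a stale pod sandbox by its ID
k3s crictl stopp <pod-sandbox-id>
k3s crictl rmp <pod-sandbox-id>
```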
I am also facing this issue when I restart the node and call
Just providing more info to see if we could find the issue. The k3s version is v1.24.6.
Don't do that? You can't reason about the state of the cluster before all the components, including the kubelets and controller-manager, are up and ready.
@harrison-tin Is this behavior reproducible for you? So when you restart the node and call the k3s API before the cluster is ready, will the pods always (and forever, until a manual fix) be stuck in the Unknown state?
@ChrisIgel the behavior is always reproducible (always stuck in the Unknown state).
Okay, I've tried multiple different things but the behavior you describe is not reproducible for me, so it seems we're having two different issues :(
Copy paste:

k3s ctr container rm $(find /var/lib/cni/flannel/ -size 0 | xargs -n1 basename)
find /var/lib/cni/flannel/ -size 0 -delete
find /var/lib/cni/results/ -size 0 -delete   # cached result from #6185 (comment)
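The quoted snippet stops there; as an assumption on top of it, restarting the service afterwards so fresh CNI state gets written seems like a reasonable follow-up:

```sh
# restart after the cleanup (the service is "k3s-agent" on agent-only nodes)
systemctl restart k3s
```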
Seeing this as well. Happens every now and then on a fleet of hundreds of k3s clusters.
For me this happened after a hard poweroff as well. It would be nice if it could recover on its own. It sounds like deleting corrupted files that are only there for caching should be safe.
Yep, indeed. We are now rolling out external mitigations that delete the files on boot (we're seeing this quite a lot now, like 5-10 a week, due to having many clusters in locations with unstable power). Given the failure mode of zero-byte files, it's probably a missing fsync somewhere in the K3s code (by default, ext4 journals metadata only, not data). Given the ephemeral nature of the file data, K3s should indeed be able to detect this and recover. We are planning to look into the code, reproduce it, fix it, and send a PR, but have not had the cycles yet.
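A hedged sketch of such a boot-time mitigation, using only the paths already quoted in this thread; in practice it would be wired up as something like a systemd oneshot unit ordered before the k3s service, and it is an illustration rather than the exact mitigation described above:

```sh
#!/bin/sh
# Remove zero-byte CNI state files left behind by an unclean shutdown, so that
# flannel/containerd regenerate them instead of failing on truncated files.
find /var/lib/cni/flannel/ /var/lib/cni/results/ -type f -size 0 -delete 2>/dev/null || true
```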
Note that the affected code is packaged and distributed by this project, but IS NOT part of our codebase. You'll want to look at the upstream projects (containerd, flannel, and so on) to solve these issues.
Also seeing this on
Apparently this is fixed in Flannel CNI plugin release 1.1.2.
@tuupola this should be addressed by
Same issue with latest k3s. I don't think any of the nodes were shut down.
@ad-zsolt-imre please open a new issue if you are having problems. Fill out the issue template and provide steps to reproduce if possible.
Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1
Node(s) CPU architecture, OS, and Version:
Five RPi 4s running headless 64-bit Raspbian, each with the following information:
Linux 5.15.56-v8+ #1575 SMP PREEMPT Fri Jul 22 20:31:26 BST 2022 aarch64 GNU/Linux
Cluster Configuration:
3 Nodes configured as control plane, 2 Nodes as Worker Nodes
Describe the bug:
The pods coredns-b96499967-ktgtc, local-path-provisioner-7b7dc8d6f5-5cfds, metrics-server-668d979685-9szb9, traefik-7cd4fcff68-gfmhm, and svclb-traefik-aa9f6b38-j27sw are at status Unknown, with 0/1 ready. This means that the cluster DNS service does not work and therefore that pods are not able to resolve internal or external names.
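A quick way to confirm the DNS symptom described above (a hedged check; the busybox image tag and the dns-test pod name are arbitrary choices):

```sh
# kube-system pods stuck in Unknown
kubectl -n kube-system get pods

# in-cluster DNS lookup; this fails while CoreDNS is down
kubectl run dns-test --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default
```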
Steps To Reproduce:
Expected behavior:
The important pods should be running, with a known status. Additionally, DNS should work, which means that, among other things, headless services should work and pods should be able to resolve hostnames inside and outside the cluster.
Actual behavior:
The DNS pods are stuck in an Unknown state, pods are not able to resolve hostnames inside or outside the cluster, and headless services do not work.
Additional context / logs:
Description of Relevant Pods: