
Empty cache files cause "KillPodSandbox" errors when deleting pods #8197

Open
mrparkers opened this issue Mar 2, 2023 · 11 comments


mrparkers commented Mar 2, 2023

Description

I originally reported this issue at aws/amazon-vpc-cni-k8s#2283, but I was asked to post the issue here instead.

I'm running into a problem where a node will occasionally have a bunch of pods stuck in the Terminating state because the pod's sandbox can't be deleted:

Pod event:

Warning  FailedKillPod   43s (x229 over 49m)  kubelet  error killing pod: failed to "KillPodSandbox" for "a7ea2280-7361-4329-8566-89975485b20e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the containerd side, I see similar logs. First, the reason given is cni plugin not initialized:

time="2023-03-01T20:08:10.167194705Z" level=info msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\""
time="2023-03-01T20:08:10.167346791Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": cni plugin not initialized"

This repeats a few times, then the error message changes to unexpected end of JSON input:

time="2023-03-01T20:08:34.067124024Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the host itself, I noticed that the /var/lib/cni/results directory contains a cache file for the container with the ID 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399 that's completely empty (zero bytes):

bash-5.1# ls -alh /var/lib/cni/results | grep 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399
-rw-------. 1 root root    0 Mar 1 06:31 aws-cni-4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399-eth0

Meanwhile, the other aws-cni-*-eth0 files are not empty and contain valid JSON.
So I believe the unexpected end of JSON input error message comes from containerd attempting to json.Unmarshal an empty file, which appears to happen here:

err := json.Unmarshal(jsonBytes, &conf)
if err != nil {
	return "", fmt.Errorf("decoding version from network config: %w", err)
}
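
For what it's worth, the error string itself is easy to reproduce in isolation. This is only a minimal standalone sketch (not containerd or libcni code) showing that json.Unmarshal on zero-length input produces exactly this message:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate reading an empty cache file from /var/lib/cni/results.
	empty := []byte{}

	var conf struct {
		CNIVersion string `json:"cniVersion"`
	}

	// encoding/json rejects zero-length input with "unexpected end of JSON input",
	// which is the message wrapped by "decoding version from network config: %w".
	if err := json.Unmarshal(empty, &conf); err != nil {
		fmt.Printf("decoding version from network config: %v\n", err)
	}
}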

I am able to temporarily resolve the issue by deleting the empty cache file. The stuck pods can be deleted immediately once I do that.

Another note: it looks like most of the caching logic is actually handled in https://github.com/containernetworking/cni, so perhaps this particular issue needs to be raised upstream. I'm not too sure.


My theory is that some other issue is causing the AWS VPC CNI plugin to crash or experience some issues, and containerd is mistakenly writing an empty cache file to /var/lib/cni/results while this is happening. Then, when the AWS VPC CNI plugin starts working again, this empty cache file is read, causing the error I'm seeing.

Steps to reproduce the issue

I wish I was able to reliably reproduce the underlying issue. The best I've been able to do is to delete the contents of one of the cache files within /var/lib/cni/results, which will reproduce the unexpected end of JSON input error message. However, I have been unable to reproduce the issue that causes the empty cache file to exist in the first place.
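
For reference, emptying an existing cache file with something like the following (the container ID is a placeholder; use any existing file) is enough to trigger the decode error for that sandbox:

bash-5.1# : > /var/lib/cni/results/aws-cni-<container-id>-eth0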

Describe the results you received and expected

I would expect that an empty cache file is never written to /var/lib/cni/results. Or, if that does happen, perhaps containerd could detect the bad cache file and remove it before attempting to read from it.
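
As a rough illustration of that second option, here is only a sketch of the kind of guard I have in mind (the function name and signature are made up; this is not containerd's or libcni's actual code):

package cache

import "os"

// readCachedResult treats a missing or zero-byte cache file as "no cached result"
// instead of failing the sandbox teardown with a JSON decode error.
func readCachedResult(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // no cache entry at all
		}
		return nil, err
	}
	if len(data) == 0 {
		// A zero-byte file can never be valid JSON; drop it and report "no cache"
		// so the caller can fall back to its normal teardown path.
		_ = os.Remove(path)
		return nil, nil
	}
	return data, nil
}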

What version of containerd are you using?

ctr github.com/containerd/containerd 1.6.15+bottlerocket

Any other relevant information

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux ip-10-0-143-2.us-west-2.compute.internal 5.15.79 #1 SMP Wed Jan 25 03:35:23 UTC 2023 aarch64 GNU/Linux

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

The CNI plugin is https://github.com/aws/amazon-vpc-cni-k8s v1.12.0.

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Show configuration if it is related to CRI plugin.

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
disabled_plugins = [
    "io.containerd.internal.v1.opt",
    "io.containerd.snapshotter.v1.aufs",
    "io.containerd.snapshotter.v1.devmapper",
    "io.containerd.snapshotter.v1.native",
    "io.containerd.snapshotter.v1.zfs",
]

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri"]
enable_selinux = true
# Pause container image is specified here, shares the same image as kubelet's pod-infra-container-image
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "shimpei"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/cri-base.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei.options]
SystemdCgroup = true
BinaryName = "shimpei"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker-registry.redacted.com"]

gbucknel commented Mar 7, 2023

Hi @mrparkers, I think I'm hitting the exact same thing as you. I'm on all the same versions and am even using a registry mirror.

I noticed that the empty aws-cni file has a creation time that coincides with a host restart (16:36:15 below). I was wondering if you see the same thing?

bash-5.1# ls -la --time-style=full-iso /var/lib/cni/results/
total 16
drwx------. 2 root root 4096 2023-03-07 05:07:37.644843952 +0000 .
drwx------. 3 root root 4096 2023-03-05 16:12:59.115640029 +0000 ..
-rw-------. 1 root root 2289 2023-03-05 16:37:32.858972330 +0000 aws-cni-aae559259afb19d5086194c3e0fd58fd71a67619a285bf10c33a2d1ceccfb809-eth0
-rw-------. 1 root root    0 2023-03-05 16:36:15.930198021 +0000 aws-cni-b10821e595aef417667637f481b45b5d0183fbc2f6341f53a0e5059d506eaaf5-eth0
bash-5.1# journalctl --list-boots
 -1 3a275f535dc245a6ab9f1a0758dc90d8 Sun 2023-03-05 16:12:30 UTC—Sun 2023-03-05 16:35:46 UTC
  0 d940a617b7c74bcd9676d2ffd02679c0 Sun 2023-03-05 16:37:15 UTC—Tue 2023-03-07 06:11:56 UTC

The previous boot's logs run out at 16:35:46, and the new boot's logs start at 16:37:15. Somehow the CNI results file is created in between. It feels like a new issue in 1.24 to me; I don't think this happened in 1.23.

@mrparkers (Author)

Hi @gbucknel, thanks for sharing your experience here. The information you provided is really useful; I didn't think to check the boot logs when I was debugging this last week. However, this was one of the events on the node that I was troubleshooting:

Warning  Rebooted                 23m                   kubelet     Node ip-10-0-159-84.us-west-2.compute.internal has been rebooted, boot id: c492843a-e20f-4f07-a811-cba4401eed96

So it's possible that this was caused by a reboot of the node too, although I don't have access to the boot logs of this node to confirm (the node is long gone).

Unfortunately, I have been unable to reproduce this manually, but the next time this happens I'll check those boot logs and respond here.

Are you also running Bottlerocket by chance?


gbucknel commented Mar 8, 2023

Hi @mrparkers, yeah, we're on Bottlerocket. It almost looks like we're on the same cluster! ;)

Just for completeness:

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux 10.208.43.221 5.15.79 #1 SMP Wed Jan 25 03:35:24 UTC 2023 x86_64 GNU/Linux
bash-5.1#

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

The thing I changed today was the CNI image: I was on 1.12.0 like you, but I updated one of my clusters to 1.12.5. It seems like some sort of race condition to me, given that it seems to happen during boot.

These two PRs in 1.12.1 look like they are related to initialisation (if you squint hard enough!), so I thought it was worth a try:
aws/amazon-vpc-cni-k8s#2138
aws/amazon-vpc-cni-k8s#2145

I'll write back once I've tested it some more.


jdn5126 commented Mar 8, 2023

@gbucknel the CNI version should not matter here, as writes to /var/lib/cni/results are all on the containerd side of the contract. I think we'll need someone from containerd to help investigate the behavior around reload.

@gbucknel

Hi @jdn5126, cool. Yes, I still see it with CNI 1.12.5. Is there any more info I can gather around this to help?


jdn5126 commented Mar 17, 2023

I am not sure. At least from a CNI perspective, I cannot think of any. I think we just need a containerd maintainer to help here.


dims commented Feb 14, 2024

xref: containernetworking/cni#1055


dims commented Feb 14, 2024

xref: k3s-io/k3s#6185

@ryan-beisner

Is this issue solved by containernetworking/cni#1072? If so, can we rev up libcni for testing on the next release? @dims for feedback.


cartermckinnon commented Aug 8, 2024

@ryan-beisner that libcni fix looks legit to me; it was pulled into main by #10106.

Older containerd release branches are still on libcni 1.1.x, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On the containerd side, I'm not sure whether #10106 can be picked to release/1.7.


dosubot bot commented Dec 25, 2024

Hi, @mrparkers. I'm Dosu, and I'm helping the containerd team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • Pods are getting stuck in the Terminating state due to "KillPodSandbox" errors.
  • The issue may be linked to network configuration issues and node reboots, suggesting a potential race condition.
  • A potential fix related to libcni has been integrated into the main branch but is not yet available in older release branches.
  • Further testing with the updated libcni is suggested to verify the fix.

Next Steps

  • Please let us know if this issue is still relevant to the latest version of containerd by commenting on this issue.
  • If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!
