
Empty cache files cause "KillPodSandbox" errors when deleting pods #8197

Open
mrparkers opened this issue Mar 2, 2023 · 11 comments


mrparkers commented Mar 2, 2023

Description

I originally reported this issue at aws/amazon-vpc-cni-k8s#2283, but I was asked to post the issue here instead.

I'm running into a problem where a node will occasionally have a bunch of pods stuck in the Terminating state because the pod's sandbox can't be deleted:

Pod event:

Warning  FailedKillPod   43s (x229 over 49m)  kubelet  error killing pod: failed to "KillPodSandbox" for "a7ea2280-7361-4329-8566-89975485b20e" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the containerd side, I see similar logs. First, the reason given is cni plugin not initialized:

time="2023-03-01T20:08:10.167194705Z" level=info msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\""
time="2023-03-01T20:08:10.167346791Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": cni plugin not initialized"

This repeats a few times, then the error message changes to unexpected end of JSON input:

time="2023-03-01T20:08:34.067124024Z" level=error msg="StopPodSandbox for \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\" failed" error="failed to destroy network for sandbox \"4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399\": failed to get network \"aws-cni\" cached result: decoding version from network config: unexpected end of JSON input"

On the host itself, I noticed that the /var/lib/cni/results directory contains a cache file for the container with the ID 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399 that's completely empty (zero bytes):

bash-5.1# ls -alh /var/lib/cni/results | grep 4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399
-rw-------. 1 root root    0 Mar 1 06:31 aws-cni-4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399-eth0

Meanwhile, the other aws-cni-*-eth0 files are not empty and contain valid JSON.
So I believe the unexpected end of JSON input error message comes from containerd attempting to json.Unmarshal an empty file, which appears to happen here:

err := json.Unmarshal(jsonBytes, &conf)
if err != nil {
	return "", fmt.Errorf("decoding version from network config: %w", err)
}
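
For what it's worth, the error string itself is easy to reproduce in isolation. This is only a minimal standalone sketch (not containerd or libcni code) showing that json.Unmarshal on zero-length input produces exactly this message:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Simulate reading an empty cache file from /var/lib/cni/results.
	empty := []byte{}

	var conf struct {
		CNIVersion string `json:"cniVersion"`
	}

	// encoding/json rejects zero-length input with "unexpected end of JSON input",
	// which is the message wrapped by "decoding version from network config: %w".
	if err := json.Unmarshal(empty, &conf); err != nil {
		fmt.Printf("decoding version from network config: %v\n", err)
	}
}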

I am able to temporarily resolve the issue by deleting the empty cache file. The stuck pods can be deleted immediately once I do that.

Another note: it looks like most of the caching logic is actually handled in https://github.com/containernetworking/cni, so perhaps this particular issue needs to be raised upstream. I'm not too sure.


My theory is that some other issue is causing the AWS VPC CNI plugin to crash or experience some issues, and containerd is mistakenly writing an empty cache file to /var/lib/cni/results while this is happening. Then, when the AWS VPC CNI plugin starts working again, this empty cache file is read, causing the error I'm seeing.

Steps to reproduce the issue

I wish I was able to reliably reproduce the underlying issue. The best I've been able to do is to delete the contents of one of the cache files within /var/lib/cni/results, which will reproduce the unexpected end of JSON input error message. However, I have been unable to reproduce the issue that causes the empty cache file to exist in the first place.
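
For reference, emptying an existing cache file with something like the following (the container ID is a placeholder; use any existing file) is enough to trigger the decode error for that sandbox:

bash-5.1# : > /var/lib/cni/results/aws-cni-<container-id>-eth0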

Describe the results you received and expected

I would expect that an empty cache file is never written to /var/lib/cni/results. Or, if that does happen, perhaps containerd could detect the bad cache file and remove it before attempting to read from it.
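
As a rough illustration of that second option, here is only a sketch of the kind of guard I have in mind (the function name and signature are made up; this is not containerd's or libcni's actual code):

package cache

import "os"

// readCachedResult treats a missing or zero-byte cache file as "no cached result"
// instead of failing the sandbox teardown with a JSON decode error.
func readCachedResult(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil // no cache entry at all
		}
		return nil, err
	}
	if len(data) == 0 {
		// A zero-byte file can never be valid JSON; drop it and report "no cache"
		// so the caller can fall back to its normal teardown path.
		_ = os.Remove(path)
		return nil, nil
	}
	return data, nil
}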

What version of containerd are you using?

ctr github.com/containerd/containerd 1.6.15+bottlerocket

Any other relevant information

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux ip-10-0-143-2.us-west-2.compute.internal 5.15.79 #1 SMP Wed Jan 25 03:35:23 UTC 2023 aarch64 GNU/Linux

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

The CNI plugin is https://github.com/aws/amazon-vpc-cni-k8s v1.12.0.

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:36:36Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.10-eks-48e63af", GitCommit:"9176fb99b52f8d5ff73d67fea27f3a638f679f8a", GitTreeState:"clean", BuildDate:"2023-01-24T19:17:48Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Show configuration if it is related to CRI plugin.

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
disabled_plugins = [
    "io.containerd.internal.v1.opt",
    "io.containerd.snapshotter.v1.aufs",
    "io.containerd.snapshotter.v1.devmapper",
    "io.containerd.snapshotter.v1.native",
    "io.containerd.snapshotter.v1.zfs",
]

[grpc]
address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri"]
enable_selinux = true
# Pause container image is specified here, shares the same image as kubelet's pod-infra-container-image
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.1-eksbuild.1"

[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "shimpei"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei]
runtime_type = "io.containerd.runc.v2"
base_runtime_spec = "/etc/containerd/cri-base.json"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.shimpei.options]
SystemdCgroup = true
BinaryName = "shimpei"

[plugins."io.containerd.grpc.v1.cri".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://docker-registry.redacted.com"]

gbucknel commented Mar 7, 2023

Hi @mrparkers, I think I'm hitting the exact same thing as you. I'm on all the same versions and am even using a registry mirror.

I noticed that the empty aws-cni file has a creation time that coincides with a host restart (16:36:15 below). I was wondering if you see the same thing?

bash-5.1# ls -la --time-style=full-iso /var/lib/cni/results/
total 16
drwx------. 2 root root 4096 2023-03-07 05:07:37.644843952 +0000 .
drwx------. 3 root root 4096 2023-03-05 16:12:59.115640029 +0000 ..
-rw-------. 1 root root 2289 2023-03-05 16:37:32.858972330 +0000 aws-cni-aae559259afb19d5086194c3e0fd58fd71a67619a285bf10c33a2d1ceccfb809-eth0
-rw-------. 1 root root    0 2023-03-05 16:36:15.930198021 +0000 aws-cni-b10821e595aef417667637f481b45b5d0183fbc2f6341f53a0e5059d506eaaf5-eth0
bash-5.1# journalctl --list-boots
 -1 3a275f535dc245a6ab9f1a0758dc90d8 Sun 2023-03-05 16:12:30 UTC—Sun 2023-03-05 16:35:46 UTC
  0 d940a617b7c74bcd9676d2ffd02679c0 Sun 2023-03-05 16:37:15 UTC—Tue 2023-03-07 06:11:56 UTC

The previous boot's logs run out at 16:35:46, and the new boot's logs start at 16:37:15. Somehow the CNI results file is created in between. It feels like a new issue in 1.24 to me; I don't think this happened in 1.23.

@mrparkers (Author)

Hi @gbucknel, thanks for sharing your experience here. The information you provided is really useful; I didn't think to check the boot logs when I was debugging this last week. However, this was one of the events on the node that I was troubleshooting:

Warning  Rebooted                 23m                   kubelet     Node ip-10-0-159-84.us-west-2.compute.internal has been rebooted, boot id: c492843a-e20f-4f07-a811-cba4401eed96

So it's possible that this was caused by a reboot of the node too, although I don't have access to the boot logs of this node to confirm (the node is long gone).

Unfortunately, I have been unable to reproduce this manually, but the next time this happens I'll check those boot logs and respond here.

Are you also running Bottlerocket by chance?


gbucknel commented Mar 8, 2023

Hi @mrparkers, yeah, we're on Bottlerocket. It almost looks like we're on the same cluster! ;)

Just for completeness:

bash-5.1# runc --version
runc version 1.1.4+bottlerocket
commit: 5fd4c4d144137e991c4acebb2146ab1483a97925
spec: 1.0.2-dev
go: go1.19.4
libseccomp: 2.5.4

bash-5.1# uname -a
Linux 10.208.43.221 5.15.79 #1 SMP Wed Jan 25 03:35:24 UTC 2023 x86_64 GNU/Linux
bash-5.1#

bash-5.1# cat /etc/os-release
NAME=Bottlerocket
ID=bottlerocket
VERSION="1.12.0 (aws-k8s-1.24)"
PRETTY_NAME="Bottlerocket OS 1.12.0 (aws-k8s-1.24)"
VARIANT_ID=aws-k8s-1.24
VERSION_ID=1.12.0
BUILD_ID=6ef1139f
HOME_URL="https://github.com/bottlerocket-os/bottlerocket"
SUPPORT_URL="https://github.com/bottlerocket-os/bottlerocket/discussions"
BUG_REPORT_URL="https://github.com/bottlerocket-os/bottlerocket/issues"

The thing I changed today was the CNI image: I was on 1.12.0 like you, but I updated one of my clusters to 1.12.5. It seems like some sort of race condition to me, given that it seems to happen during boot.

These two PRs in 1.12.1 look like they are related to initialisation (if you squint hard enough!), so I thought it was worth a try:
aws/amazon-vpc-cni-k8s#2138
aws/amazon-vpc-cni-k8s#2145

I'll write back once I've tested it some more.


jdn5126 commented Mar 8, 2023

@gbucknel the CNI version should not matter here, as writes to /var/lib/cni/results are all on the containerd side of the contract. I think we'll need someone from containerd to help investigate the behavior around reload.

@gbucknel

Hi @jdn5126, cool. Yes, I still see it with CNI 1.12.5. Is there any more info I can gather around this to help?


jdn5126 commented Mar 17, 2023

I am not sure. At least from a CNI perspective, I cannot think of any. I think we just need a containerd maintainer to help here.


dims commented Feb 14, 2024

xref: containernetworking/cni#1055


dims commented Feb 14, 2024

xref: k3s-io/k3s#6185

@ryan-beisner

Is this issue solved by containernetworking/cni#1072? If so, can we rev up libcni for testing on the next release? @dims for feedback.


cartermckinnon commented Aug 8, 2024

@ryan-beisner that libcni fix looks legit to me; it was pulled into main by #10106.

Older containerd release branches are still on libcni 1.1.x, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On the containerd side, I'm not sure whether #10106 can be picked to release/1.7.


dosubot bot commented Dec 25, 2024

Hi, @mrparkers. I'm Dosu, and I'm helping the containerd team manage their backlog. I'm marking this issue as stale.

Issue Summary

  • Pods are getting stuck in the Terminating state due to "KillPodSandbox" errors.
  • The issue may be linked to network configuration issues and node reboots, suggesting a potential race condition.
  • A potential fix related to libcni has been integrated into the main branch but is not yet available in older release branches.
  • Further testing with the updated libcni is suggested to verify the fix.

Next Steps

  • Please let us know if this issue is still relevant to the latest version of containerd by commenting on this issue.
  • If there is no further activity, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!
