Empty cache files cause "KillPodSandbox" errors when deleting pods #8197
Comments
hi @mrparkers, I think I have the exact same thing as you. I have all the same versions and everything, and am even using a registry mirror. I noticed that the empty aws-cni file seems to have a creation time coincident with a host restart (16:36:15 below)? I was wondering if you see the same thing?
The previous boot logs run out at 16:35:46, and then the new boot log starts at 16:37:15.
Hi @gbucknel, thanks for sharing your experience here. The information you provided is really useful; I didn't think to check the boot logs when I was debugging this last week. However, this was one of the events on the node that I was troubleshooting:
So it's possible that this was caused by a reboot of the node too, although I don't have access to the boot logs of this node to confirm (the node is long gone). Unfortunately, I have been unable to reproduce this manually, but the next time this happens I'll check those boot logs and respond here. Are you also running Bottlerocket by chance?
hi @mrparkers, yeah we're on Bottlerocket. It almost looks like we're on the same cluster! ;) Just for completeness:
The thing I changed today was the CNI image: I was on 1.12.0 like you, but updated one of my clusters to 1.12.5 today. It seems like some sort of race condition to me, given it seems to happen during boot. These two PRs in 1.12.1 look like they are related to initialisation (if you squint hard enough!), so I thought it was worth a try. I'll write back once I've tested it some more.
@gbucknel the CNI version should not matter here, as writes to …
hi @jdn5126 cool, yes, I still see it with cni …
I am not sure. At least from a CNI perspective, I cannot think of any. I think we just need a containerd maintainer to help here.
xref: k3s-io/k3s#6185
Is this issue solved by containernetworking/cni#1072? If so, can we rev up libcni for testing on the next release? @dims for feedback.
@ryan-beisner that libcni fix looks legit to me; it was pulled into main by #10106. Older containerd release branches are still on 1.1.x of libcni, which doesn't have this fix yet. AFAICT, libcni doesn't do release branches, so I don't see a way to get the fix cherry-picked there. On this end, not sure if #10106 can be picked to …
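As an aside on the failure mode itself: a zero-byte result file left behind after an abrupt reboot is the classic symptom of a write that was interrupted before any data reached disk. I have not verified that the libcni change referenced above works exactly this way, so treat the following as a generic sketch of the usual guard (write to a temporary file, sync, then rename into place), not as the actual fix:

```go
// Generic write-temp-then-rename sketch (an assumption about the style of fix,
// not the actual libcni code): the destination path only ever holds either the
// old contents or a fully written new file, never a truncated one.
package atomicwrite

import (
	"os"
	"path/filepath"
)

// WriteFileAtomic writes data to a temporary file in the same directory,
// flushes it to disk, and then renames it over the destination in one step.
func WriteFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cni-cache-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // make sure the bytes are on disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}
```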
Hi, @mrparkers. I'm Dosu, and I'm helping the containerd team manage their backlog. I'm marking this issue as stale.

Issue Summary
Next Steps
Thank you for your understanding and contribution!
Description
I originally reported this issue at aws/amazon-vpc-cni-k8s#2283, but I was asked to post the issue here instead.
I'm running into a problem where a node will occasionally have a bunch of pods stuck in the `Terminating` state because the pod's sandbox can't be deleted. Pod event:
On the `containerd` side, I see similar logs. First, the reason given is `cni plugin not initialized`:

This repeats a few times, then the error message changes to `unexpected end of JSON input`.
On the host itself, I noticed that the `/var/lib/cni/results` directory contains a cache file for the container with the ID `4285ab7ef1b33097068f6e2dfbf2a71c96129f45d0d6655d47d3dc80db5f0399` that's completely empty (zero bytes):

Meanwhile, the other `aws-cni-*-eth0` files are not empty, and contain valid JSON.

So I believe the `unexpected end of JSON input` error message is caused by `containerd` attempting to `json.Unmarshal` an empty file, which I believe is happening here: `containerd/vendor/github.com/containernetworking/cni/pkg/types/create/create.go`, lines 31 to 34 in `a217b5a`.
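For what it's worth, the error string matches exactly what `encoding/json` returns for zero-length input, which is easy to confirm in isolation (a standalone snippet, not containerd code):

```go
// Standalone demonstration (not containerd code): unmarshalling an empty byte
// slice, which is what reading a zero-byte cache file produces, yields the
// same "unexpected end of JSON input" error seen in the containerd logs.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var cached map[string]interface{}
	err := json.Unmarshal([]byte{}, &cached)
	fmt.Println(err) // unexpected end of JSON input
}
```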
I am able to temporarily resolve the issue by deleting the empty cache file. The stuck pods are instantly able to be deleted as soon as I do that.
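In case it helps anyone hitting the same thing, here is a rough sketch of that workaround as a small Go program; the directory path and the zero-byte check come from this report, everything else is just illustrative:

```go
// Illustrative workaround sketch: remove any zero-byte result files under
// /var/lib/cni/results so stuck sandboxes can be torn down. This mirrors the
// manual "delete the empty cache file" step described above.
package main

import (
	"log"
	"os"
	"path/filepath"
)

func main() {
	const dir = "/var/lib/cni/results"
	entries, err := os.ReadDir(dir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.Size() == 0 { // an empty cache file can never be valid JSON
			path := filepath.Join(dir, e.Name())
			log.Printf("removing empty CNI cache file %s", path)
			if err := os.Remove(path); err != nil {
				log.Printf("failed to remove %s: %v", path, err)
			}
		}
	}
}
```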
Another note: it looks like most of the caching logic is actually handled in https://github.com/containernetworking/cni, so perhaps this particular issue needs to be raised upstream. I'm not too sure.
My theory is that some other issue is causing the AWS VPC CNI plugin to crash or experience some issues, and `containerd` is mistakenly writing an empty cache file to `/var/lib/cni/results` while this is happening. Then, when the AWS VPC CNI plugin starts working again, this empty cache file is read, causing the error I'm seeing.

Steps to reproduce the issue
I wish I was able to reliably reproduce the underlying issue. The best I've been able to do is to delete the contents of one of the cache files within `/var/lib/cni/results`, which will reproduce the `unexpected end of JSON input` error message. However, I have been unable to reproduce the issue that causes the empty cache file to exist in the first place.

Describe the results you received and expected
I would expect an empty cache file to never be written to `/var/lib/cni/results`. Or, if this does happen, perhaps `containerd` could catch this and remove the bad cache file before attempting to read from it.
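To make that suggestion concrete, something along these lines is what I have in mind; `loadCachedResult` is a hypothetical stand-in for the cache-reading code path, not the real containerd or libcni API:

```go
// Sketch of the suggested guard, using a hypothetical loadCachedResult helper
// rather than the real containerd/libcni code path: a zero-byte cache file is
// removed and treated as a cache miss instead of being fed to json.Unmarshal.
package cniresult

import (
	"encoding/json"
	"os"
)

// loadCachedResult reads a cached CNI result file; it returns (nil, nil) when
// the file is empty so callers can fall back to a best-effort teardown.
func loadCachedResult(path string) (map[string]interface{}, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	if len(data) == 0 {
		_ = os.Remove(path) // drop the bad cache file so it doesn't wedge future deletes
		return nil, nil
	}
	var result map[string]interface{}
	if err := json.Unmarshal(data, &result); err != nil {
		return nil, err
	}
	return result, nil
}
```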
What version of containerd are you using?
Any other relevant information
CNI Plugin is https://github.com/aws/amazon-vpc-cni-k8s `v1.12.0`
Kubernetes version:
Show configuration if it is related to CRI plugin.