Sporadic runc exec failures #1884
Hi guys, We think it's a problem related to the fact that runc is written in Go. Specifically, we tracked the issue to the following condition in the kernel:
/*
 * Avoid calling sched_move_task() before wake_up_new_task()
 * has happened. This would lead to problems with PELT, due to
 * move wanting to detach+attach while we're not attached yet.
 */
if (task->state == TASK_NEW)
	ret = -EINVAL;
This check was added in kernel version v4.8. What we see from the kernel state is that the rejected pid belongs to a thread whose parent is the pid that runc is trying to add to the cgroup.
Here is a full example of the repro and kernel state:
And the task states in the kernel are:
@hqhq @crosbymichael:
We would be happy to own this fix, just let us know if you have other ideas in mind.
Awesome. Another day, another reason why using Go was a mistake. 😉 In all seriousness, yeah this is a pretty bad issue. I think the easiest solution would be to just carry #1184 -- which includes a patch where the cgroup attachment code is done while
It isn't active, we'd need to carry it. To be honest, the patch was mostly done; it just needed some review. I had quite a few concerns in the early review process, but I think the patch just stagnated. I can try to carry it this week and see if it solves the issue for you.
Hi @cyphar, any updates about the fix? Can we assist in any way?
Sorry, I've been working on some kernel patches related to container runtime security. I will work on carrying the PR next week.
I'm going to look into this today and see if we can get a smaller diff to fix the issues you are seeing.
Hello, I have users that are experiencing pain from this issue as well. Do we have any updates on a fix?
@jaeco I just submitted a rebased PR to fix this issue. We were able to reproduce it, and I can confirm the fix resolves this issue.
Thank you so much!!
Fix for opencontainers/runc#1884 is in opencontainers/runc#1916
- problem described in 1884 is independent of kubernetes version.
- happens with kernel >= v4.8
- fixed in runc version v1.0.0-rc6 and above ( opencontainers/runc@9a3a8a5 )
- which got pulled into containerd v1.3.0-beta.0 and above ( containerd/containerd@97dd5df#diff-4061fcef378a6d912e14e2ce162a1995)
As of runc 1.0.0-rc5+dev we've started noticing an increased rate of sporadic errors during k8s liveness probes.
When the test fails, it produces the following error:
This might be related to #1326 and moby/moby#31230, but without a root cause or a resolution we can't be sure.
When playing with runc code, it appears that the error disappears if we retry running cgroups.EnterPid on failure (which might indicate this is a transient/race issue).
We've found that it's easier to reproduce this issue if we run perf together with excessive docker exec load (we tried this on ubuntu xenial and other OSs):
perf trace --no-syscalls --event 'sched:*'
Runc version