Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resilience in adding of exec tasks to cgroups #1950

Merged
merged 1 commit into from
Feb 1, 2019

Conversation

BooleanCat
Copy link

This is related to #1884

This fix allows more leniency for setting up cgroups on exec when cgroup namespaces are not used.

The issue linked above was partly resolved by #1916 - we can avoid the issue entirely by using cgroup namespaces. Cloud Foundry's use of runc does not yet use cgroup namespaces and we encounter the issue above frequently.

We believe that since not using cgroup namespaces is a valid configuration of a config.json, runc should be more resilient when we choose not to. This PR will attempt to write to cgroup.procs a few times when execing, only retrying on seeing EINVAL (which, unless you're using realtime processes, means the task's state in the kernel is still TASK_NEW: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286)

We replaced the ioutil.WriteFile with a Open/Write dance because we don't want to open the file each time we attempt to write to cgroup.procs, and we reduce the surface for error types changing underneath us within golang.

The kernel will sometimes return EINVAL when writing a pid to a
cgroup.procs file. It does so when the task being added still has the
state TASK_NEW.

See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286

Co-authored-by: Danail Branekov <[email protected]>

Signed-off-by: Tom Godkin <[email protected]>
Signed-off-by: Danail Branekov <[email protected]>
BooleanCat pushed a commit to cloudfoundry/garden-runc-release that referenced this pull request Dec 17, 2018
The patch contains the commit from PR opencontainers/runc#1950

[#162607734]

Co-authored-by: Danail Branekov <[email protected]>
@BooleanCat
Copy link
Author

Friendly bump

@BooleanCat
Copy link
Author

@cyphar @crosbymichael Hey friends! Currently Garden is shipping with this in place as a patch since it was causing persistent application push failures in some cloud foundry deployments (which is fixed with this change).

Would be great to get this merge in so we can stop relying on patching runc! Please let me know what you think, thanks!

@crosbymichael
Copy link
Member

Reviewing this today. Sorry, it got lost on me.

Thanks for the ping!

@crosbymichael
Copy link
Member

crosbymichael commented Jan 30, 2019

LGTM

Approved with PullApprove

1 similar comment
@mrunalp
Copy link
Contributor

mrunalp commented Feb 1, 2019

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 4e4c907 into opencontainers:master Feb 1, 2019
@redbaron
Copy link

Does it solve #1326 ?

@BooleanCat
Copy link
Author

We've had this running in prod for a while now (even as a gift patch before this got merged) in cloud foundry and have not seen it reoccur.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants