Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry writing to cgroup files on EINTR error #2258

Merged
merged 2 commits into from
Mar 27, 2020

Conversation

danail-branekov
Copy link
Contributor

Golang 1.14 introduces asynchronous preemption which results into
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.

As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.

Note that this PR also bumps the golang version in the test image Dockerfile so that the fix can be verified.

Dockerfile Outdated
@@ -1,4 +1,4 @@
FROM golang:1.12-stretch
FROM golang:1.14-stretch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be still kept <= 1.13. But you can override the version to 1.14 with --build-arg in CI.

Golang 1.14 introduces asynchronous preemption which results into
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.

As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.

Signed-off-by: Mario Nitchev <[email protected]>

for i := 0; i < 100000; i++ {
limit := 1024*1024 + i
if err := WriteFile(cgroupPath, "memory.limit_in_bytes", strconv.Itoa(limit)); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.Skip() on cgroup v2 env

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added the skip in the beginning of the test

Signed-off-by: Yulia Nedyalkova <[email protected]>
@AkihiroSuda
Copy link
Member

AkihiroSuda commented Mar 19, 2020

LGTM

Approved with PullApprove

@kolyshkin
Copy link
Contributor

@danail-branekov have you actually seen this (EINTR from ioutils.WriteFile) happening?

I'm asking because Golang 1.14 release notes say

<...> programs that use packages like syscall or golang.org/x/sys/unix will see more slow system calls fail with EINTR errors.

and from this I would assume that if you're not using e.g. write(2) syscall directly, golang will either handle EINTR for you, or avoid it somehow (lock the goroutine doing the write). The question is, is this a wrong assumption?

@kolyshkin
Copy link
Contributor

It seems that

  • write(2) should restart in case SA_RESTART flag is set for signal handlers, and
  • golang sets all signal handlers with SA_RESTART
    so, theoretically, EINTR should not happen during write.

But I don't know for sure.

@mnitchev
Copy link
Contributor

Hello @kolyshkin

Yes, we've been seeing this fail in our CI the past two weeks after an upgrade to Go 1.14. The tests we're running are failing on runc run:

"runc run: exit status 1: container_linux.go:349: starting container process caused \"process_linux.go:449: container init caused \\\"process_linux.go:415: setting cgroup config for procHooks process caused \\\\\\\"failed to write \\\\\\\\\\\\\\\"67108864\\\\\\\\\\\\\\\" to \\\\\\\\\\\\\\\"/sys/fs/cgroup/memory/system.slice/garden.service/garden/f6e3016d-9e99-4b56-43fe-f27453cc5149/memory.memsw.limit_in_bytes\\\\\\\\\\\\\\\": write /sys/fs/cgroup/memory/system.slice/garden.service/garden/f6e3016d-9e99-4b56-43fe-f27453cc5149/memory.memsw.limit_in_bytes: interrupted system call\\\\\\\"\\\"\""

Worth noting is that we've only seen this fail on writing in the memory cgroup.
Regarding the fix. The next line in the go doc you mentioned says: Those programs will have to handle those errors in some way, most likely looping to try the system call again., which is why we assumed we needed to add the retry logic. If you run the test we've added without the retry in fscommon.go it fails consistently.

@kolyshkin
Copy link
Contributor

@mnitchev thank you for the explanation! Reproduced locally, filed an issue to golang: golang/go#38033

@AkihiroSuda
Copy link
Member

ping @opencontainers/runc-maintainers . We need this to unblock migration to Go 1.14.

@hqhq
Copy link
Contributor

hqhq commented Mar 27, 2020

Looks like it's regression of golang, but this work around seems harmless.

@hqhq
Copy link
Contributor

hqhq commented Mar 27, 2020

LGTM

Approved with PullApprove

@hqhq hqhq merged commit d4a6a1d into opencontainers:master Mar 27, 2020
@mnitchev
Copy link
Contributor

Do you plan on releasing with this change soon?

@AkihiroSuda
Copy link
Member

I assume we need changes like this in more places before shipping the next release?

@mnitchev
Copy link
Contributor

mnitchev commented Apr 1, 2020

It's probably worth taking a look, but from our perspective, we've only seen this type of error in this specific place in the codebase (for context our CI runs test every 20 minutes using pretty much all the runc operations). It also looks like the fscommon package is used when writing to all the cgroups.

@AkihiroSuda
Copy link
Member

@cyphar What do you think about the next release?

copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jul 8, 2020
After deploying a version of gvisor built with Go 1.14, we're seeing errors setting up cgroups (we manually run `runsc` via `runsc run`, which creates the cgroup). This turns out to be a known issue with Go: golang/go#38033. Given that the [fix won't be backported](golang/go#39026 (comment)), we should retry writes that may fail with EINTR.

This is also what runc does: opencontainers/runc#2258

FUTURE_COPYBARA_INTEGRATE_REVIEW=#3102 from stripe:andrew/cgroup-eintr 079123b
PiperOrigin-RevId: 320183771
copybara-service bot pushed a commit to google/gvisor that referenced this pull request Jul 28, 2020
After deploying a version of gvisor built with Go 1.14, we're seeing errors setting up cgroups (we manually run `runsc` via `runsc run`, which creates the cgroup). This turns out to be a known issue with Go: golang/go#38033. Given that the [fix won't be backported](golang/go#39026 (comment)), we should retry writes that may fail with EINTR.

This is also what runc does: opencontainers/runc#2258

FUTURE_COPYBARA_INTEGRATE_REVIEW=#3102 from stripe:andrew/cgroup-eintr 079123b
PiperOrigin-RevId: 323575152
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants