[4.6] kludges for the freeze race #40

kolyshkin · 2021-03-01T22:52:40Z

This backports upstream PRs opencontainers/runc#2774 and opencontainers/runc#2791 to the rhaos-4.6 branch, which should fix https://bugzilla.redhat.com/show_bug.cgi?id=1903228

Before this commit, Set() used GetState() to check the freezer state and retry the operation if the actual state still differed from the requested one. This was meant to help when a new process (such as one added by runc exec) is added to the container's cgroup while it is being frozen by the kernel, but it does not work as intended.

The problem is that GetState() never returns the FREEZING state; it loops until the state is either FROZEN or THAWED, so Set() never gets a chance to repeat the freeze attempt. As a result, the container might end up stuck in the FREEZING state, with GetState() never returning (which in turn blocks some other operations).

One way to fix this would be to have GetState() return the FREEZING state instead of retrying ad infinitum. That would change the public API, and no caller of GetState() expects it to return this state.

To fix, let's not use GetState() from Set(). Instead, read the freezer.state file directly and act accordingly: return success on FROZEN, retry on FREEZING, and error out on any other (unexpected) value.

While at it, further improve the code:
- limit the number of retries;
- if retries are exceeded, thaw and return an error;
- don't retry (or read the state back) on THAW.

I played a lot with various reproducers for this bug, including
- parallel runc execs and runc pause/resumes;
- parallel runc execs and runc --systemd-cgroup update (the latter performs freeze/unfreeze);
- continuously running /bin/printf inside the container in parallel with runc pause/resume;
- running a pthread bomb (from the criu test suite) in parallel with runc pause/resume;

and I was not able to make freeze work 100% of the time, meaning sometimes runc pause fails, or runc --systemd-cgroup update produces a warning.

With that said, it's still a big improvement over the previous state of affairs, where the container is stuck in the FREEZING state, and GetState() (and all its users) are also stuck.

Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit 76ae1f5)
Signed-off-by: Kir Kolyshkin <[email protected]>
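
For readers who want the shape of the new Set() logic without opening the diff, here is a minimal sketch of the retry loop described in this commit message. It is not the actual runc code: the `stateIO` type, the `freeze` and `thaw` helpers, and the `maxFreezeRetries` value are hypothetical names chosen for illustration, and access to freezer.state is abstracted behind two functions so the loop can be exercised without a real cgroup.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// stateIO abstracts reading and writing a cgroup v1 freezer.state file.
// Hypothetical type for this sketch; runc uses its own file helpers.
type stateIO struct {
	read  func() (string, error)
	write func(value string) error
}

// maxFreezeRetries is an illustrative limit, assumed for this sketch.
const maxFreezeRetries = 1000

var errFreezeTimeout = errors.New("unable to freeze: retries exceeded")

// freeze requests FROZEN and then inspects the raw state itself instead of
// calling a GetState-like helper that hides FREEZING: success on FROZEN,
// retry on FREEZING, error on any other (unexpected) value.
func freeze(f stateIO, pause time.Duration) error {
	for i := 0; i < maxFreezeRetries; i++ {
		if err := f.write("FROZEN"); err != nil {
			return err
		}
		raw, err := f.read()
		if err != nil {
			return err
		}
		switch state := strings.TrimSpace(raw); state {
		case "FROZEN":
			return nil
		case "FREEZING":
			// A process may have joined the cgroup mid-freeze; try again.
			time.Sleep(pause)
		default:
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	// Retries exceeded: thaw (best effort) so the cgroup is not left
	// stuck in FREEZING, then report the failure.
	_ = f.write("THAWED")
	return errFreezeTimeout
}

// thaw writes THAWED once, with no retry and no read-back.
func thaw(f stateIO) error {
	return f.write("THAWED")
}
```

In real use, `read` and `write` would wrap the freezer.state file of the container's cgroup; keeping them as plain functions makes the retry policy easy to test with a fake.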

It appears that briefly thawing the cgroup while freezing greatly increases its chances of freezing successfully.

The test case I used runs runc exec in a loop in parallel with runc pause/resume in another loop. The failure-to-freeze rate dropped from 40 to 0 per minute (tested inside a VM using a busybox container running sleep 1h, doing about 1500 pause/resumes and 650 execs per minute), with max retries being 150 (of 1000).

This is still a game of chance, so failures are possible.

Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit d1007b0)
Signed-off-by: Kir Kolyshkin <[email protected]>
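
The brief-thaw kludge can be sketched as a small variation on a freeze retry loop: every so often, write THAWED, wait a moment, and then ask for FROZEN again. This is an illustration only, not the code from the commit. `cgroupFile`, `freezeWithBriefThaw`, and the `retries`, `thawEvery`, and `pause` parameters are hypothetical names, and the interval and sleep values to use are assumptions left to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// cgroupFile abstracts access to freezer.state (hypothetical type).
type cgroupFile struct {
	read  func() (string, error)
	write func(value string) error
}

// freezeWithBriefThaw retries the freeze up to `retries` times. On every
// thawEvery-th attempt it first thaws the cgroup briefly, which (per the
// observation above) improves the odds that the next freeze succeeds.
// thawEvery <= 0 disables the brief thaw.
func freezeWithBriefThaw(c cgroupFile, retries, thawEvery int, pause time.Duration) error {
	for i := 0; i < retries; i++ {
		if thawEvery > 0 && i%thawEvery == thawEvery-1 {
			// Briefly thaw before asking for FROZEN again.
			if err := c.write("THAWED"); err != nil {
				return err
			}
			time.Sleep(pause)
		}
		if err := c.write("FROZEN"); err != nil {
			return err
		}
		raw, err := c.read()
		if err != nil {
			return err
		}
		switch state := strings.TrimSpace(raw); state {
		case "FROZEN":
			return nil
		case "FREEZING":
			time.Sleep(pause)
		default:
			return fmt.Errorf("unexpected freezer state %q", state)
		}
	}
	// Give up: leave the cgroup thawed rather than stuck in FREEZING.
	_ = c.write("THAWED")
	return errors.New("unable to freeze: retries exceeded")
}
```

Since this remains a game of chance, callers still need to handle the error return; the thaw only shifts the odds.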

kolyshkin added 2 commits March 1, 2021 14:47

kolyshkin changed the base branch from master to rhaos-4.6 March 1, 2021 22:52

mrunalp approved these changes Mar 1, 2021


haircommander merged commit 8c2e7c8 into projectatomic:rhaos-4.6 Mar 2, 2021
