Skip to content

Commit

Permalink
libct/cg/fs/freezer: fix freezing race
Browse files Browse the repository at this point in the history
Before this commit, Set() used GetState() to check the freezer state
and retry the operation if the actual state still differs from requested.
This should help with the situation when a new process (such as one
added by runc exec) is added to the container's cgroup while it's being
freezed by the kernel, but it's not working as it should.

The problem is, GetState() never returns FREEZING state, looping until
the state is either FROZEN or THAWED, so Set() does not have a chance
to repeate the freeze attempt.

As a result, the container might end up stuck in a FREEZING state,
with GetState() never returning (which in turn blocks some other
operations).

One way to fix this would be to have GetState returning FREEZING state
instead of retrying ad infinitum. It would result in changing the public
API, and no callers of GetState expects it to return this.

To fix, let's not use GetState() from Set(). Instead, read the
freezer.state file directly and act accordingly -- return success
on FROZEN, retry on FREEZING, and error out on any other (unexpected)
value.

While at it, further improve the code:
 - limit the number of retries;
 - if retries are exceeded, thaw and return an error;
 - don't retry (or read the state back) on THAW.

I played a lot with various reproducers for this bug, including

 - parallel runc execs and runc pause/resumes
 - parallel runc execs and runc --systemd-cgroup update
   (the latter performs freeze/unfreeze);
 - continuously running /bin/printf inside container
   in parallel with runc pause/resume;
 - running pthread bomb (from criu test suite) in parallel
   with runc pause/resume;

and I was not able to make freeze work 100%, meaning sometimes
runc pause fails, or runc --systemd-cgroup update produces a warning.

With that said, it's still a big improvement over the previous
state of affairs where container is stuck in FREEZING state,
and GetState() (and all its users) are also stuck.

Signed-off-by: Kir Kolyshkin <[email protected]>
  • Loading branch information
kolyshkin committed Feb 1, 2021
1 parent 6c85f63 commit 76ae1f5
Showing 1 changed file with 35 additions and 14 deletions.
49 changes: 35 additions & 14 deletions libcontainer/cgroups/fs/freezer.go
Original file line number Diff line number Diff line change
Expand Up @@ -28,33 +28,54 @@ func (s *FreezerGroup) Apply(path string, d *cgroupData) error {

func (s *FreezerGroup) Set(path string, cgroup *configs.Cgroup) error {
switch cgroup.Resources.Freezer {
case configs.Frozen, configs.Thawed:
for {
// In case this loop does not exit because it doesn't get the expected
// state, let's write again this state, hoping it's going to be properly
// set this time. Otherwise, this loop could run infinitely, waiting for
// a state change that would never happen.
if err := fscommon.WriteFile(path, "freezer.state", string(cgroup.Resources.Freezer)); err != nil {
case configs.Frozen:
// As per older kernel docs (freezer-subsystem.txt before
// kernel commit ef9fe980c6fcc1821), if FREEZING is seen,
// userspace should either retry or thaw. While current
// kernel cgroup v1 docs no longer mention a need to retry,
// the kernel (tested on v5.4, Ubuntu 20.04) can't reliably
// freeze a cgroup while new processes keep appearing in it
// (either via fork/clone or by writing new PIDs to
// cgroup.procs).
//
// The number of retries below is chosen to have a decent
// chance to succeed even in the worst case scenario (runc
// pause/unpause with parallel runc exec).
//
// Adding any amount of sleep in between retries did not
// increase the chances of successful freeze.
for i := 0; i < 1000; i++ {
if err := fscommon.WriteFile(path, "freezer.state", string(configs.Frozen)); err != nil {
return err
}

state, err := s.GetState(path)
state, err := fscommon.ReadFile(path, "freezer.state")
if err != nil {
return err
}
if state == cgroup.Resources.Freezer {
break
state = strings.TrimSpace(state)
switch state {
case "FREEZING":
continue
case string(configs.Frozen):
return nil
default:
// should never happen
return fmt.Errorf("unexpected state %s while freezing", strings.TrimSpace(state))
}

time.Sleep(1 * time.Millisecond)
}
// Despite our best efforts, it got stuck in FREEZING.
// Leaving it in this state is bad and dangerous, so
// let's (try to) thaw it back and error out.
_ = fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
return errors.New("unable to freeze")
case configs.Thawed:
return fscommon.WriteFile(path, "freezer.state", string(configs.Thawed))
case configs.Undefined:
return nil
default:
return fmt.Errorf("Invalid argument '%s' to freezer.state", string(cgroup.Resources.Freezer))
}

return nil
}

func (s *FreezerGroup) GetStats(path string, stats *cgroups.Stats) error {
Expand Down

0 comments on commit 76ae1f5

Please sign in to comment.