(*initProcess).start: rm second Apply #2446

kolyshkin · 2020-06-01T00:28:52Z

Commit df3fa11 adds a second call to cgroupManager.Apply() method in
order to add a second pid to the cgroup.

While I admit I don't understand the reason why it was added
(even after reading #1916, #1184, and #1884), it seems that
going through all the steps (configure cgroup parameters, figure
out paths to all controllers, create a systemd unit) twice is excessive
and should not be done.

More to say, even merely adding the child pid to the same cgroup seems
redundant, as we added the parent pid to the cgroup before sending the
data to the child (runc init process), and it waits for the data before
doing clone(), so its children will be in the same cgroup anyway.

Needs a very careful review.

kolyshkin · 2020-06-01T00:33:25Z

The thing I do not understand is why Apply() is called twice here:

runc/libcontainer/process_linux.go

Lines 322 to 358 in dbe5aca

    
           // Do this before syncing with child so that no children can escape the 
        
           // cgroup. We don't need to worry about not doing this and not being root 
        
           // because we'd be using the rootless cgroup manager in that case. 
        
           if err := p.manager.Apply(p.pid()); err != nil { 
        
           	return newSystemErrorWithCause(err, "applying cgroup configuration for process") 
        
           } 
        
           if p.intelRdtManager != nil { 
        
           	if err := p.intelRdtManager.Apply(p.pid()); err != nil { 
        
           		return newSystemErrorWithCause(err, "applying Intel RDT configuration for process") 
        
           	} 
        
           } 
        
           if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil { 
        
           	return newSystemErrorWithCause(err, "copying bootstrap data to pipe") 
        
           } 
        
           childPid, err := p.getChildPid() 
        
           if err != nil { 
        
           	return newSystemErrorWithCause(err, "getting the final child's pid from pipe") 
        
           } 
        
           // Save the standard descriptor names before the container process 
        
           // can potentially move them (e.g., via dup2()).  If we don't do this now, 
        
           // we won't know at checkpoint time which file descriptor to look up. 
        
           fds, err := getPipeFds(childPid) 
        
           if err != nil { 
        
           	return newSystemErrorWithCausef(err, "getting pipe fds for pid %d", childPid) 
        
           } 
        
           p.setExternalDescriptors(fds) 
        
           // Do this before syncing with child so that no children 
        
           // can escape the cgroup 
        
           if err := p.manager.Apply(childPid); err != nil { 
        
           	return newSystemErrorWithCause(err, "applying cgroup configuration for process") 
        
           } 
        
           if p.intelRdtManager != nil { 
        
           	if err := p.intelRdtManager.Apply(childPid); err != nil { 
        
           		return newSystemErrorWithCause(err, "applying Intel RDT configuration for process") 
        
           	} 
        
           }

Presumably it is to fix #1884, and I might buy into adding the second pid, but can't understand why we need to say create the systemd slice for the second time, set all limits for the second time etc.

cyphar · 2020-06-01T01:33:42Z

I noticed the double-Apply recently and I agree that it doesn't make any sense to me. In fact, I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c (meaning that all the container processes are in the container). Note that p.pid() is the PID of the runc init process.

In fact, unless I'm misunderstanding the code (which is quite possible -- I didn't get much sleep), all of the CREATECGROUPNS synchronisation is not necessary because we implicitly already put the container process inside the target cgroup before any children were forked. Though I might be wrong about this part -- #1884 was an issue with the old setup (where we only did Apply(p.pid()) without the CREATECGROUPNS synchronisation).

kolyshkin · 2020-06-02T02:46:06Z

I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c.

Looks like it is. Trying to remove the second Apply().

Apply() determines and creates cgroup path(s), configures parent cgroups (for some v1 controllers), and creates a systemd unit (in case of a systemd cgroup manager), then adds a pid specified to the cgroup for all configured controllers. This is a relatively heavy procedure (in particular, for cgroups v1 it involves parsing /proc/self/mountinfo about a dozen times), and it seems there is no need to do it twice. More to say, even merely adding the child pid to the same cgroup seems redundant, as we added the parent pid to the cgroup before sending the data to the child (runc init process), and it waits for the data before doing clone(), so its children will be in the same cgroup anyway. Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin · 2020-06-02T03:24:24Z

I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c.

Looks like it is.

Hmm, OTOH it was the same was at the time of commit df3fa11, and even at the time of commit 69663f0 (mentioned in #1884) so again I am not sure what am I missing here :(

BTW, #1884 (comment) says

When playing with runc code, it appears that the error disappears if we retry running
cgroups.EnterPid on failure (which might indicate this is a transient/race issue).

So, my current (and rather weak) theory is that the second Apply() was just a (wrong, but working) workaround for an issue that is actually fixed later by #1950.

kolyshkin · 2020-06-02T18:40:06Z

OK, I found out that #1184 was not adding the second Apply() but then #1916 (which is a carry of #1184) did. Again, I am not sure why, maybe it was an error during rebase (there was definitely an error of adding a second duplicate defer statement to that function, see discussion in #1916 (comment), fixed in #2238).

kolyshkin · 2020-06-02T18:44:57Z

I read the code through a few times and am pretty confident the second apply is useless.

@mrunalp @giuseppe @cyphar @AkihiroSuda PTAL

mrunalp · 2020-06-04T14:36:54Z

@AkihiroSuda @cyphar ptal.

cyphar

I agree that the second Allow isn't necessary. As I said, I'm also fairly sure that the CREATECGROUPNS synchronisation is also unnecessary -- but I'll submit a follow-up PR which removes that.

kolyshkin force-pushed the apply-twice branch from 27b9830 to 0b88b16 Compare June 1, 2020 01:32

kolyshkin force-pushed the apply-twice branch from 0b88b16 to 5de2564 Compare June 2, 2020 01:58

kolyshkin force-pushed the apply-twice branch from 5de2564 to d1ba8e3 Compare June 2, 2020 03:06

kolyshkin changed the title ~~[RFC] cgroupManager: add/use AddPid~~ [RFC] (*initProcess).start: rm second Apply Jun 2, 2020

kolyshkin changed the title ~~[RFC] (*initProcess).start: rm second Apply~~ (*initProcess).start: rm second Apply Jun 2, 2020

mrunalp approved these changes Jun 2, 2020

View reviewed changes

cyphar approved these changes Jun 4, 2020

View reviewed changes

cyphar closed this in 2a04669 Jun 4, 2020

cyphar merged commit 2a04669 into opencontainers:master Jun 4, 2020

kolyshkin mentioned this pull request Jun 17, 2020

[DO NOT MERGE] cgroup v1: optimize code to avoid parsing mountinfo if possible #2438

Closed

20 tasks

kolyshkin mentioned this pull request Jul 3, 2020

libct/cgroups/fs: simple optimization #2495

Merged

kolyshkin mentioned this pull request Jul 27, 2021

runc init: remove some code #3113

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(*initProcess).start: rm second Apply #2446

(*initProcess).start: rm second Apply #2446

kolyshkin commented Jun 1, 2020 •

edited

Loading

kolyshkin commented Jun 1, 2020

cyphar commented Jun 1, 2020

kolyshkin commented Jun 2, 2020

kolyshkin commented Jun 2, 2020 •

edited

Loading

kolyshkin commented Jun 2, 2020

kolyshkin commented Jun 2, 2020

mrunalp commented Jun 4, 2020

cyphar left a comment

(*initProcess).start: rm second Apply #2446

(*initProcess).start: rm second Apply #2446

Conversation

kolyshkin commented Jun 1, 2020 • edited Loading

kolyshkin commented Jun 1, 2020

cyphar commented Jun 1, 2020

kolyshkin commented Jun 2, 2020

kolyshkin commented Jun 2, 2020 • edited Loading

kolyshkin commented Jun 2, 2020

kolyshkin commented Jun 2, 2020

mrunalp commented Jun 4, 2020

cyphar left a comment

Choose a reason for hiding this comment

kolyshkin commented Jun 1, 2020 •

edited

Loading

kolyshkin commented Jun 2, 2020 •

edited

Loading