Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(*initProcess).start: rm second Apply #2446

Merged
merged 1 commit into from
Jun 4, 2020

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jun 1, 2020

Commit df3fa11 adds a second call to cgroupManager.Apply() method in
order to add a second pid to the cgroup.

While I admit I don't understand the reason why it was added
(even after reading #1916, #1184, and #1884), it seems that
going through all the steps (configure cgroup parameters, figure
out paths to all controllers, create a systemd unit) twice is excessive
and should not be done.

More to say, even merely adding the child pid to the same cgroup seems
redundant, as we added the parent pid to the cgroup before sending the
data to the child (runc init process), and it waits for the data before
doing clone(), so its children will be in the same cgroup anyway.

Needs a very careful review.

@kolyshkin
Copy link
Contributor Author

The thing I do not understand is why Apply() is called twice here:

// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
// because we'd be using the rootless cgroup manager in that case.
if err := p.manager.Apply(p.pid()); err != nil {
return newSystemErrorWithCause(err, "applying cgroup configuration for process")
}
if p.intelRdtManager != nil {
if err := p.intelRdtManager.Apply(p.pid()); err != nil {
return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
}
}
if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
return newSystemErrorWithCause(err, "copying bootstrap data to pipe")
}
childPid, err := p.getChildPid()
if err != nil {
return newSystemErrorWithCause(err, "getting the final child's pid from pipe")
}
// Save the standard descriptor names before the container process
// can potentially move them (e.g., via dup2()). If we don't do this now,
// we won't know at checkpoint time which file descriptor to look up.
fds, err := getPipeFds(childPid)
if err != nil {
return newSystemErrorWithCausef(err, "getting pipe fds for pid %d", childPid)
}
p.setExternalDescriptors(fds)
// Do this before syncing with child so that no children
// can escape the cgroup
if err := p.manager.Apply(childPid); err != nil {
return newSystemErrorWithCause(err, "applying cgroup configuration for process")
}
if p.intelRdtManager != nil {
if err := p.intelRdtManager.Apply(childPid); err != nil {
return newSystemErrorWithCause(err, "applying Intel RDT configuration for process")
}
}

Presumably it is to fix #1884, and I might buy into adding the second pid, but can't understand why we need to say create the systemd slice for the second time, set all limits for the second time etc.

@cyphar
Copy link
Member

cyphar commented Jun 1, 2020

I noticed the double-Apply recently and I agree that it doesn't make any sense to me. In fact, I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c (meaning that all the container processes are in the container). Note that p.pid() is the PID of the runc init process.

In fact, unless I'm misunderstanding the code (which is quite possible -- I didn't get much sleep), all of the CREATECGROUPNS synchronisation is not necessary because we implicitly already put the container process inside the target cgroup before any children were forked. Though I might be wrong about this part -- #1884 was an issue with the old setup (where we only did Apply(p.pid()) without the CREATECGROUPNS synchronisation).

@kolyshkin
Copy link
Contributor Author

I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c.

Looks like it is. Trying to remove the second Apply().

Apply() determines and creates cgroup path(s), configures parent cgroups
(for some v1 controllers), and creates a systemd unit (in case of a
systemd cgroup manager), then adds a pid specified to the cgroup
for all configured controllers.

This is a relatively heavy procedure (in particular, for cgroups v1 it
involves parsing /proc/self/mountinfo about a dozen times), and it seems
there is no need to do it twice.

More to say, even merely adding the child pid to the same cgroup seems
redundant, as we added the parent pid to the cgroup before sending the
data to the child (runc init process), and it waits for the data before
doing clone(), so its children will be in the same cgroup anyway.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin changed the title [RFC] cgroupManager: add/use AddPid [RFC] (*initProcess).start: rm second Apply Jun 2, 2020
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Jun 2, 2020

I'm fairly sure that setting the PID isn't necessary either -- we Apply once before we even send the config data to the runc init process so it hasn't gotten to the point of all the forking garbage in nsexec.c.

Looks like it is.

Hmm, OTOH it was the same was at the time of commit df3fa11, and even at the time of commit 69663f0 (mentioned in #1884) so again I am not sure what am I missing here :(

BTW, #1884 (comment) says

When playing with runc code, it appears that the error disappears if we retry running
cgroups.EnterPid on failure (which might indicate this is a transient/race issue).

So, my current (and rather weak) theory is that the second Apply() was just a (wrong, but working) workaround for an issue that is actually fixed later by #1950.

@kolyshkin
Copy link
Contributor Author

OK, I found out that #1184 was not adding the second Apply() but then #1916 (which is a carry of #1184) did. Again, I am not sure why, maybe it was an error during rebase (there was definitely an error of adding a second duplicate defer statement to that function, see discussion in #1916 (comment), fixed in #2238).

@kolyshkin kolyshkin changed the title [RFC] (*initProcess).start: rm second Apply (*initProcess).start: rm second Apply Jun 2, 2020
@kolyshkin
Copy link
Contributor Author

I read the code through a few times and am pretty confident the second apply is useless.

@mrunalp @giuseppe @cyphar @AkihiroSuda PTAL

@mrunalp
Copy link
Contributor

mrunalp commented Jun 4, 2020

@AkihiroSuda @cyphar ptal.

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the second Allow isn't necessary. As I said, I'm also fairly sure that the CREATECGROUPNS synchronisation is also unnecessary -- but I'll submit a follow-up PR which removes that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants