Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

podman: add new cgroup mode split #6666

Merged
merged 1 commit into from
Jun 29, 2020

Conversation

giuseppe
Copy link
Member

@giuseppe giuseppe commented Jun 18, 2020

add a new cgroup mode split.

When running under systemd there is no need to create yet another cgroup for the container.

With conmon-delegated the current cgroup will be split in two sub cgroups:

- supervisor
- container

The supervisor cgroup will hold conmon and the podman process, while the container cgroup is used by the OCI runtime (using the cgroupfs backend).

Closes: #6400

Signed-off-by: Giuseppe Scrivano [email protected]

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 18, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2020
@giuseppe
Copy link
Member Author

@vrothberg PTAL

Copy link
Member

@vrothberg vrothberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit messages states conmon-delegated but the code adds a no-conmon-delegated. I'd guess we should update the docs as well.

Do we need to update the systemd unit files in pkg/system/generate?

@@ -1505,6 +1506,17 @@ func (c *Container) getOCICgroupPath() (string, error) {
case (rootless.IsRootless() && !unified) || c.config.NoCgroups:
return "", nil
case c.runtime.config.Engine.CgroupManager == config.SystemdCgroupsManager:
if c.config.CgroupsMode == "no-conmon-delegated" {
Copy link
Member

@vrothberg vrothberg Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make "no-conmon-delegated" a constant somewhere?

Copy link
Member Author

@giuseppe giuseppe Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit messages states conmon-delegated but the code adds a no-conmon-delegated. I'd guess we should update the docs as well.

Thanks, I've renamed it to conmon-delegated and added a constant

Do we need to update the systemd unit files in pkg/system/generate?

I don't think we can do that yet. crun needs containers/crun#409

@giuseppe giuseppe force-pushed the conmon-delegate branch 3 times, most recently from 0148073 to 8832f34 Compare June 18, 2020 12:35

Determines whether the container will create CGroups.

Default is **enabled**. The **disabled** option will force the container
to not create CGroups, and thus conflicts with CGroup options
(**--cgroupns** and **--cgroup-parent**).
The **no-conmon** option disables a new CGroup only for the **conmon** process.
The **conmon-delegated** option reuses the current cgroup for both the conmon and the container payload, it works only on cgroup v2.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's also specific to systemd cgroups (looks like it is) we should also mention that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure it is so specific to systemd. It can be used also without systemd, even if it is less useful

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will also break remote. We probably need a check to not permit the option being used as remote as it changes the cgroup of the Podman process itself

@mheon
Copy link
Member

mheon commented Jun 18, 2020

Overall, I like this. I think it does break podman stats/podman top though, and it probably doesn't need to - we just need to look in the current cgroup instead

@giuseppe giuseppe force-pushed the conmon-delegate branch 3 times, most recently from e168a7c to 173e414 Compare June 18, 2020 14:44
@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

Is there a reason why you did cgroups 2 only? This approach works for cgroups 1:
goochjj@3d6ec20

I'd happily make this coexist if need be.

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

For that matter, why is cgroup-parent so draconian? It only allows items to end in .slice.

And yet this works:
"cgroupsPath": "/machine.slice/rwhod.service/container",

in config.json

Container cgroups could be implemented with --cgroup-parent /machine.slice/%n/container, if the cgroup-parent check weren't rejecting it. This commit proves it works.

Would a better implementation be implementing --conmon-cgroup-parent /machine.slice/%n/supervisor?

@giuseppe
Copy link
Member Author

Container cgroups could be implemented with --cgroup-parent /machine.slice/%n/container, if the cgroup-parent check weren't rejecting it. This commit proves it works.

we could relax them.

Would a better implementation be implementing --conmon-cgroup-parent /machine.slice/%n/supervisor?

The current implementation works well also without systemd, i.e. you have a cgroup you own and you want to create both conmon+container cgroup in it.

@giuseppe
Copy link
Member Author

Is there a reason why you did cgroups 2 only? This approach works for cgroups 1:
goochjj@3d6ec20

I'd happily make this coexist if need be.

just simplicity. We can surely add cgroup v1 support. It must be done for each controller separately, it is not enough to get the cgroup from the pid controller

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

Is there a reason why you did cgroups 2 only? This approach works for cgroups 1:
goochjj@3d6ec20
I'd happily make this coexist if need be.

just simplicity. We can surely add cgroup v1 support. It must be done for each controller separately, it is not enough to get the cgroup from the pid controller

I read the cgroup from the pid controller, but in the Move call, I move all of them... that should do it, right?

I don't know if pids will always be 3 - I can fix the find if need be.

@giuseppe
Copy link
Member Author

I read the cgroup from the pid controller, but in the Move call, I move all of them... that should do it, right?

I don't know if pids will always be 3 - I can fix the find if need be.

there could be different cgroups for each controller, the safest is to read each controller and then do the move as part of the same loop for that controller

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

I read the cgroup from the pid controller, but in the Move call, I move all of them... that should do it, right?
I don't know if pids will always be 3 - I can fix the find if need be.

there could be different cgroups for each controller, the safest is to read each controller and then do the move as part of the same loop for that controller

... except you were using the same call to determine the name of the supervisor cgroup - so I could only return 1

So should the supervisor cgroup name be based on the pids namespace, and the move do something else? i.e. just tack "container" onto the end? tack "container" onto the end if it's not /?

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

@giuseppe
giuseppe#2

A couple thoughts here:

  1. The cgroup of the container is handled by the OCI runtime, as passed into the bundle. All conmon-delegated does is ensure that cgroup passed in the spec is "/container", which needs an additional function (GetPidCgroup()) to determine what cgroups we were passed.
  2. The cgroup of the conmon process is already handled in oci_conmon_linux... So the new function to move to a subtree shouldn't be necessary... The strategy also uses a "cgroupfs" method for moving the Pids, not a "lets create a systemd transition unit to do it" method. That code already exists in oci_conmon_linux, it just needs the same as above - use of GetPidCgroup to construct the "/supervisor" cgroup path. And the cgroups package already deals with "do all the controllers" logic.

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

Overall, I like this. I think it does break podman stats/podman top though, and it probably doesn't need to - we just need to look in the current cgroup instead

FYI top is working for me - it properly shows the /slice/unit/container cgroup

Stats is not.

@giuseppe
Copy link
Member Author

1. The cgroup of the container is handled by the OCI runtime, as passed into the bundle.  All conmon-delegated does is ensure that cgroup passed in the spec is "/container", which needs an additional function (GetPidCgroup()) to determine what cgroups we were passed.

2. The cgroup of the conmon process is already handled in oci_conmon_linux... So the new function to move to a subtree shouldn't be necessary... The strategy also uses a "cgroupfs" method for moving the Pids, not a "lets create a systemd transition unit to do it" method.  That code already exists in oci_conmon_linux, it just needs the same as above - use of GetPidCgroup to construct the "/supervisor" cgroup path.  And the cgroups package already deals with "do all the controllers" logic.

for cgroup v2 there cannot be processes in the inner groups. We need to make sure the conmon process is already in the right sub cgroup before it starts (or have a way to communicate to conmon to move itself). We need to do such steps before setting up the container, the previous code path cannot be used.

... except you were using the same call to determine the name of the supervisor cgroup - so I could only return 1

there is only one hierarchy in cgroup v2, so that is enough.

In any case, I think I can just make my PR more generic to handle cgroup v1 too, even if it is not necessary to move the conmon process in a sub cgroup

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

pr updated w/ podman stats working

I'll be the first to admit, I don't fully understand all the implications, and I don't have a cgv2 machine to test on. I do know this all works on cgv1, and it looks simpler and reuses code paths that were already present and seem to make sense.

But I don't see how podman stats is going to work if you never pass the cgroup parent back through the container object...

Thanks!

@goochjj
Copy link
Contributor

goochjj commented Jun 18, 2020

FCOS appears to be running cg1/2 hybrid, and pkg/cgroups doesn't adjust /sys/fs/cgroup/unified.

Is the long term plan to fix pkg/cgroups or deprecate it?

@goochjj
Copy link
Contributor

goochjj commented Jun 19, 2020

LGTM

@goochjj
Copy link
Contributor

goochjj commented Jun 19, 2020

@mheon @vrothberg PTAL

@mheon
Copy link
Member

mheon commented Jun 19, 2020

We want to hold this until Monday - we're in a semi-feature-freeze state for v2.0.0 release, which should happen in a few hours. Will do a full review once that is cut.

@@ -80,7 +80,7 @@ If the host uses cgroups v1, the default is set to **host**. On cgroups v2 the
Determines whether the container will create CGroups.
Valid values are *enabled*, *disabled*, *no-conmon*, which the default being *enabled*.
The *disabled* option will force the container to not create CGroups, and thus conflicts with CGroup options (**--cgroupns** and **--cgroup-parent**).
The *no-conmon* option disables a new CGroup only for the conmon process.
The *no-conmon* option disables a new CGroup only for the conmon process. The *split-current* option reuses the current cgroup for both the conmon and the container payload.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just call this current? I don't see where split- adds any additional information to the user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may need current in future to really mean current and use the cgroup where Podman is running instead of creating two sub-cgroups

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reuse-current or perhaps just reuse then?

docs/source/markdown/podman-run.1.md Outdated Show resolved Hide resolved
libpod/oci_conmon.go Outdated Show resolved Hide resolved
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@giuseppe giuseppe changed the title podman: add new cgroup mode split-current podman: add new cgroup mode split Jun 25, 2020
@mheon
Copy link
Member

mheon commented Jun 25, 2020

Can you add the cgroup-parent validation to validate() in libpod/container_validate.go?

When running under systemd there is no need to create yet another
cgroup for the container.

With conmon-delegated the current cgroup will be split in two sub
cgroups:

- supervisor
- container

The supervisor cgroup will hold conmon and the podman process, while
the container cgroup is used by the OCI runtime (using the cgroupfs
backend).

Closes: containers#6400

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Copy link
Member Author

Can you add the cgroup-parent validation to validate() in libpod/container_validate.go?

added and pushed a new version

@giuseppe
Copy link
Member Author

ready for review

@mheon
Copy link
Member

mheon commented Jun 29, 2020

LGTM

1 similar comment
@goochjj
Copy link
Contributor

goochjj commented Jun 29, 2020

LGTM

The **no-conmon** option disables a new CGroup only for the **conmon** process.
The **split** option splits the current cgroup in two sub-cgroups: one for conmon and one for the container payload. It is not possible to set **--cgroup-parent** with **split**.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really, really, really need to create a conmon man page that we can add a link to for pages like this. @haircommander any chance of making this happen, even if the page is terribly short and sweet?

This is not a holdback for this review at this time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@TomSweeneyRedHat TomSweeneyRedHat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mheon
Copy link
Member

mheon commented Jun 29, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 29, 2020
@openshift-merge-robot openshift-merge-robot merged commit 6ac009d into containers:master Jun 29, 2020
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 24, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pursuing conventional systemd+podman interaction
9 participants