
Refactor cgroup and kubelet enforceNodeAllocatable #10714

Open
VannTen wants to merge 5 commits into master from cleanup/cgroup_hierarchy

Conversation

@VannTen (Contributor) commented Dec 12, 2023

What type of PR is this?
/kind design
Documentation/Bug are probably relevant as well

What this PR does / why we need it:
#9209 introduced knobs for using https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

However, IMO, it leaks a number of implementation details to the user that it should not.

This is an attempt to streamline the configuration and reduce the variance of the Jinja templates.

See the commit messages for a more detailed explanation.

Closes #8870 (this PR is a superset of it).
Requires #10643.
Also, this should fix `journalctl -u kubelet` not showing kubelet logs (because the kubelet jumped out of the service cgroup).

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

action required
`kubelet_enforce_node_allocatable` is removed; use `enforce_allocatable_{pods,kube_reserved,system_reserved}` instead.
`system_reserved` and `kube_reserved` are removed; use `enforce_allocatable_{kube,system}_reserved` instead.
The kubelet and container manager services now run in `runtime.slice` by default.
`{kube,system}_reserved_cgroups` and `{kube,system}_reserved_cgroups_for_service` are removed; use `{kube,system}_slice` to customize their slice.
The system slice is no longer created by the kubelet on startup.
The kube slice creation is delegated to systemd.
The system and kube slices are now independent of the cgroup driver used.
For cri-o with the systemd cgroup driver, use the `kube_slice` variable for `conmon_cgroup` instead of "system.slice".
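
For illustration, a minimal group_vars sketch of the new knobs as I read the release note above (values are illustrative, not necessarily the shipped defaults; the sample inventory is authoritative):

```yaml
# Each toggle adds the matching entry to the kubelet's enforceNodeAllocatable list.
enforce_allocatable_pods: true
enforce_allocatable_kube_reserved: false
enforce_allocatable_system_reserved: false

# Single knobs for the cgroup layout; everything else is derived from these.
kube_slice: runtime.slice    # kubelet and container engine run here
system_slice: system.slice   # other node daemons
```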

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/design, and cncf-cla: yes labels on Dec 12, 2023
@k8s-ci-robot added the size/L label on Dec 12, 2023
@VannTen (Contributor, Author) commented Dec 12, 2023

/cc @MrFreezeex
(as you fixed some stuff on that)
and
/cc @shelmingsong
(original author)

@k8s-ci-robot (Contributor)

@VannTen: GitHub didn't allow me to request PR reviews from the following users: shelmingsong.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @MrFreezeex
(as you fixed some stuff on that)
and
/cc @shelmingsong
(original author)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MrFreezeex (Member)

/cc @MrFreezeex (as you fixed some stuff on that) and /cc @shelmingsong (original author)

From a rough look it looks nice; I will probably take a deeper look once the other PR is merged. If you could make sure that the kubelet is not complaining at runtime about the configured cgroup slice, that would be nice (that was the issue I fixed).

@VannTen (Contributor, Author) commented Dec 12, 2023

Forgot to mention that this requires #10643 (it's built on top of it).

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 02b96f5 to 4ea8f19 on December 14, 2023 at 13:49
@k8s-ci-robot added the needs-rebase label on Jan 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 22, 2024
@VannTen (Contributor, Author) commented Apr 22, 2024 via email

@k8s-ci-robot removed the lifecycle/stale label on Apr 22, 2024
@k8s-ci-robot (Contributor)

@VannTen: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

/remove-lifecycle stale
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rptaylor (Contributor) commented Jul 9, 2024

#9692 may be relevant to consider here.

@VannTen (Contributor, Author) commented Aug 27, 2024 via email

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 4ea8f19 to 268d635 on August 29, 2024 at 09:57
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Aug 29, 2024
@VannTen (Contributor, Author) commented Aug 29, 2024

/ok-to-test

@k8s-ci-robot added the ok-to-test label on Aug 29, 2024
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 268d635 to faad4b9 on August 29, 2024 at 13:14
@k8s-ci-robot added the do-not-merge/hold and tide/merge-method-merge labels on Aug 29, 2024
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from faad4b9 to 6a48ce0 on August 29, 2024 at 13:22

@rptaylor (Contributor)

IMO our current variables are not super clear, and don't distinguish clearly between reserving resources (aka, reducing allocatable) and enforcing (add hard limits on the cgroup slice for system / kubelet).

Yes, exactly. As far as documentation goes, please note that #11367 fixed the behaviour of kube_reserved introduced by #9209, but now the behaviour doesn't really match the variable name. Strictly speaking, resource reservation always occurs, but if kube_reserved is true, enforcement of resource limits is also applied to the kubelet and container engine.

It looks like this PR renames the kube_reserved variable to enforce_allocatable_kube_reserved? That is clearer. Thanks!
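
To make the reservation-vs-enforcement distinction concrete, a hedged KubeletConfiguration excerpt (illustrative values, not the exact output of the templates):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# kubeReserved is always subtracted when computing Node Allocatable...
kubeReserved:
  cpu: 200m
  memory: 512Mi
# ...but hard limits are only applied to a cgroup when the matching entry is
# present here. "pods" is the usual default; "kube-reserved" is what a knob
# like enforce_allocatable_kube_reserved would opt into.
enforceNodeAllocatable:
  - pods
  - kube-reserved
# Required by the kubelet when "kube-reserved" is enforced.
kubeReservedCgroup: /runtime.slice
```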

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 6a48ce0 to 1202e19 on August 30, 2024 at 07:17
@k8s-ci-robot removed the needs-rebase label on Aug 30, 2024
@VannTen (Contributor, Author) commented Aug 30, 2024 via email

@VannTen force-pushed the cleanup/cgroup_hierarchy branch 4 times, most recently from 375fc75 to b80680c, on August 30, 2024 at 12:54
@VannTen (Contributor, Author) commented Aug 30, 2024

Does anyone have experience with cri-o?

I'm trying to figure out what conmon_cgroup is supposed to match. I'm pretty sure it should not be kube_slice (i.e. the cgroup for the kubelet and container engine) but rather the slice where the pods go (kubepods.slice), but I can't find a definitive answer on that.

@VannTen force-pushed the cleanup/cgroup_hierarchy branch 3 times, most recently from a959893 to 1c24b56, on September 6, 2024 at 07:40
@VannTen (Contributor, Author) commented Sep 23, 2024

/unhold
Now that #10643 has been merged

@k8s-ci-robot removed the do-not-merge/hold label on Sep 23, 2024
For the kubelet slice, we let systemd do it for us by specifying a slice
in the unit file; it's implicitly created on service start.

For the system slice, it's not the kubelet's responsibility to create it.
See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved,
which explicitly states: "Note that kubelet does not create
--system-reserved-cgroup if it doesn't exist".

systemd takes care of creating that cgroup for us; we only have to point the
kubelet to it if needed.
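
A minimal sketch of the resulting split of responsibilities, assuming the default slices (KubeletConfiguration excerpt; field names are the upstream ones):

```yaml
# The kubelet only *points* at these cgroups, it never creates them:
#   - runtime.slice is created implicitly by systemd when kubelet.service
#     (and the container engine unit) start inside that slice
#   - system.slice always exists on a systemd host
kubeReservedCgroup: /runtime.slice
systemReservedCgroup: /system.slice
```
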
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 1c24b56 to 8b6882b on September 23, 2024 at 14:01
@VannTen (Contributor, Author) commented Oct 21, 2024

So, reading cri-o/cri-o#2248, conmon_cgroup is largely a parallel concern to kube_reserved and friends. It's per pod, so basically it's part of PodOverhead.
I was going to scrap that config entirely from CRI-O, given that it should not be tied to kube_reserved in any way AFAIK.
If someone wants a way to specifically configure that CRI-O feature, it should be another PR.

Hmm. I'm not completely sure after reading some issues in the cri-o repo. Let's stick to "pod" for cgroupfs and a slice for systemd (using `kube_slice` rather than "system.slice", since this is more about the runtime than the "other node daemons").
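
A sketch of the choice described above, expressed as a kubespray default (the names `crio_conmon_cgroup` and `crio_cgroup_manager` are assumptions for illustration, not necessarily the real variables):

```yaml
# conmon is per-pod overhead, so with the cgroupfs manager it should live in the
# pod's own cgroup ("pod"); with the systemd manager cri-o expects a slice, and
# the runtime slice fits better than system.slice.
crio_conmon_cgroup: "{{ 'pod' if crio_cgroup_manager == 'cgroupfs' else kube_slice }}"
```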

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 8b6882b to 2d57864 on October 21, 2024 at 14:04
@VannTen (Contributor, Author) commented Oct 21, 2024

/cc @MrFreezeex
PTAL

* We don't need to organize the cgroup hierarchy differently when resource
  reservation is not used, so remove that variance and always place the
  kubelet in the same place (defaults to /runtime.slice/kubelet.service).

* Same for the container "runtimes" (which really means the container
  **engines**, i.e. containerd or cri-o, not runc or kata).

* Accordingly, there is no need for a lot of customization of the cgroup
  hierarchy, so reduce it to `kube_slice` and `system_slice`. Everything
  else is derived from those and is not user-modifiable.

* Correct the semantics of kube_reserved and system_reserved:
  - kube-reserved and system-reserved do not guarantee on their own that
    resources will be available for the respective cgroups; they are only
    used to calculate NodeAllocatable (a worked example follows these notes).
    See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable
  - Setting {kube,system}ReservedCgroup does not make the kubelet enforce
    the limits; adding the corresponding entry in enforceNodeAllocatable does.
  - more explicit variable names
  - add a warning about enforcing kube and system limits.

* Streamline kubelet resource reservation:
  - remove the "master" variants: those should be handled by group_vars
  - use empty defaults to defer to the kubelet's default configuration

* Exercise the new semantics in CI.
  Remove the cgroups schema, as it's not really actionable; the link to the
  Kubernetes documentation and design doc already covers that.
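
Worked example of the NodeAllocatable calculation, using kubespray's existing reservation variables (names as I understand them; values purely illustrative):

```yaml
# On a node with 8 CPU / 32Gi capacity and the kubelet's default hard eviction
# threshold of 100Mi for memory, these reservations give roughly:
#   allocatable cpu    = 8    - 0.5 - 0.5         = 7
#   allocatable memory = 32Gi - 1Gi - 1Gi - 100Mi ≈ 29.9Gi
# Whether the reservations are also *enforced* as cgroup limits is controlled
# separately by enforce_allocatable_{kube,system}_reserved.
kube_cpu_reserved: 500m
kube_memory_reserved: 1Gi
system_cpu_reserved: 500m
system_memory_reserved: 1Gi
```
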
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 2d57864 to 0ade7c3 on October 21, 2024 at 21:46
Comment on lines +2 to +3
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
system_slice_cgroup: "/{{ system_slice.split('-') | join('.slice/') }}/"
@MrFreezeex (Member):

Those seem a bit weird: kube_slice and system_slice are equal to runtime.slice and system.slice respectively, so AFAIU this would be the same as kube_slice and system_slice at the moment?

@VannTen (Contributor, Author):

Not exactly.
It would be /runtime.slice/ and /system.slice/.
This is intended to do the translation from systemd slice units to the corresponding cgroup tree under /sys/fs/cgroup/

so runtime.slice -> /sys/fs/cgroup/runtime.slice/
and nested-runtime.slice -> /sys/fs/cgroup/nested.slice/runtime.slice

But I'm just seeing now that this does not work exactly that way; instead it should be nested-runtime.slice -> /sys/fs/cgroup/nested.slice/nested-runtime.slice

I'll go fix that

@MrFreezeex (Member) commented Oct 24, 2024:

Ah yes, indeed, I forgot the /, but I was more wondering about the split/join, which doesn't do anything here AFAIU (?)

@VannTen (Contributor, Author):

It doesn't do anything for slices immediately under the root slice (like kube.slice).

However, if a kubespray user defines kube_slice: orchestrator-kube.slice, the corresponding cgroup will be
/sys/fs/cgroup/orchestrator.slice/orchestrator-kube.slice/ (I had mistakenly assumed it was orchestrator.slice/kube.slice/, which is what the current code reflects).

However, I need to do more research on that. It seems that the *ReservedCgroup settings may be interpreted differently depending on the cgroupDriver used (cgroupfs/systemd), and in the systemd case the translation from slice to cgroup would be done directly by the kubelet. Not completely sure though, so I've asked on the sig-node Slack ( https://kubernetes.slack.com/archives/C0BP8PW9G/p1729771784322639 )
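
For reference, a sketch only of what the corrected translation could look like if the kubelet turns out not to do it itself (uses Ansible's regex_replace filter; `kube_slice_cgroup` follows the naming in the diff above; not the final implementation):

```yaml
# systemd maps "a-b.slice" to ".../a.slice/a-b.slice": every '-' segment adds a
# *cumulative* ancestor, so "orchestrator-kube.slice" becomes
# "/orchestrator.slice/orchestrator-kube.slice/".
kube_slice_cgroup: |-
  {% set parts = (kube_slice | regex_replace('\.slice$', '')).split('-') -%}
  /{% for i in range(parts | length) %}{{ parts[:i + 1] | join('-') }}.slice/{% endfor %}
```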

@VannTen (Contributor, Author) commented Oct 23, 2024

/retest-failed

@VannTen (Contributor, Author) commented Oct 23, 2024

Relevant: kubernetes/kubernetes#125982

Test the cgroup translation with different container managers and cgroup drivers.
Labels
approved, cncf-cla: yes, kind/design, ok-to-test, release-note-action-required, size/L, tide/merge-method-merge