
Refactor cgroup and kubelet enforceNodeAllocatable #10714

Open
VannTen wants to merge 5 commits into master from cleanup/cgroup_hierarchy

Conversation

@VannTen (Contributor) commented Dec 12, 2023

What type of PR is this?
/kind design
Documentation/Bug are probably relevant as well

What this PR does / why we need it:
#9209 introduced knobs for using https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

However, IMO, it leaks a number of implementation details to the user that it should not.

This is an attempt to streamline the configuration and reduce the variance of the Jinja templates.

See the commit messages for a more detailed explanation.

Closes #8870 (this PR is a superset of it).
Requires #10643.
Also, this should fix `journalctl -u kubelet` not showing kubelet logs (because the kubelet jumped out of the service cgroup).

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

action required
`kubelet_enforce_node_allocatable` is removed; use `enforce_allocatable_{pods,kube_reserved,system_reserved}` instead.
`system_reserved` and `kube_reserved` are removed; use `enforce_allocatable_{kube,system}_reserved` instead.
The kubelet and container manager services now run in `runtime.slice` by default.
`{kube,system}_reserved_cgroups` and `{kube,system}_reserved_cgroups_for_service` are removed; use `{kube,system}_slice` to customize their slice.
The system slice is no longer created by the kubelet on startup.
The kube slice creation is delegated to systemd.
The system and kube slices are now independent of the cgroup driver used.
For cri-o with the systemd cgroup driver, use the `kube_slice` variable for `conmon_cgroup` instead of "system.slice".
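
For illustration, a minimal group_vars sketch of the new knobs as I read the release note above (values are illustrative, not necessarily the shipped defaults; the sample inventory is authoritative):

```yaml
# Each toggle adds the matching entry to the kubelet's enforceNodeAllocatable list.
enforce_allocatable_pods: true
enforce_allocatable_kube_reserved: false
enforce_allocatable_system_reserved: false

# Single knobs for the cgroup layout; everything else is derived from these.
kube_slice: runtime.slice    # kubelet and container engine run here
system_slice: system.slice   # other node daemons
```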

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/design, and cncf-cla: yes labels on Dec 12, 2023
@k8s-ci-robot added the size/L label on Dec 12, 2023
@VannTen (Contributor, Author) commented Dec 12, 2023

/cc @MrFreezeex
(as you fixed some stuff on that)
and
/cc @shelmingsong
(original author)

@k8s-ci-robot (Contributor)

@VannTen: GitHub didn't allow me to request PR reviews from the following users: shelmingsong.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @MrFreezeex
(as you fixed some stuff on that)
and
/cc @shelmingsong
(original author)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@MrFreezeex (Member)

/cc @MrFreezeex (as you fixed some stuff on that) and /cc @shelmingsong (original author)

From a rough look it looks nice; I will probably take a deeper look once the other PR is merged. If you could make sure that the kubelet is not complaining at runtime about the configured cgroup slice, that would be nice (that was the issue I fixed).

@VannTen (Contributor, Author) commented Dec 12, 2023

Forgot to mention that this requires #10643 (it's built on top of it).

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 02b96f5 to 4ea8f19 on December 14, 2023 at 13:49
@k8s-ci-robot added the needs-rebase label on Jan 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Apr 22, 2024
@VannTen (Contributor, Author) commented Apr 22, 2024 via email

@k8s-ci-robot removed the lifecycle/stale label on Apr 22, 2024
@k8s-ci-robot (Contributor)

@VannTen: The lifecycle/frozen label cannot be applied to Pull Requests.

In response to this:

/remove-lifecycle stale
/lifecycle frozen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rptaylor (Contributor) commented Jul 9, 2024

#9692 may be relevant to consider here.

@VannTen (Contributor, Author) commented Aug 27, 2024 via email

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 4ea8f19 to 268d635 on August 29, 2024 at 09:57
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: VannTen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Aug 29, 2024
@VannTen (Contributor, Author) commented Aug 29, 2024

/ok-to-test

@k8s-ci-robot added the ok-to-test label on Aug 29, 2024
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 268d635 to faad4b9 on August 29, 2024 at 13:14
@k8s-ci-robot added the do-not-merge/hold and tide/merge-method-merge labels on Aug 29, 2024
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from faad4b9 to 6a48ce0 on August 29, 2024 at 13:22

@rptaylor (Contributor)

IMO our current variables are not super clear, and don't distinguish clearly between reserving resources (aka, reducing allocatable) and enforcing (add hard limits on the cgroup slice for system / kubelet).

Yes, exactly. As far as documentation goes, please note that #11367 fixed the behaviour of kube_reserved introduced by #9209, but now the behaviour doesn't really match the variable name. Strictly speaking, resource reservation always occurs, but if kube_reserved is true, enforcement of resource limits is also applied to the kubelet and container engine.

It looks like this PR renames the kube_reserved variable to enforce_allocatable_kube_reserved? That is clearer. Thanks!
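
To make the reservation-vs-enforcement distinction concrete, a hedged KubeletConfiguration excerpt (illustrative values, not the exact output of the templates):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# kubeReserved is always subtracted when computing Node Allocatable...
kubeReserved:
  cpu: 200m
  memory: 512Mi
# ...but hard limits are only applied to a cgroup when the matching entry is
# present here. "pods" is the usual default; "kube-reserved" is what a knob
# like enforce_allocatable_kube_reserved would opt into.
enforceNodeAllocatable:
  - pods
  - kube-reserved
# Required by the kubelet when "kube-reserved" is enforced.
kubeReservedCgroup: /runtime.slice
```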

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 6a48ce0 to 1202e19 on August 30, 2024 at 07:17
@k8s-ci-robot removed the needs-rebase label on Aug 30, 2024
@VannTen (Contributor, Author) commented Aug 30, 2024 via email

@VannTen force-pushed the cleanup/cgroup_hierarchy branch 4 times, most recently from 375fc75 to b80680c, on August 30, 2024 at 12:54
@VannTen (Contributor, Author) commented Aug 30, 2024

Does anyone have experience with cri-o?

I'm trying to figure out what conmon_cgroup is supposed to match. I'm pretty sure it should not be kube_slice (i.e. the cgroup for the kubelet and container engine) but rather the slice where the pods go (kubepods.slice), but I can't find a definitive answer on that.

@VannTen force-pushed the cleanup/cgroup_hierarchy branch 3 times, most recently from a959893 to 1c24b56, on September 6, 2024 at 07:40
@VannTen (Contributor, Author) commented Sep 23, 2024

/unhold
Now that #10643 has been merged

@k8s-ci-robot removed the do-not-merge/hold label on Sep 23, 2024
For the kubelet slice, we let systemd do it for us by specifying a slice
in the unit file; it's implicitly created on service start.

For the system slice, it's not the kubelet's responsibility to create it.
See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved,
which explicitly states: "Note that kubelet does not create
--system-reserved-cgroup if it doesn't exist".

systemd takes care of creating that cgroup for us; we only have to point the
kubelet to it if needed.
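
A minimal sketch of the resulting split of responsibilities, assuming the default slices (KubeletConfiguration excerpt; field names are the upstream ones):

```yaml
# The kubelet only *points* at these cgroups, it never creates them:
#   - runtime.slice is created implicitly by systemd when kubelet.service
#     (and the container engine unit) start inside that slice
#   - system.slice always exists on a systemd host
kubeReservedCgroup: /runtime.slice
systemReservedCgroup: /system.slice
```
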
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 1c24b56 to 8b6882b on September 23, 2024 at 14:01
@VannTen (Contributor, Author) commented Oct 21, 2024

So, reading cri-o/cri-o#2248, conmon_cgroup is largely a parallel concern to kube_reserved and friends. It's per pod, so basically it's part of PodOverhead.
I was going to scrap that config entirely from CRI-O, given that it should not be tied to kube_reserved in any way AFAIK.
If someone wants a way to specifically configure that CRI-O feature, it should be another PR.

Hmm. I'm not completely sure after reading some issues in the cri-o repo. Let's stick to "pod" for cgroupfs and a slice for systemd (using `kube_slice` rather than "system.slice", since this is more about the runtime than the "other node daemons").
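
A sketch of the choice described above, expressed as a kubespray default (the names `crio_conmon_cgroup` and `crio_cgroup_manager` are assumptions for illustration, not necessarily the real variables):

```yaml
# conmon is per-pod overhead, so with the cgroupfs manager it should live in the
# pod's own cgroup ("pod"); with the systemd manager cri-o expects a slice, and
# the runtime slice fits better than system.slice.
crio_conmon_cgroup: "{{ 'pod' if crio_cgroup_manager == 'cgroupfs' else kube_slice }}"
```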

@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 8b6882b to 2d57864 on October 21, 2024 at 14:04
@VannTen (Contributor, Author) commented Oct 21, 2024

/cc @MrFreezeex
PTAL

* We don't need to organize the cgroup hierarchy differently when resource
  reservation is not used, so remove that variance and always place the
  kubelet in the same place (defaults to /runtime.slice/kubelet.service).

* Same for the container "runtimes" (which really means the container
  **engines**, i.e. containerd or cri-o, not runc or kata).

* Accordingly, there is no need for a lot of customization of the cgroup
  hierarchy, so reduce it to `kube_slice` and `system_slice`. Everything
  else is derived from those and is not user-modifiable.

* Correct the semantics of kube_reserved and system_reserved:
  - kube-reserved and system-reserved do not guarantee on their own that
    resources will be available for the respective cgroups; they are only
    used to calculate NodeAllocatable (a worked example follows these notes).
    See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable
  - Setting {kube,system}ReservedCgroup does not make the kubelet enforce
    the limits; adding the corresponding entry in enforceNodeAllocatable does.
  - more explicit variable names
  - add a warning about enforcing kube and system limits.

* Streamline kubelet resource reservation:
  - remove the "master" variants: those should be handled by group_vars
  - use empty defaults to defer to the kubelet's default configuration

* Exercise the new semantics in CI.
  Remove the cgroups schema, as it's not really actionable; the link to the
  Kubernetes documentation and design doc already covers that.
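
Worked example of the NodeAllocatable calculation, using kubespray's existing reservation variables (names as I understand them; values purely illustrative):

```yaml
# On a node with 8 CPU / 32Gi capacity and the kubelet's default hard eviction
# threshold of 100Mi for memory, these reservations give roughly:
#   allocatable cpu    = 8    - 0.5 - 0.5         = 7
#   allocatable memory = 32Gi - 1Gi - 1Gi - 100Mi ≈ 29.9Gi
# Whether the reservations are also *enforced* as cgroup limits is controlled
# separately by enforce_allocatable_{kube,system}_reserved.
kube_cpu_reserved: 500m
kube_memory_reserved: 1Gi
system_cpu_reserved: 500m
system_memory_reserved: 1Gi
```
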
@VannTen force-pushed the cleanup/cgroup_hierarchy branch from 2d57864 to 0ade7c3 on October 21, 2024 at 21:46
Comment on lines +2 to +3
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
system_slice_cgroup: "/{{ system_slice.split('-') | join('.slice/') }}/"
@MrFreezeex (Member):

Those seem a bit weird: kube_slice and system_slice are equal to runtime.slice and system.slice respectively, so AFAIU this would be the same as kube_slice and system_slice at the moment?

@VannTen (Contributor, Author):

Not exactly.
It would be /runtime.slice/ and /system.slice/.
This is intended to do the translation from systemd slice units to the corresponding cgroup tree under /sys/fs/cgroup/

so runtime.slice -> /sys/fs/cgroup/runtime.slice/
and nested-runtime.slice -> /sys/fs/cgroup/nested.slice/runtime.slice

But I'm just seeing now that this does not work exactly that way; instead it should be nested-runtime.slice -> /sys/fs/cgroup/nested.slice/nested-runtime.slice

I'll go fix that

@MrFreezeex (Member) commented Oct 24, 2024:

Ah yes, indeed, I forgot the /, but I was more wondering about the split/join, which doesn't do anything here AFAIU (?)

@VannTen (Contributor, Author):

It doesn't do anything for slices immediately under the root slice (like kube.slice).

However, if a kubespray user defines kube_slice: orchestrator-kube.slice, the corresponding cgroup will be
/sys/fs/cgroup/orchestrator.slice/orchestrator-kube.slice/ (I had mistakenly assumed it was orchestrator.slice/kube.slice/, which is what the current code reflects).

However, I need to do more research on that. It seems that the *ReservedCgroup settings may be interpreted differently depending on the cgroupDriver used (cgroupfs/systemd), and in the systemd case the translation from slice to cgroup would be done directly by the kubelet. Not completely sure though, so I've asked on the sig-node Slack ( https://kubernetes.slack.com/archives/C0BP8PW9G/p1729771784322639 )
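
For reference, a sketch only of what the corrected translation could look like if the kubelet turns out not to do it itself (uses Ansible's regex_replace filter; `kube_slice_cgroup` follows the naming in the diff above; not the final implementation):

```yaml
# systemd maps "a-b.slice" to ".../a.slice/a-b.slice": every '-' segment adds a
# *cumulative* ancestor, so "orchestrator-kube.slice" becomes
# "/orchestrator.slice/orchestrator-kube.slice/".
kube_slice_cgroup: |-
  {% set parts = (kube_slice | regex_replace('\.slice$', '')).split('-') -%}
  /{% for i in range(parts | length) %}{{ parts[:i + 1] | join('-') }}.slice/{% endfor %}
```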

@VannTen (Contributor, Author) commented Oct 23, 2024

/retest-failed

@VannTen (Contributor, Author) commented Oct 23, 2024

Relevant: kubernetes/kubernetes#125982

Test the cgroup translation with different container managers and cgroup drivers.
Labels
approved, cncf-cla: yes, kind/design, ok-to-test, release-note-action-required, size/L, tide/merge-method-merge