Refactor cgroup and kubelet enforceNodeAllocatable #10714
base: master
Conversation
Skipping CI for Draft Pull Request.
/cc @MrFreezeex
@VannTen: GitHub didn't allow me to request PR reviews from the following users: shelmingsong. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
From a rough look it looks nice; I will probably take a deeper look once the other PR is merged. If you could make sure that the kubelet is not complaining at runtime about the configured cgroup slice, that would be nice (that was the issue I fixed).
Forgot to mention that this requires #10643 (it's on top of it).
(force-pushed from 02b96f5 to 4ea8f19)
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
@VannTen: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
#9692 may be relevant to consider here.
I skimmed that issue, but it might indeed be relevant. IMO our current variables are not super clear and don't distinguish clearly between reserving resources (i.e., reducing allocatable) and enforcing them (adding hard limits on the cgroup slice for system/kubelet).
This PR does not fundamentally change how it works, but it should be easier for users to navigate; at least that's the goal.
(force-pushed from 4ea8f19 to 268d635)
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: VannTen. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/ok-to-test
(force-pushed from 268d635 to faad4b9)
(force-pushed from faad4b9 to 6a48ce0)
Yes exactly. As far as documentation goes, please note #11367 fixed the behaviour of kube_reserved that was introduced by #9209, but now the behaviour doesn't really match the variable name. Strictly speaking, resource reservation always occurs, but if kube_reserved is true, enforcement of resource limits is applied to the kubelet and container engine. It looks like this PR renames the kube_reserved variable to enforce_allocatable_kube_reserved? That is clearer. Thanks!
(force-pushed from 6a48ce0 to 1202e19)
Yes it does.
I also put the kubelet and runtime in a dedicated cgroup **always**, because I don't see a reason to make that customizable: what matters is the enforceNodeAllocatable config, which decides whether or not those cgroups get hard limits.
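For readers following along, a minimal KubeletConfiguration sketch of that distinction (the field names are the upstream kubelet ones; the values and the enforceNodeAllocatable entries shown here are placeholders, not what this PR ships):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Reservation only reduces NodeAllocatable...
kubeReserved:
  cpu: 200m
  memory: 512Mi
systemReserved:
  cpu: 200m
  memory: 512Mi
# ...while the cgroups listed in enforceNodeAllocatable get hard limits applied.
kubeReservedCgroup: /runtime.slice    # must match where kubelet/engine actually run
systemReservedCgroup: /system.slice   # must already exist; the kubelet won't create it
enforceNodeAllocatable:
  - pods             # kubelet default: only pod-level enforcement
  - kube-reserved    # opt-in: hard limits on kubeReservedCgroup
  - system-reserved  # opt-in: hard limits on systemReservedCgroup
```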
(force-pushed from 375fc75 to b80680c)
Does anyone have experience with cri-o? I'm trying to figure out what conmon_cgroup is supposed to match; I'm pretty sure it should not be …
a959893
to
1c24b56
Compare
/unhold
For the kubelet slice, we let systemd do it for us by specifying a slice in the unit file; it's implicitly created on service start. For the system slice, it's not the kubelet's responsibility to create it. See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved , which explicitly says "Note that kubelet does not create --system-reserved-cgroup if it doesn't exist". systemd takes care of creating that for us; we only have to point the kubelet to it if needed.
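As a rough illustration (the task and drop-in file names are hypothetical, not the PR's actual templates), pinning the kubelet into a slice is just a systemd drop-in, and systemd creates the slice when the service starts:

```yaml
# Hypothetical sketch: a drop-in that moves kubelet.service under runtime.slice.
# systemd creates runtime.slice implicitly when the service starts.
- name: Place kubelet into a dedicated slice (illustrative only)
  ansible.builtin.copy:
    dest: /etc/systemd/system/kubelet.service.d/10-slice.conf
    content: |
      [Service]
      Slice=runtime.slice
    mode: "0644"
  # A daemon-reload + kubelet restart handler would be needed; omitted here.
```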
(force-pushed from 1c24b56 to 8b6882b)
Hum, I'm not completely sure after reading some issues in the cri-o repo. Let's stick to pod for cgroupfs and slice for systemd (using kube_slice rather than system.slice, since this is more about the runtime than the "other node daemons").
(force-pushed from 8b6882b to 2d57864)
/cc @MrFreezeex
* We don't need to organize the cgroup hierarchy differently if we don't use the resource reservation, so remove the variance and always place the kubelet at the same place (defaults to /runtime.slice/kubelet.service).
* Same for the container "runtimes" (which in fact means the container **engines**, i.e. containerd and cri-o, not runc or kata).
* Accordingly, there is no need for a lot of customization of the cgroup hierarchy, so reduce it to `kube_slice` and `system_slice`. Everything else is derived from those and not user-modifiable.
* Correct the semantics of kube_reserved and system_reserved:
  - kube-reserved and system-reserved do not guarantee on their own that resources will be available for the respective cgroups; they only go into calculating NodeAllocatable. See https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable
  - Setting {kube,system}ReservedCgroup does not make the kubelet enforce the limits; adding the corresponding entry to enforceNodeAllocatable does.
  - Use more explicit variable names.
  - Add a warning for enforcing kube and system limits.
* Streamline kubelet resource reservation:
  - Remove the "master" variants: those should be handled by group_vars.
  - Use empty defaults to leave them to the kubelet default configuration.
* Exercise the new semantics in CI.
Remove the cgroups schema as it's not really actionable; the link to the Kubernetes documentation and design doc here already covers that.
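For illustration, the reduced knob surface described above could look roughly like this (only kube_slice and system_slice appear in the text above; the derived variable names are hypothetical):

```yaml
# Only the two slices are user-facing; everything else is derived (illustrative names).
kube_slice: runtime.slice     # kubelet + container engine
system_slice: system.slice    # other node daemons
# Derived, not user-modifiable:
kubelet_cgroup: "/{{ kube_slice }}/kubelet.service"
kube_reserved_cgroup: "/{{ kube_slice }}"
system_reserved_cgroup: "/{{ system_slice }}"
```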
(force-pushed from 2d57864 to 0ade7c3)
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
system_slice_cgroup: "/{{ system_slice.split('-') | join('.slice/') }}/"
Those seem a bit weird: kube_slice and system_slice are respectively equal to runtime.slice and system.slice, so AFAIU this would be the same as kube_slice and system_slice atm?
Not exactly. It would be /runtime.slice/ and /system.slice/.
This is intended to do the translation from systemd slice units to the corresponding cgroup tree in /sys/fs/cgroup/, so runtime.slice -> /sys/fs/cgroup/runtime.slice/ and nested-runtime.slice -> /sys/fs/cgroup/nested.slice/runtime.slice.
But I'm just seeing now that this does not work exactly that way; instead it should be nested-runtime.slice -> /sys/fs/cgroup/nested.slice/nested-runtime.slice.
I'll go fix that.
Ah yes, indeed, I forgot the /. But I was more wondering about the split/join, which doesn't do anything here AFAIU (?)
It doesn't do anything for slices immediately under the root slice (like kube.slice). However, if a kubespray user defines kube_slice: orchestrator-kube.slice, the corresponding cgroup will be /sys/fs/cgroup/orchestrator.slice/orchestrator-kube.slice/ (I had mistakenly assumed it was orchestrator.slice/kube.slice/, which the current code reflects).
However, I need to do more research on that. It seems that the *ReservedCgroup settings may be interpreted differently depending on the cgroupDriver used (cgroupfs/systemd), and in the systemd case the translation from slice to cgroup would be done directly by the kubelet. Not completely sure though, so I've asked on the sig-node Slack ( https://kubernetes.slack.com/archives/C0BP8PW9G/p1729771784322639 ).
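For reference, a sketch of that expansion in Jinja, assuming the translation stays on the Kubespray side (whether it is needed at all depends on the sig-node answer above; the variable names mirror the hunk under review):

```yaml
# Sketch: expand a systemd slice name into its full cgroupfs path, so that
# nested-runtime.slice becomes /nested.slice/nested-runtime.slice/ (names assumed).
kube_slice_cgroup: >-
  /{%- set name = kube_slice | regex_replace('\.slice$', '') -%}
  {%- for _ in name.split('-') -%}
  {{ name.split('-')[:loop.index] | join('-') }}.slice/
  {%- endfor -%}
```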
/retest-failed
Relevant: kubernetes/kubernetes#125982
Test the cgroup translation with different container managers and cgroup drivers.
What type of PR is this?
/kind design
Documentation/Bug are probably relevant as well
What this PR does / why we need it:
#9209 introduced knobs for using https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
However, IMO, it leaks a bunch of stuff to the user that it should not.
This is an attempt to streamline the configuration and reduce variance of the jinja templates.
See the commit messages for a more detailed explanation.
closes #8870 (this one is a superset)
Requires #10643
Also, this should fix journalctl -u kubelet not getting kubelet logs (because the kubelet jumped out of the service cgroup).
Special notes for your reviewer:
Does this PR introduce a user-facing change?: