Refactor cgroup and kubelet enforceNodeAllocatable #10714

Open

VannTen wants to merge 5 commits into kubernetes-sigs:master from VannTen:cleanup/cgroup_hierarchy (base: master)
+113 −163
Commits (5, all by VannTen):

- `70920af` Don't manually create *_reserved cgroups
- `cac814c` Refactor cgroup hierarchy handling and resource reservation
- `faed356` kubelet: Fix semantics for *ReservedCgroup and enforceNodeAllocatable
- `0ade7c3` Rework documentation
- `ebbbd21` CI: test several cases for cgroups resources enforcements
Changes to the cgroups documentation:

````diff
@@ -1,73 +1,42 @@
 # cgroups
 
-To avoid resource contention between containers and host daemons in Kubernetes, the kubelet components can use cgroups to limit resource usage.
+To avoid resource contention between containers and host daemons in Kubernetes,
+the kubelet components can use cgroups to limit resource usage.
 
-## Enforcing Node Allocatable
+## Node Allocatable
 
-You can use `kubelet_enforce_node_allocatable` to set node allocatable enforcement.
+Node Allocatable is calculated by subtracting from the node capacity:
 
-```yaml
-# A comma separated list of levels of node allocatable enforcement to be enforced by kubelet.
-kubelet_enforce_node_allocatable: "pods"
-# kubelet_enforce_node_allocatable: "pods,kube-reserved"
-# kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"
-```
+- kube-reserved reservations
+- system-reserved reservations
+- hard eviction thresholds
 
-Note that to enforce kube-reserved or system-reserved, `kube_reserved_cgroups` or `system_reserved_cgroups` needs to be specified respectively.
-
-Here is an example:
+You can set those reservations:
 
 ```yaml
-kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"
-
-# Set kube_reserved to true to run kubelet and container-engine daemons in a dedicated cgroup.
-# This is required if you want to enforce limits on the resource usage of these daemons.
-# It is not required if you just want to make resource reservations (kube_memory_reserved, kube_cpu_reserved, etc.)
-kube_reserved: true
-kube_reserved_cgroups_for_service_slice: kube.slice
-kube_reserved_cgroups: "/{{ kube_reserved_cgroups_for_service_slice }}"
 kube_memory_reserved: 256Mi
 kube_cpu_reserved: 100m
-# kube_ephemeral_storage_reserved: 2Gi
-# kube_pid_reserved: "1000"
+kube_ephemeral_storage_reserved: 2Gi
+kube_pid_reserved: "1000"
 # Reservation for master hosts
 kube_master_memory_reserved: 512Mi
 kube_master_cpu_reserved: 200m
 # kube_master_ephemeral_storage_reserved: 2Gi
 # kube_master_pid_reserved: "1000"
 
-# Set to true to reserve resources for system daemons
-system_reserved: true
-system_reserved_cgroups_for_service_slice: system.slice
-system_reserved_cgroups: "/{{ system_reserved_cgroups_for_service_slice }}"
 # System daemons (sshd, network manager, ...)
 system_memory_reserved: 512Mi
 system_cpu_reserved: 500m
-# system_ephemeral_storage_reserved: 2Gi
-# system_pid_reserved: "1000"
+system_ephemeral_storage_reserved: 2Gi
+system_pid_reserved: "1000"
 # Reservation for master hosts
 system_master_memory_reserved: 256Mi
 system_master_cpu_reserved: 250m
 # system_master_ephemeral_storage_reserved: 2Gi
 # system_master_pid_reserved: "1000"
 ```
-After the setup, the cgroups hierarchy is as follows:
 
-```bash
-/ (Cgroups Root)
-├── kubepods.slice
-│   ├── ...
-│   ├── kubepods-besteffort.slice
-│   ├── kubepods-burstable.slice
-│   └── ...
-├── kube.slice
-│   ├── ...
-│   ├── {{container_manager}}.service
-│   ├── kubelet.service
-│   └── ...
-├── system.slice
-│   └── ...
-└── ...
-```
+By default, the kubelet will enforce Node Allocatable for pods, which means
+pods will be evicted when resource usage exceeds Allocatable.
+You can optionally enforce the reservations for kube-reserved and
+system-reserved, but proceed with caution (see [the kubernetes
+guidelines](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#general-guidelines)).
+
+```yaml
+enforce_allocatable_pods: true # default
+enforce_allocatable_kube_reserved: true
+enforce_allocatable_system_reserved: true
+```
+
 You can learn more in the [official kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/).
````
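The subtraction the documentation describes can be checked with quick arithmetic. This sketch reuses the sample memory reservations from the diff above; the 8Gi node capacity and the `memory.available<100Mi` hard-eviction threshold (the kubelet's default) are assumptions for the example:

```python
# Node Allocatable = Capacity - kube-reserved - system-reserved - hard eviction thresholds
MI = 1024 ** 2
GI = 1024 ** 3

capacity = 8 * GI           # hypothetical node with 8Gi of memory
kube_reserved = 256 * MI    # kube_memory_reserved: 256Mi
system_reserved = 512 * MI  # system_memory_reserved: 512Mi
eviction_hard = 100 * MI    # kubelet default: memory.available<100Mi

allocatable = capacity - kube_reserved - system_reserved - eviction_hard
print(allocatable // MI)  # 7324 MiB left for pods
```

The pod scheduler sees only Allocatable, not raw capacity, which is why over-large reservations silently shrink the room available to workloads.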
New role defaults file, @@ -0,0 +1,8 @@:

```yaml
---
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
system_slice_cgroup: "/{{ system_slice.split('-') | join('.slice/') }}/"
enforce_node_allocatable_stub:
  pods: "{{ enforce_allocatable_pods }}"
  kube-reserved: "{{ enforce_allocatable_kube_reserved }}"
  system-reserved: "{{ enforce_allocatable_system_reserved }}"
enforce_node_allocatable: "{{ enforce_node_allocatable_stub | dict2items | selectattr('value') | map(attribute='key') }}"
```
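The `enforce_node_allocatable` expression reduces the stub dict to the keys whose flags are enabled. In Python terms, the `dict2items | selectattr('value') | map(attribute='key')` pipeline behaves roughly like this sketch (the flag values are illustrative, not defaults from the PR):

```python
# Mimics: enforce_node_allocatable_stub | dict2items | selectattr('value') | map(attribute='key')
flags = {
    "pods": True,             # enforce_allocatable_pods
    "kube-reserved": True,    # enforce_allocatable_kube_reserved
    "system-reserved": False, # enforce_allocatable_system_reserved
}

# selectattr('value') keeps truthy entries; map(attribute='key') extracts the names
enforce_node_allocatable = [key for key, value in flags.items() if value]
print(enforce_node_allocatable)  # ['pods', 'kube-reserved']
```

Driving one list-valued setting from three independent booleans keeps the inventory interface simple while the kubelet still receives its native comma-separated enforcement list.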
Review discussion:

**Reviewer:** Those seem to be a bit weird. `kube_slice` and `system_slice` are equal respectively to `runtime.slice` and `system.slice`, so AFAIU this would be the same as `kube_slice` and `system_slice` atm?

**VannTen:** Not exactly. It would be `/runtime.slice/` and `/system.slice/`. This is intended to make the translation from systemd slice units to the corresponding cgroup tree in `/sys/fs/cgroup/`:

- `runtime.slice` -> `/sys/fs/cgroup/runtime.slice/`
- `nested-runtime.slice` -> `/sys/fs/cgroup/nested.slice/runtime.slice`

But I'm just seeing now that this does not work exactly that way; instead it should be `/sys/fs/cgroup/nested.slice/nested-runtime.slice`. I'll go fix that.

**Reviewer:** Ah yes, indeed, forgot the `/`, but I was more wondering about the split/join, which doesn't do anything here AFAIU (?)

**VannTen:** It doesn't do anything on slices immediately under the root slice (like `kube.slice`). However, if a kubespray user defines `kube_slice: orchestrator-kube.slice`, the corresponding cgroup will be `/sys/fs/cgroup/orchestrator.slice/orchestrator-kube.slice/` (I had mistakenly assumed it was `orchestrator.slice/kube.slice/`, which the current code reflects). However, I need to do more research on that. It seems that maybe the *ReservedCgroup settings are interpreted differently depending on the cgroupDriver used (cgroupfs/systemd), and in the systemd case the translation from slice to cgroup would be done directly by the kubelet. Not completely sure though, so I've asked on slack sig-node (https://kubernetes.slack.com/archives/C0BP8PW9G/p1729771784322639).
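The slice-to-cgroup mapping debated above can be sketched as follows. This follows the corrected behavior VannTen describes, which matches systemd's convention: each dash in a slice unit name introduces a parent slice directory, but the leaf directory keeps the full unit name (unlike what the `split('-') | join('.slice/')` expression in the diff produces):

```python
def slice_to_cgroup(unit: str) -> str:
    """Map a systemd slice unit name to its cgroup path (relative to /sys/fs/cgroup).

    systemd nests slices on dashes: "a-b.slice" lives inside "a.slice",
    and the leaf directory keeps the *full* unit name.
    """
    stem = unit.removesuffix(".slice")
    parts = stem.split("-")
    # every dash-prefix of the name becomes a parent slice directory
    parents = ["-".join(parts[:i]) + ".slice" for i in range(1, len(parts))]
    return "/" + "/".join(parents + [unit]) + "/"

print(slice_to_cgroup("runtime.slice"))            # /runtime.slice/
print(slice_to_cgroup("nested-runtime.slice"))     # /nested.slice/nested-runtime.slice/
print(slice_to_cgroup("orchestrator-kube.slice"))  # /orchestrator.slice/orchestrator-kube.slice/
```

As the thread notes, this manual translation may only be needed for the cgroupfs driver; with the systemd cgroup driver the kubelet may perform the unit-to-path translation itself.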