Refactor cgroup and kubelet enforceNodeAllocatable #10714

Open · wants to merge 5 commits into master · Changes from 4 commits
77 changes: 23 additions & 54 deletions docs/operations/cgroups.md
@@ -1,73 +1,42 @@
# cgroups

To avoid resource contention between containers and host daemons in Kubernetes, the kubelet components can use cgroups to limit resource usage.
To avoid resource contention between containers and host daemons in Kubernetes,
the kubelet components can use cgroups to limit resource usage.

## Enforcing Node Allocatable
## Node Allocatable

You can use `kubelet_enforce_node_allocatable` to set node allocatable enforcement.
Node Allocatable is calculated by subtracting from the node capacity:

```yaml
# A comma separated list of levels of node allocatable enforcement to be enforced by kubelet.
kubelet_enforce_node_allocatable: "pods"
# kubelet_enforce_node_allocatable: "pods,kube-reserved"
# kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"
```

Note that to enforce kube-reserved or system-reserved, `kube_reserved_cgroups` or `system_reserved_cgroups` needs to be specified respectively.
- kube-reserved reservations
- system-reserved reservations
- hard eviction thresholds
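For instance, a rough arithmetic sketch, assuming a node with 16Gi of memory, the reservation values used later on this page, and a hypothetical 100Mi hard eviction threshold for memory.available:

```yaml
# Illustrative only, not real configuration:
#   capacity (memory):         16Gi
#   - kube-reserved:           256Mi  (kube_memory_reserved)
#   - system-reserved:         512Mi  (system_memory_reserved)
#   - hard eviction threshold: 100Mi  (assumed memory.available threshold)
#   = Node Allocatable:        ~15.15Gi available for pods
```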

Here is an example:
You can set those reservations:

```yaml
kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"

# Set kube_reserved to true to run kubelet and container-engine daemons in a dedicated cgroup.
# This is required if you want to enforce limits on the resource usage of these daemons.
# It is not required if you just want to make resource reservations (kube_memory_reserved, kube_cpu_reserved, etc.)
kube_reserved: true
kube_reserved_cgroups_for_service_slice: kube.slice
kube_reserved_cgroups: "/{{ kube_reserved_cgroups_for_service_slice }}"
kube_memory_reserved: 256Mi
kube_cpu_reserved: 100m
# kube_ephemeral_storage_reserved: 2Gi
# kube_pid_reserved: "1000"
# Reservation for master hosts
kube_master_memory_reserved: 512Mi
kube_master_cpu_reserved: 200m
# kube_master_ephemeral_storage_reserved: 2Gi
# kube_master_pid_reserved: "1000"
kube_ephemeral_storage_reserved: 2Gi
kube_pid_reserved: "1000"

# Set to true to reserve resources for system daemons
system_reserved: true
system_reserved_cgroups_for_service_slice: system.slice
system_reserved_cgroups: "/{{ system_reserved_cgroups_for_service_slice }}"
# System daemons (sshd, network manager, ...)
system_memory_reserved: 512Mi
system_cpu_reserved: 500m
# system_ephemeral_storage_reserved: 2Gi
# system_pid_reserved: "1000"
# Reservation for master hosts
system_master_memory_reserved: 256Mi
system_master_cpu_reserved: 250m
# system_master_ephemeral_storage_reserved: 2Gi
# system_master_pid_reserved: "1000"
system_ephemeral_storage_reserved: 2Gi
system_pid_reserved: "1000"
```
After the setup, the cgroups hierarchy is as follows:
By default, the kubelet will enforce Node Allocatable for pods, which means
pods will be evicted when resource usage exceeds Allocatable.
You can optionally enforce the reservations for kube-reserved and
system-reserved, but proceed with caution (see [the kubernetes
guidelines](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#general-guidelines)).
```bash
/ (Cgroups Root)
├── kubepods.slice
│ ├── ...
│ ├── kubepods-besteffort.slice
│ ├── kubepods-burstable.slice
│ └── ...
├── kube.slice
│ ├── ...
│ ├── {{container_manager}}.service
│ ├── kubelet.service
│ └── ...
├── system.slice
│ └── ...
└── ...
```
```yaml
enforce_allocatable_pods: true # default
enforce_allocatable_kube_reserved: true
enforce_allocatable_system_reserved: true
```
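As a rough sketch of the effect (assuming the defaults above), only `pods` ends up in the kubelet's `enforceNodeAllocatable` list:

```yaml
# Rendered kubelet-config.yaml excerpt (illustrative)
enforceNodeAllocatable:
  - pods          # from enforce_allocatable_pods: true
# kube-reserved / system-reserved are appended only when the matching
# enforce_allocatable_* variable is set to true
```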
You can learn more in the [official kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/).
46 changes: 17 additions & 29 deletions inventory/sample/group_vars/k8s_cluster/k8s-cluster.yml
@@ -250,47 +250,35 @@ default_kubelet_config_dir: "{{ kube_config_dir }}/dynamic_kubelet_dir"
# Download kubectl onto the host that runs Ansible in {{ bin_dir }}
# kubectl_localhost: false

# A comma separated list of levels of node allocatable enforcement to be enforced by kubelet.
# Acceptable options are 'pods', 'system-reserved', 'kube-reserved' and ''. Default is "".
# kubelet_enforce_node_allocatable: pods
## Reserving compute resources
# https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

## Set runtime and kubelet cgroups when using systemd as cgroup driver (default)
# kubelet_runtime_cgroups: "/{{ kube_service_cgroups }}/{{ container_manager }}.service"
# kubelet_kubelet_cgroups: "/{{ kube_service_cgroups }}/kubelet.service"

## Set runtime and kubelet cgroups when using cgroupfs as cgroup driver
# kubelet_runtime_cgroups_cgroupfs: "/system.slice/{{ container_manager }}.service"
# kubelet_kubelet_cgroups_cgroupfs: "/system.slice/kubelet.service"

# Whether to run kubelet and container-engine daemons in a dedicated cgroup.
# kube_reserved: false
# Optionally reserve resources for kube daemons.
## Uncomment to override default values
## The following two items need to be set when kube_reserved is true
# kube_reserved_cgroups_for_service_slice: kube.slice
# kube_reserved_cgroups: "/{{ kube_reserved_cgroups_for_service_slice }}"
# kube_memory_reserved: 256Mi
# kube_cpu_reserved: 100m
# kube_ephemeral_storage_reserved: 2Gi
# kube_pid_reserved: "1000"
# Reservation for control plane hosts
# kube_master_memory_reserved: 512Mi
# kube_master_cpu_reserved: 200m
# kube_master_ephemeral_storage_reserved: 2Gi
# kube_master_pid_reserved: "1000"

## Optionally reserve resources for OS system daemons.
# system_reserved: true
## Uncomment to override default values
## The following two items need to be set when system_reserved is true
# system_reserved_cgroups_for_service_slice: system.slice
# system_reserved_cgroups: "/{{ system_reserved_cgroups_for_service_slice }}"
# system_memory_reserved: 512Mi
# system_cpu_reserved: 500m
# system_ephemeral_storage_reserved: 2Gi
## Reservation for master hosts
# system_master_memory_reserved: 256Mi
# system_master_cpu_reserved: 250m
# system_master_ephemeral_storage_reserved: 2Gi
# system_pid_reserved: "1000"
#
# Make the kubelet enforce the resource limits of Pods with cgroups
# enforce_allocatable_pods: true

# Enforce kube_*_reserved as limits
# WARNING: this limits the resources the kubelet and the container engine can
# use which can cause instability on your nodes
# enforce_allocatable_kube_reserved: false

# Enforce system_*_reserved as limits
# WARNING: this limits the resources system daemons can use which can lock you
# out of your nodes (by OOMkilling sshd for instance)
# enforce_allocatable_system_reserved: false

## Eviction Thresholds to avoid system OOMs
# https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds
@@ -36,10 +36,8 @@ LimitMEMLOCK={{ containerd_limit_mem_lock }}
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
# Set the cgroup slice of the service so that kube reserved takes effect
{% if kube_reserved is defined and kube_reserved|bool %}
Slice={{ kube_reserved_cgroups_for_service_slice }}
{% endif %}
# Set the cgroup slice of the service to optionally enforce resource limitations
Slice={{ kube_slice }}

[Install]
WantedBy=multi-user.target
@@ -35,10 +35,8 @@ LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
# Set the cgroup slice of the service so that kube reserved takes effect
{% if kube_reserved is defined and kube_reserved|bool %}
Slice={{ kube_reserved_cgroups_for_service_slice }}
{% endif %}
# Set the cgroup slice of the service to optionally enforce resource limitations
Slice={{ kube_slice }}

[Install]
WantedBy=multi-user.target
7 changes: 2 additions & 5 deletions roles/container-engine/cri-o/tasks/main.yaml
@@ -90,19 +90,16 @@
remote_src: true
notify: Restart crio

- name: Cri-o | configure crio to use kube reserved cgroups
- name: Cri-o | configure crio to run in the kube slice
ansible.builtin.copy:
dest: /etc/systemd/system/crio.service.d/00-slice.conf
owner: root
group: root
mode: '0644'
content: |
[Service]
Slice={{ kube_reserved_cgroups_for_service_slice }}
Slice={{ kube_slice }}
notify: Restart crio
when:
- kube_reserved is defined and kube_reserved is true
- kube_reserved_cgroups_for_service_slice is defined

- name: Cri-o | update the bin dir for crio.service file
replace:
6 changes: 1 addition & 5 deletions roles/container-engine/cri-o/templates/crio.conf.j2
@@ -114,11 +114,7 @@ conmon = "{{ crio_conmon }}"
{% if crio_cgroup_manager == "cgroupfs" %}
conmon_cgroup = "pod"
{% else %}
{% if kube_reserved is defined and kube_reserved|bool %}
conmon_cgroup = "{{ kube_reserved_cgroups_for_service_slice }}"
{% else %}
conmon_cgroup = "system.slice"
{% endif %}
conmon_cgroup = "{{ kube_slice }}"
{% endif %}

# Environment variable list for the conmon process, used for passing necessary
6 changes: 2 additions & 4 deletions roles/container-engine/docker/templates/docker.service.j2
@@ -32,10 +32,8 @@ TimeoutStartSec=1min
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
# Set the cgroup slice of the service so that kube reserved takes effect
{% if kube_reserved is defined and kube_reserved|bool %}
Slice={{ kube_reserved_cgroups_for_service_slice }}
{% endif %}
# Set the cgroup slice of the service to optionally enforce resource limitations
Slice={{ kube_slice }}

[Install]
WantedBy=multi-user.target
44 changes: 23 additions & 21 deletions roles/kubernetes/node/defaults/main.yml
@@ -8,18 +8,6 @@ kubelet_bind_address: "{{ ip | default('0.0.0.0') }}"
# resolv.conf to base dns config
kube_resolv_conf: "/etc/resolv.conf"

# Set to empty to avoid cgroup creation
kubelet_enforce_node_allocatable: "\"\""

# Set runtime and kubelet cgroups when using systemd as cgroup driver (default)
kube_service_cgroups: "{% if kube_reserved %}{{ kube_reserved_cgroups_for_service_slice }}{% else %}system.slice{% endif %}"
kubelet_runtime_cgroups: "/{{ kube_service_cgroups }}/{{ container_manager }}.service"
kubelet_kubelet_cgroups: "/{{ kube_service_cgroups }}/kubelet.service"

# Set runtime and kubelet cgroups when using cgroupfs as cgroup driver
kubelet_runtime_cgroups_cgroupfs: "/system.slice/{{ container_manager }}.service"
kubelet_kubelet_cgroups_cgroupfs: "/system.slice/kubelet.service"

# Set systemd service hardening features
kubelet_systemd_hardening: false

@@ -33,24 +21,38 @@ kube_node_addresses: >-
{%- endfor -%}
kubelet_secure_addresses: "localhost link-local {{ kube_pods_subnet }} {{ kube_node_addresses }}"

# Reserve this space for kube resources
# Whether to run kubelet and container-engine daemons in a dedicated cgroup. (Not required for resource reservations).
kube_reserved: false
kube_reserved_cgroups: "/{{ kube_reserved_cgroups_for_service_slice }}"
## Reserving compute resources
# https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/

# Resource reservations for kube daemons
kube_memory_reserved: "256Mi"
kube_cpu_reserved: "100m"
kube_ephemeral_storage_reserved: "500Mi"
kube_pid_reserved: "1000"
kube_pid_reserved: 1000

# Set to true to reserve resources for system daemons
system_reserved: false
system_reserved_cgroups_for_service_slice: system.slice
system_reserved_cgroups: "/{{ system_reserved_cgroups_for_service_slice }}"
# Set slice for host system daemons (sshd, NetworkManager, ...)
# You probably don't want to change this
system_slice: system.slice

# Resource reservations for system daemons
system_memory_reserved: "512Mi"
system_cpu_reserved: "500m"
system_ephemeral_storage_reserved: "500Mi"
system_pid_reserved: 1000

# Make the kubelet enforce the resource limits of Pods with cgroups
enforce_allocatable_pods: true

# Enforce kube_*_reserved as limits
# WARNING: this limits the resources the kubelet and the container engine can
# use which can cause instability on your nodes
enforce_allocatable_kube_reserved: false

# Enforce system_*_reserved as limits
# WARNING: this limits the resources system daemons can use which can lock you
# out of your nodes (by OOMkilling sshd for instance)
enforce_allocatable_system_reserved: false

## Eviction Thresholds to avoid system OOMs
# https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#eviction-thresholds
eviction_hard: {}
6 changes: 0 additions & 6 deletions roles/kubernetes/node/tasks/facts.yml
@@ -39,12 +39,6 @@
kubelet_cgroup_driver: "{{ kubelet_cgroup_driver_detected }}"
when: kubelet_cgroup_driver is undefined

- name: Set kubelet_cgroups options when cgroupfs is used
set_fact:
kubelet_runtime_cgroups: "{{ kubelet_runtime_cgroups_cgroupfs }}"
kubelet_kubelet_cgroups: "{{ kubelet_kubelet_cgroups_cgroupfs }}"
when: kubelet_cgroup_driver == 'cgroupfs'

- name: Set kubelet_config_extra_args options when cgroupfs is used
set_fact:
kubelet_config_extra_args: "{{ kubelet_config_extra_args | combine(kubelet_config_extra_args_cgroupfs) }}"
13 changes: 6 additions & 7 deletions roles/kubernetes/node/templates/kubelet-config.v1beta1.yaml.j2
@@ -15,12 +15,13 @@ authorization:
{% else %}
mode: AlwaysAllow
{% endif %}
{% if kubelet_enforce_node_allocatable is defined and kubelet_enforce_node_allocatable != "\"\"" %}
{% set kubelet_enforce_node_allocatable_list = kubelet_enforce_node_allocatable.split(",") %}
enforceNodeAllocatable:
{% for item in kubelet_enforce_node_allocatable_list %}
{% if enforce_node_allocatable %}
{% for item in enforce_node_allocatable %}
- {{ item }}
{% endfor %}
{% else %}
- none # don't enforce anything
{% endif %}
staticPodPath: {{ kube_manifest_dir }}
cgroupDriver: {{ kubelet_cgroup_driver | default('systemd') }}
@@ -33,7 +34,7 @@ address: {{ kubelet_bind_address }}
readOnlyPort: {{ kube_read_only_port }}
healthzPort: {{ kubelet_healthz_port }}
healthzBindAddress: {{ kubelet_healthz_bind_address }}
kubeletCgroups: {{ kubelet_kubelet_cgroups }}
kubeletCgroups: {{ kube_slice_cgroup ~ 'kubelet.service' }}
clusterDomain: {{ dns_domain }}
{% if kubelet_protect_kernel_defaults | bool %}
protectKernelDefaults: true
@@ -62,9 +63,7 @@ clusterDNS:
{% endfor %}
{# Node reserved CPU/memory #}
{% for scope in "kube", "system" %}
{% if lookup('ansible.builtin.vars', scope + "_reserved") | bool %}
{{ scope }}ReservedCgroup: {{ lookup('ansible.builtin.vars', scope + '_reserved_cgroups') }}
{% endif %}
{{ scope }}ReservedCgroup: {{ lookup('ansible.builtin.vars', scope + '_slice_cgroup') }}
{{ scope }}Reserved:
{% for resource in "cpu", "memory", "ephemeral-storage", "pid" %}
{{ resource }}: "{{ lookup('ansible.builtin.vars', scope + '_' ~ (resource | replace('-', '_')) + '_reserved') }}"
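For reference, with the role defaults this loop would render roughly as follows (a sketch only; it assumes `kube_slice` resolves to `runtime.slice`, as mentioned in the review thread below, and `system_slice` to `system.slice`):

```yaml
# Illustrative rendering with the default reservations
kubeReservedCgroup: /runtime.slice/
kubeReserved:
  cpu: "100m"
  memory: "256Mi"
  ephemeral-storage: "500Mi"
  pid: "1000"
systemReservedCgroup: /system.slice/
systemReserved:
  cpu: "500m"
  memory: "512Mi"
  ephemeral-storage: "500Mi"
  pid: "1000"
```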
2 changes: 1 addition & 1 deletion roles/kubernetes/node/templates/kubelet.env.v1beta1.j2
@@ -11,7 +11,7 @@ KUBELET_HOSTNAME="--hostname-override={{ kube_override_hostname }}"
--config={{ kube_config_dir }}/kubelet-config.yaml \
--kubeconfig={{ kube_config_dir }}/kubelet.conf \
{# end kubeadm specific settings #}
--runtime-cgroups={{ kubelet_runtime_cgroups }} \
--runtime-cgroups={{ kube_slice_cgroup ~ container_manager ~ '.service' }} \
{% endset %}

KUBELET_ARGS="{{ kubelet_args_base }} {{ kubelet_custom_flags | join(' ') }}"
19 changes: 1 addition & 18 deletions roles/kubernetes/node/templates/kubelet.service.j2
@@ -14,25 +14,8 @@ Wants={{ kubelet_dependency }}
{% endfor %}

[Service]
Slice={{ kube_slice }}
EnvironmentFile=-{{ kube_config_dir }}/kubelet.env
{% if system_reserved|bool %}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpu/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuacct/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/memory/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/pids/{{ system_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/systemd/{{ system_reserved_cgroups_for_service_slice }}
{% endif %}
{% if kube_reserved|bool %}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpu/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuacct/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/cpuset/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/hugetlb/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/memory/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/pids/{{ kube_reserved_cgroups_for_service_slice }}
ExecStartPre=/bin/mkdir -p /sys/fs/cgroup/systemd/{{ kube_reserved_cgroups_for_service_slice }}
{% endif %}
ExecStart={{ bin_dir }}/kubelet \
$KUBE_LOGTOSTDERR \
$KUBE_LOG_LEVEL \
8 changes: 8 additions & 0 deletions roles/kubernetes/node/vars/main.yml
@@ -0,0 +1,8 @@
---
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
system_slice_cgroup: "/{{ system_slice.split('-') | join('.slice/') }}/"
Comment on lines +2 to +3
MrFreezeex (Member):

These seem a bit weird: kube_slice and system_slice are equal to runtime.slice and system.slice respectively, so AFAIU this would be the same as kube_slice and system_slice atm?

Contributor (Author):

Not exactly.
It would be /runtime.slice/ and /system.slice/.
This is intended to make the translation from systemd slice units to the corresponding cgroup tree in /sys/fs/cgroup/

so runtime.slice -> /sys/fs/cgroup/runtime.slice/
nested-runtime.slice -> /sys/fs/cgroup/nested.slice/runtime.slice

But I'm just seeing now that this does not work exactly that way; instead it should be nested-runtime.slice -> /sys/fs/cgroup/nested.slice/nested-runtime.slice

I'll go fix that

MrFreezeex (Member), Oct 24, 2024:

Ah yes indeed forgot the / but I was more wondering about the split/join which doesn't do anything here AFAIU (?)

Contributor (Author):

It doesn't do anything for slices immediately under the root slice (like kube.slice).

However, if a kubespray user defines `kube_slice: orchestrator-kube.slice`, the corresponding cgroup will be
/sys/fs/cgroup/orchestrator.slice/orchestrator-kube.slice/ (I had mistakenly assumed it was orchestrator.slice/kube.slice/, which the current code reflects.)

However, I need to do more research on that. It seems that the *ReservedCgroup settings may be interpreted differently depending on the cgroupDriver used (cgroupfs/systemd), and in the systemd case the translation from slice to cgroup would be done directly by the kubelet. Not completely sure though, so I've asked on slack sig-node ( https://kubernetes.slack.com/archives/C0BP8PW9G/p1729771784322639 )
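To make the mapping under discussion concrete, a small illustration (the nested-slice case is taken from the comment above and was still being verified at the time, so treat it as an assumption):

```yaml
# How the split/join expression maps a slice name to a cgroup path (illustrative)
#   kube_slice: runtime.slice            -> kube_slice_cgroup: /runtime.slice/
#   kube_slice: orchestrator-kube.slice  -> expression yields  /orchestrator.slice/kube.slice/
#                                           systemd would use  /orchestrator.slice/orchestrator-kube.slice/
kube_slice_cgroup: "/{{ kube_slice.split('-') | join('.slice/') }}/"
```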

enforce_node_allocatable_stub:
  pods: "{{ enforce_allocatable_pods }}"
  kube-reserved: "{{ enforce_allocatable_kube_reserved }}"
  system-reserved: "{{ enforce_allocatable_system_reserved }}"
enforce_node_allocatable: "{{ enforce_node_allocatable_stub | dict2items | selectattr('value') | map(attribute='key') }}"
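With the role defaults (pods enforced, the two reserved scopes not enforced), this expression evaluates to a single-element list, roughly (a sketch, not part of the diff):

```yaml
# enforce_node_allocatable_stub -> {pods: true, kube-reserved: false, system-reserved: false}
# dict2items | selectattr('value') | map(attribute='key') -> ["pods"]
enforce_node_allocatable: ["pods"]
```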