
optimize cgroups settings for node reserved #9209

Merged
merged 8 commits into kubernetes-sigs:master on Dec 30, 2022

Conversation

shelmingsong
Contributor

@shelmingsong shelmingsong commented Aug 23, 2022

What type of PR is this?

/kind feature

What this PR does / why we need it:

If the cgroups driver is systemd and kube_reserved or system_reserved is set:

  • Specify kubeReservedCgroup or systemReservedCgroup in kubelet-config.v1beta1.yaml.j2
  • Pre-create the cgroups folder before the Kubelet service starts
  • Specify the cgroups slice in the Containerd Service

The cgroups setup and hierarchy follow this article.

After the setup, the cgroups hierarchy is as follows:

/ (Cgroups Root)
├── kubepods.slice
│   ├── ...
│   ├── kubepods-besteffort.slice
│   ├── kubepods-burstable.slice
│   └── ...
├── kube.slice
│   ├── ...
│   ├── containerd.service
│   ├── kubelet.service
│   └── ...
├── system.slice
│   └── ...
└── ...
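
The resulting layout can be checked on a node with systemd-cgls; an abridged, illustrative run (assuming the slice names above) would look like:

    $ systemd-cgls --no-pager
    Control group /:
    ├─kubepods.slice
    │ ├─kubepods-besteffort.slice
    │ └─kubepods-burstable.slice
    ├─kube.slice
    │ ├─containerd.service
    │ └─kubelet.service
    └─system.slice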

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

optimize cgroups settings for node reserved (using new `kube_reserved`, see docs for more information)

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 23, 2022
@k8s-ci-robot
Contributor

Hi @shelmingsong. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 23, 2022
@cristicalin
Contributor

Hi @shelmingsong , thank you for submitting this.

Before we accept the change, could you please explain how this is an optimisation? The design document is from the 1.7 days (2018), and the kubelet cgroup driver already takes care of setting up the hierarchy. Adding extra logic to the systemd units seems counter-intuitive and prone to clash with future versions of kubernetes, so I would personally like to understand what benefit we get by adding this.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 23, 2022
@shelmingsong
Contributor Author

Hi @cristicalin , thank you for your reply.

Let me describe the reasons for these changes in detail.

  1. Specify kubeReservedCgroup or systemReservedCgroup in kubelet-config.v1beta1.yaml.j2

When enforceNodeAllocatable in kubelet-config.v1beta1.yaml contains kube-reserved or system-reserved, but kubeReservedCgroup or systemReservedCgroup is not specified, enforceNodeAllocatable will not take effect.

This can be seen in this official Kubernetes document: [enforcing-node-allocatable](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#enforcing-node-allocatable)

Optionally, kubelet can be made to enforce kube-reserved and system-reserved by specifying kube-reserved & system-reserved values in the same flag. Note that to enforce kube-reserved or system-reserved, --kube-reserved-cgroup or --system-reserved-cgroup needs to be specified respectively.
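
For illustration, a KubeletConfiguration fragment in which this enforcement can actually take effect might look like the following sketch (the reservation amounts are example values; the slice names assume the kube.slice / system.slice hierarchy above):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cgroupDriver: systemd
    enforceNodeAllocatable:
    - pods
    - kube-reserved
    - system-reserved
    kubeReserved:            # reserved for kubelet + container runtime
      cpu: 200m
      memory: 512Mi
    kubeReservedCgroup: /kube.slice
    systemReserved:          # reserved for OS daemons (sshd, systemd, ...)
      cpu: 200m
      memory: 512Mi
    systemReservedCgroup: /system.slice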

  2. Pre-create the cgroups folder before the Kubelet service starts

kubeReservedCgroup and systemReservedCgroup will not take effect if we do not create the cgroup folders required by the kubelet in advance.
This can be seen in the following two official Kubernetes documents:

Note that Kubelet does not create --kube-reserved-cgroup if it doesn't exist. Kubelet will fail if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific pattern for the name of the cgroup you define: the name should be the value you set for --kube-reserved-cgroup, with .slice appended.

Note that kubelet does not create --system-reserved-cgroup if it doesn't exist. kubelet will fail if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific pattern for the name of the cgroup you define: the name should be the value you set for --system-reserved-cgroup, with .slice appended.
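
As a rough sketch of what the pre-creation amounts to on cgroup v1 (the controller list is illustrative; the PR wires this into the kubelet service startup rather than a manual step):

    # create the reserved slice under each v1 controller before kubelet starts
    for ctrl in cpu cpuset memory systemd pids hugetlb; do
      mkdir -p "/sys/fs/cgroup/${ctrl}/kube.slice"
    done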

  3. Specify the cgroups slice in the Containerd Service

If we do not specify a cgroup slice in the containerd service, the containerd cgroup is created under system.slice by default, which is not what kube-reserved expects.
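
Concretely, systemd's Slice= directive moves the service into the desired slice; a drop-in sketch (the PR templates the directive into the unit itself, so the drop-in path here is only an example):

    # /etc/systemd/system/containerd.service.d/10-slice.conf (example path)
    [Service]
    Slice=kube.slice

After systemctl daemon-reload and a restart, containerd's cgroup lives under kube.slice instead of system.slice.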

Contributor

@cristicalin cristicalin left a comment


Thank you @shelmingsong for the explanation, it makes sense to me.

/hold cancel

Please also address the additional comments regarding documenting this feature and support for the other CRIs

@@ -23,6 +23,10 @@ kubelet_kubelet_cgroups_cgroupfs: "/system.slice/kubelet.service"
kubelet_fail_swap_on: true

# Reserve this space for kube resources
# Set to true to reserve resources for kube daemons
Contributor


Please add these variables to the sample files as well, as they serve a documentation role.

It would also be useful to add some more extensive documentation for these variables in the docs/ folder.

Contributor Author


I'm sorry, I misunderstood the yellow "Pending" decorator on my comments, so the above comment was only just sent (it was actually written many days ago).

@@ -35,6 +35,9 @@ LimitNOFILE=infinity
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
{% if kube_reserved is defined and kube_reserved|bool %}
Slice={{ kube_reserved_cgroups_for_service_slice }}
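
With kube_reserved enabled, this fragment renders roughly as follows (assuming kube_reserved_cgroups_for_service_slice keeps its default of kube.slice, per the hierarchy in the PR description):

    TasksMax=infinity
    OOMScoreAdjust=-999
    Slice=kube.slice
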
Contributor


Is there an equivalent for cri-o and docker (cri-dockerd) for these options? If not, the documentation should explicitly state that this capability is only available with containerd.

Contributor Author


I'm sorry, I misunderstood the yellow "Pending" decorator on my comments, so the above comment was only just sent (it was actually written many days ago).

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2022
@cristicalin
Contributor

Hi @shelmingsong, please rebase this PR on the latest state of the master branch to be able to pass the CI tests.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 30, 2022
@shelmingsong
Contributor Author

Hi @cristicalin, thank you for reminding me.
I rebased, and the CI tests have now passed.

@cristicalin
Contributor

Great work @shelmingsong, thank you!

/approve
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 31, 2022
kubelet_runtime_cgroups: "/systemd/system.slice"
kubelet_kubelet_cgroups: "/systemd/system.slice"
kubelet_runtime_cgroups: "{{ kube_reserved_cgroups }}/{{ container_manager }}.service"
kubelet_kubelet_cgroups: "{{ kube_reserved_cgroups }}/kubelet.service"
Member

@yankay yankay Sep 5, 2022


Hi @shelmingsong, thank you very much.

After the PR, the default kubelet_kubelet_cgroups would change from /systemd/system.slice to /kube.slice/kubelet.service, and would no longer be managed by systemd.

Is the default kubelet config changed after the PR?

Contributor Author


@yankay Sorry for taking so long to reply.

Yes, the default kubelet configuration will be changed.

Contributor Author


I'm sorry, I misunderstood the yellow "Pending" decorator on my comments, so the above comment was only just sent (it was actually written many days ago).

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2022
@shelmingsong
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 28, 2022
@floryut
Member

floryut commented Dec 29, 2022

/remove-lifecycle stale

@shelmingsong Wow sorry I guess we missed this one, I'll do my best to take the time and review it, big one 😅

@shelmingsong
Contributor Author

shelmingsong commented Dec 30, 2022 via email

Member

@floryut floryut left a comment


/lgtm
@shelmingsong Thank you for this feature, nice one 🙇

🚀

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 30, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, floryut, shelmingsong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [cristicalin,floryut]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1c4db61 into kubernetes-sigs:master Dec 30, 2022
@floryut floryut mentioned this pull request Jan 4, 2023
enneitex pushed a commit to enneitex/kubespray that referenced this pull request Jan 25, 2023
* optimize cgroups settings for node reserved

* fix

* set cgroup slice for multi container engine

* set cgroup slice for crio

* add reserved cgroups variables to sample files

* Compatible with cgroup path for different container managers

* add cgroups doc

* fix markdown
@mirex05

mirex05 commented Feb 10, 2023

Hello everyone, haven't these changes added some misconfiguration in kubelet.env?
If kube_reserved is set to false (which is the default), kubelet starts with the flag --runtime-cgroups=/kube.slice/{{ container_manager }}.service, but the runtime (for example, docker used with cri-dockerd) still points to system.slice. Maybe I'm wrong and missed something, so I just want to point this out.

HoKim98 pushed a commit to ulagbulag/kubespray that referenced this pull request Mar 8, 2023
nolimitkun pushed a commit to nolimitkun/kubespray that referenced this pull request Mar 19, 2023
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
@rptaylor
Contributor

rptaylor commented Jul 9, 2024

I think there are two separate functions at play here:

  1. Resource reservation configured by KubeReserved and SystemReserved. This is like a k8s resource request. It is very important for cluster stability, to ensure that pods do not consume all node resources, to leave enough resources for daemons running on the node. For scheduling purposes the total resources on the node are simply reduced by the kube reserved and system reserved amounts, and the remaining node allocatable amount can be used by pods.
  2. Resource enforcement configured by kubeReservedCgroup and systemReservedCgroup. This is like a k8s resource limit. Applying this can be dangerous because it would kill the critical daemons if they exceed the specified amount of resources (though that may also depend on the value of enforceNodeAllocatable?)
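
For reference, the scheduling arithmetic behind point 1 is, per the Kubernetes node-allocatable documentation:

    Allocatable = NodeCapacity - KubeReserved - SystemReserved - HardEvictionThreshold

For example, an 8Gi node with 1Gi kube-reserved, 0.5Gi system-reserved and a 100Mi hard eviction threshold leaves roughly 6.4Gi allocatable to pods.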

Note from the docs, the kubeReservedCgroup setting is optional. It is not required for reservation; it is for enforcement. You can (and should) set KubeReserved without kubeReservedCgroup.

To optionally enforce kubeReserved on kubernetes system daemons, specify the parent control group for kube daemons as the value for kubeReservedCgroup setting, and add kube-reserved to enforceNodeAllocatable
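
A reservation-only configuration, i.e. the pre-PR default behaviour described above, would then look roughly like this sketch (example values; note the absence of the cgroup settings):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    enforceNodeAllocatable:
    - pods
    kubeReserved:
      cpu: 100m
      memory: 500Mi
    systemReserved:
      cpu: 100m
      memory: 500Mi
    # no kubeReservedCgroup / systemReservedCgroup: reserve without enforcing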

The behaviour before this MR was that by default, resource reservation was applied (via KubeReserved), and enforcement was not (the optional parameter kubeReservedCgroup was not applied). This is the best/safest/most resilient default behaviour. However, after the MR, the resource reservation disappeared by default since kube_reserved is false by default, so pods could fully consume node resources, causing important node daemons to die.

If we do not specify a cgroup slice in the containerd service, the containerd cgroup is created under system.slice by default, which is not what kube-reserved expects.

I don't think that is true, because the safe standard way to reserve resources for kubernetes daemons should be to define KubeReserved resources, without specifying the optional and unnecessary kubeReservedCgroup setting. After this MR that is no longer possible, and there are no kube-reserved resources by default.

I spent many hours trying to figure this out; if I misunderstood or made an incorrect assumption, I would be happy for a clarification.

Anyway, can someone please re-open #9692 to address this? @cristicalin @yankay? Thanks!

@rptaylor
Contributor

rptaylor commented Jul 9, 2024

I have tried #11367 to fix this.
