Merge pull request #2992 from itskingori/node_resource_handling

Automatic merge from submit-queue. Adds documentation on handling node resources. At a minimum, this is meant to give more context on why the feature in #2982 was added, and it attempts to give some recommendations on what to consider when evaluating node system resources. I hope this spurs some discussion and that the recommendations I make may be assessed further. For example, in one of the links I referenced, we're advised to set `system-reserved` **only if we know what we are doing** (which I can't say I do 💯% ... 🤷♂️), and we're even warned to only set it if you really need to.
## Node Resource Handling In Kubernetes

An aspect of Kubernetes clusters that is often overlooked is the resources that
non-pod components require to run, such as:

* Operating system components, e.g. `sshd`, `udev` etc.
* Kubernetes system components, e.g. the `kubelet`, the container runtime (e.g.
  Docker), the node problem detector, `journald` etc.

As you manage your cluster, it's important to be cognisant of these components,
because if your critical non-pod components don't have enough resources, you
might end up with a very unstable cluster.

### Understanding Node Resources

Each node in a cluster has resources available to it, and pods scheduled to run
on the node may or may not have resource requests or limits set on them.
Kubernetes schedules pods on nodes whose resources satisfy the pods' specified
requirements. Broadly, pods are [bin-packed][4] onto the nodes in a best-effort
attempt to utilise as much of the available resources with as few nodes as
possible.

```
Node Capacity
---------------------------
| kube-reserved           |
|-------------------------|
| system-reserved         |
|-------------------------|
| eviction-threshold      |
|-------------------------|
|                         |
| allocatable             |
| (available for pods)    |
|                         |
|                         |
---------------------------
```

Node resources can be categorised into four (as shown above):

* `kube-reserved` – reserves resources for Kubernetes system daemons.
* `system-reserved` – reserves resources for operating system components.
* `eviction-threshold` – specifies limits that trigger evictions when available
  node resources drop below the reserved value.
* `allocatable` – the remaining node resources available for scheduling of pods
  once `kube-reserved`, `system-reserved` and `eviction-threshold` resources
  have been accounted for.
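
As an illustration, the first three categories are configured via `kubelet`
flags. This is only a sketch – the flag names (`--kube-reserved`,
`--system-reserved`, `--eviction-hard`) are real, but the values below are
hypothetical and should come from profiling your own nodes:

```
kubelet \
  --kube-reserved=cpu=100m,memory=256Mi \
  --system-reserved=cpu=100m,memory=256Mi \
  --eviction-hard=memory.available<100Mi
```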

For example, on a 30.5 GB, 4 vCPU machine with only an eviction threshold set,
via `--eviction-hard=memory.available<100Mi`, we'd get the following `Capacity`
and `Allocatable` resources:

```
$ kubectl describe node/ip-xx-xx-xx-xxx.internal
...
Capacity:
 cpu:     4
 memory:  31402412Ki
...
Allocatable:
 cpu:     4
 memory:  31300012Ki
...
```
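
As a quick sanity check, the `Allocatable` memory above is simply `Capacity`
minus the eviction threshold (the only assumption here is the unit conversion,
100Mi = 102400Ki):

```shell
# Capacity (31402412Ki) minus the 100Mi eviction threshold (100 * 1024 = 102400Ki)
echo $((31402412 - 102400))   # prints 31300012, matching Allocatable above
```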

### So, What Could Possibly Go Wrong?

The scheduler ensures that, for each resource type, the sum of the resource
requests of the scheduled pods does not surpass the node's allocatable
resources. But suppose you have a couple of applications deployed in your
cluster that constantly use far more resources than set in their resource
requests (they burst above requests but below limits during workload). You can
end up with a node whose pods are collectively attempting to take up more
resources than are available on the node!

This is particularly an issue with non-compressible resources like memory. For
example, in the aforementioned case, with an eviction threshold of only
`memory.available<100Mi` and neither `kube-reserved` nor `system-reserved`
reservations set, it is possible for a node to OOM before the `kubelet` is able
to reclaim memory (it may not observe memory pressure right away, since it
polls `cAdvisor` to collect memory usage stats at a regular interval).

All the while, keep in mind that without `kube-reserved` or `system-reserved`
reservations set (which is the case in most clusters, e.g. [GKE][5],
[Kops][6]), the scheduler doesn't account for the resources that non-pod
components require to function properly, because `Capacity` and `Allocatable`
resources are more or less equal.
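
To make the failure mode concrete, here is a minimal sketch (the node size and
pod figures are hypothetical): pods whose requests fit comfortably on a node
can still, in aggregate, carry limits that exceed what the node can actually
provide if they all burst at once:

```shell
allocatable=8192   # hypothetical node: 8Gi of allocatable memory, in Mi
requests=0
limits=0

# seven hypothetical pods, each requesting 1Gi but limited to 2Gi
for pod in 1 2 3 4 5 6 7; do
  requests=$((requests + 1024))
  limits=$((limits + 2048))
done

# the scheduler only checks requests, so this node looks fine...
echo "requests: ${requests}Mi, fits on node: $((requests <= allocatable))"
# ...but simultaneous bursts towards limits would need far more than exists
echo "limits:   ${limits}Mi, fits on node: $((limits <= allocatable))"
```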

### Where Do We Go From Here?

It's difficult to give a one-size-fits-all answer to node resource allocation.
The behaviour of your cluster depends on the resource requirements of the apps
running on it, the pod density and the cluster size. But there's a
[node performance dashboard][7] that exposes `cpu` and `memory` usage profiles
of the `kubelet` and the `docker` engine at multiple levels of pod density,
which may serve as a guide for what values would be appropriate for your
cluster.

That said, it seems fitting to recommend the following:

1. Always set requests with some breathing room – do not set requests to match
   your application's resource profile during idle time too closely.
2. Always set limits – so that your application doesn't hog all the memory on a
   node during a spike.
3. Don't set your limits for incompressible resources too high – at the end of
   the day, the Kubernetes scheduler schedules based on resource requests,
   which match what's available on the node. During a spike, your pod will
   technically try to access resources beyond what it's guaranteed to have. As
   explained before, this can be an issue if a bunch of your pods are all
   bursting at the same time.
4. Increase eviction thresholds if they are too low – while extreme utilisation
   is ideal, it might be so close to the edge that the system doesn't have
   enough time to reclaim resources via evictions if resource usage rises
   rapidly within that window.
5. Reserve resources for system components, i.e. via `kube-reserved` and
   `system-reserved`, once you've been able to profile your nodes.
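
When profiling, one place to start is comparing a node's allocatable resources
with what has already been handed out to pods. A sketch, with `<node-name>` as
a placeholder for one of your nodes:

```
# "Allocated resources" in the describe output summarises the requests and
# limits of all pods currently scheduled on the node
kubectl describe node <node-name> | grep -A 6 "Allocated resources"
```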

**Further Reading:**

* [Configure Out Of Resource Handling][2]
* [Reserve Compute Resources for System Daemons][1]
* [Managing Compute Resources for Containers][3]
* [Visualize Kubelet Performance with Node Dashboard][8]

[1]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
[2]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
[3]: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
[4]: https://en.wikipedia.org/wiki/Bin_packing_problem
[5]: https://cloud.google.com/container-engine/
[6]: https://github.com/kubernetes/kops
[7]: http://node-perf-dash.k8s.io/#/builds
[8]: http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html