Automatically taint and evict nodes with high CPU overload based on chosen Metric PSI or Avarage Load. Derived from kubernetes-loadwatcher.
This actually started as a small extension of kubernetes-pressurecooker just to do the job.But there were popping more and more things that we needed. It became multicooker once we tried to move with it to GKE and we hit a wall with it. Because google have quite a bit of different kernels on their machines. Some of them has PSI others doesnt.
The load average describes the average length of the run queue whenever a scheduling decision is made. But it does not tell us how often processes were waiting for CPU time. The kernel pressure metrics (psi by facebook) describes how often there was not enough CPU available.
There some big clound providers that doesnt support PSI metrics out of the box. I'm looking at you Google
That is why there is a flag -use-avarage
to choose load avarage metrics.
A kubernetes node can be overcommited on CPU: there might be more processes that want more CPU than requested. This can easily happen due to variable resource usage per pod, variance in hardware or variance in pod distributions. By default, Kubernetes will not evict Pods from a node based on CPU usage, since CPU is considered a compressible resource. However if a node does not have enough CPU resources to handle all pods it will impose additional latencies that can be undesirable based on the workload (e.g. web/interactive traffic).
This project contains a small Kubernetes controller that watches each node's CPU pressure; when a certain threshold is exceeded, the node will be tainted (so that no additional workloads are scheduled on an already-overloaded node) and finally the controller will start to evict Pods from the node.
Pressure is more sensitive for small overloads, e.g. with pressure information it is easy to express "there is an up to 20% chance to not get CPU instantly when needed".
This controller can be started with two threshold flags: -taint-threshold
and -evict-threshold
. There are also safeguard flags -min-pod-age
and -eviction-backoff
.
There are also few configuration flags
Use-use-avarage
to choose load avarage metrics instead of PSI.
Use -target-metric
to choose the metric that will be a good fit to use for the threshold.
Possible values are 1,2,3 based on the metric type. For LoadAva this will be [Load1, Load5,Load15] respectively for PSI it will be [Avg10,Avg60,Avg300]
The controller will continuously monitor a node's CPU pressure.
-
If the target-metric exceeds the taint threshold, the node will be tainted with a
multicooker/load-exceeded
taint with thePreferNoSchedule
effect. This will instruct Kubernetes to not schedule any additional workloads on this node if at all possible. -
Once node is tainted target metric is moved to the first one so the controller will be more reactive.
-
If the ALL the metrics falls back below the taint threshold, the taint will be removed again.
-
If the the FIRST Metric (Load1 Avg10) exceeds the eviction threshold, the controller will pick a suitable Pod running on the node and evict it. However, the following types of Pods will not be evicted:
- Pods with the
Guaranteed
QoS class - Pods belonging to Stateful Sets
- Pods belonging to Daemon Sets
- Standalone pods not managed by any kind of controller
- Pods running in the
kube-system
namespace or with a criticalpriorityClassName
- Pods newer than min-pod-age
- Pods with the
After a Pod was evicted, the next Pod will be evicted after a configurable eviction backoff (controllable using the evict-backoff
argument) if the FIRST Metric (Load1 Avg10) is still above the eviction threshold.
Older pods will be evicted first. The ration to remove old pods first is tat it is usually better to move well behaving pods away from bad neighbors than moving bad neighbors through the cluster. And as a node will always stay in a healthy state it can be assumed that the older pods are less likely to be the cause of an overload.
There is a helm chart in the repo. To install from repo folder:
helm upgrade --install --namespace kube-system kubernetes-multicooker chart/
- Create tests
- Fix prometheus metrics to be per node- release 1.0.2