---
layout: blog
title: "Introducing SIG etcd"
slug: introducing-sig-etcd
date: 2023-11-07
canonicalUrl: https://etcd.io/blog/2023/introducing-sig-etcd/
---

**Authors**: Han Kang (Google), Marek Siarkowicz (Google), Frederico Muñoz (SAS Institute)

Special Interest Groups (SIGs) are a fundamental part of the Kubernetes project, with a substantial share of the community activity happening within them. When the need arises, [new SIGs can be created](https://github.com/kubernetes/community/blob/master/sig-wg-lifecycle.md), and that was precisely what happened recently.

[SIG etcd](https://github.com/kubernetes/community/blob/master/sig-etcd/README.md) is the most recent addition to the list of Kubernetes SIGs. In this article we will get to know it a bit better and understand its origins, scope, and plans.

## The critical role of etcd

If we look inside the control plane of a Kubernetes cluster, we will find [etcd](https://kubernetes.io/docs/concepts/overview/components/#etcd), a consistent and highly available key-value store used as Kubernetes' backing store for all cluster data. This description alone highlights the critical role that etcd plays and its importance within the Kubernetes ecosystem.

This critical role makes the health of the etcd project and community an important consideration, and [concerns about the state of the project](https://groups.google.com/a/kubernetes.io/g/steering/c/e-O-tVSCJOk/m/N9IkiWLEAgAJ) in early 2022 did not go unnoticed. The changes in the maintainer team, amongst other factors, contributed to a situation that needed to be addressed.

## Why a special interest group

With the critical role of etcd in mind, it was proposed that the way forward would be to create a new special interest group. Since etcd is already at the heart of Kubernetes, creating a dedicated SIG not only recognises that role but also makes etcd a first-class citizen of the Kubernetes community.

Establishing SIG etcd creates a dedicated space to make explicit the contract between etcd and Kubernetes API machinery and to prevent, at the etcd level, changes that violate this contract. Additionally, etcd will be able to adopt the processes that Kubernetes offers its SIGs ([KEPs](https://www.kubernetes.dev/resources/keps/), [PRR](https://github.com/kubernetes/community/blob/master/sig-architecture/production-readiness.md), [phased feature gates](https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/), amongst others) in order to improve the consistency and reliability of the codebase. Being able to use these processes will be a substantial benefit to the etcd community.

As a SIG, etcd will also be able to draw contributor support from Kubernetes proper: active contributions to etcd from Kubernetes maintainers would decrease the likelihood of breaking Kubernetes changes, thanks to the increased number of potential reviewers and the integration with the existing testing framework. This will benefit not only Kubernetes, which will be able to better participate in and shape the direction of etcd given the critical role it plays, but also etcd as a whole.

## About SIG etcd

The recently created SIG is already working towards its goals, defined in its [Charter](https://github.com/kubernetes/community/blob/master/sig-etcd/charter.md) and [Vision](https://github.com/kubernetes/community/blob/master/sig-etcd/vision.md). The purpose is clear: to ensure etcd is a reliable, simple, and scalable production-ready store for building cloud-native distributed systems and managing cloud-native infrastructure via orchestrators like Kubernetes.

The scope of SIG etcd is not limited to etcd as a Kubernetes component; it also covers etcd as a standard solution. Our goal is to make etcd the most reliable key-value store to be used anywhere, unconstrained by any Kubernetes-specific limits and scaling to meet the requirements of many diverse use cases.

We are confident that the creation of SIG etcd constitutes an important milestone in the lifecycle of the project, improving both etcd itself and the integration of etcd with Kubernetes. We invite everyone interested in etcd to [visit our page](https://github.com/kubernetes/community/blob/master/sig-etcd/README.md), [join us at our Slack channel](https://kubernetes.slack.com/messages/etcd), and get involved in this new stage of etcd's life.
---
layout: blog
title: "The Case for Kubernetes Resource Limits: Predictability vs. Efficiency"
date: 2023-11-16
slug: the-case-for-kubernetes-resource-limits
---

**Author:** Milan Plžík (Grafana Labs)

There have been quite a few posts suggesting that not using Kubernetes resource limits might be a fairly useful thing (for example, [For the Love of God, Stop Using CPU Limits on Kubernetes](https://home.robusta.dev/blog/stop-using-cpu-limits/) or [Kubernetes: Make your services faster by removing CPU limits](https://erickhun.com/posts/kubernetes-faster-services-no-cpu-limits/)). The points made there are totally valid – it doesn’t make much sense to pay for compute power that will not be used due to limits, nor to artificially increase latency. This post strives to argue that limits have their legitimate use as well.

As a Site Reliability Engineer on the [Grafana Labs](https://grafana.com/) platform team, which maintains and improves internal infrastructure and tooling used by the product teams, I primarily try to make Kubernetes upgrades as smooth as possible. But I also spend a lot of time going down the rabbit hole of various interesting Kubernetes issues. This article reflects my personal opinion, and others in the community may disagree.

Let’s flip the problem upside down. Every pod in a Kubernetes cluster has inherent resource limits – the actual CPU, memory, and other resources of the machine it’s running on. If those physical limits are reached by a pod, it will experience throttling similar to what is caused by reaching Kubernetes limits.

## The problem
Pods without (or with generous) limits can easily consume the extra resources on the node. This, however, has a hidden cost – the amount of extra resources available often heavily depends on the pods scheduled on the particular node and their actual load. These extra resources make each pod a special snowflake when it comes to real resource allocation. Even worse, it’s fairly hard to figure out the resources that the pod had at its disposal at any given moment – certainly not without unwieldy data mining of the pods running on a particular node, their resource consumption, and so on. And finally, even if we pass this obstacle, we can only have data sampled up to a certain rate and get profiles only for a certain fraction of our calls. This can be scaled up, but the amount of observability data generated might easily reach diminishing returns. Thus, there’s no easy way to tell if a pod had a quick spike and, for a short period of time, used twice as much memory as usual to handle a request burst.

Now, with Black Friday and Cyber Monday approaching, businesses expect a surge in traffic. Good data and benchmarks about past performance allow businesses to plan for some extra capacity. But is data about pods without limits reliable? With instant memory or CPU spikes handled by the extra resources, everything might look good according to past data. But once the pod bin-packing changes and the extra resources get more scarce, everything might start looking different – ranging from request latencies rising negligibly to requests slowly snowballing and causing pod OOM kills. While almost no one actually cares about the former, the latter is a serious issue that requires an instant capacity increase.

## Configuring the limits
Not using limits involves a tradeoff – it opportunistically improves performance if there are extra resources available, but lowers performance predictability, which might strike back in the future. There are a few approaches that can be used to increase the predictability again. Let’s pick two of them to analyze (both are illustrated in the example manifest below):

- **Configure workload limits to be a fixed (and small) percentage more than the requests** – I'll call it _fixed-fraction headroom_. This allows the use of some extra shared resources, but keeps the per-node overcommit bounded and can guide worst-case estimates for the workload. Note that the bigger the headroom percentage, the bigger the performance variance across workloads.
- **Configure workloads with `requests` = `limits`**. From some point of view, this is equivalent to giving each pod its own tiny machine with constrained resources; the performance is fairly predictable. This also puts the pod into the _Guaranteed_ QoS class, which makes it get evicted only after _BestEffort_ and _Burstable_ pods have been evicted by a node under resource pressure (see [Quality of Service for Pods](/docs/concepts/workloads/pods/pod-qos/)).
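To make the two approaches more concrete, here is a minimal sketch of what they could look like in a pod spec. The pod names, image, and resource numbers are hypothetical, and the 20% headroom in the first example is just an illustrative value, not a recommendation.

```yaml
# Illustrative sketch only – names, image, and numbers are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: headroom-example            # fixed-fraction headroom: limits = requests + 20%
spec:
  containers:
  - name: app
    image: example.com/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: 1200m                  # 1 CPU plus 20% headroom
        memory: 1229Mi              # ~1Gi plus 20% headroom
---
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example          # requests = limits, Guaranteed QoS class
spec:
  containers:
  - name: app
    image: example.com/app:latest   # hypothetical image
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi
```

The first pod keeps per-node overcommit bounded by the chosen fraction while still allowing short bursts; the second gives the scheduler a fully predictable footprint and places the pod in the _Guaranteed_ QoS class.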

Some other cases might also be considered, but these are probably the two simplest ones to discuss.


## Cluster resource economy
Note that in both cases discussed above, we’re effectively preventing the workloads from using some of the cluster resources they would otherwise have, at the cost of getting more predictability – which might sound like a steep price to pay for a bit more stable performance. Let’s try to quantify the impact.

### Bin-packing and cluster resource allocation
Firstly, let’s discuss bin-packing and cluster resource allocation. There’s some inherent cluster inefficiency that comes into play – it’s hard to achieve 100% resource allocation in a Kubernetes cluster. Thus, some percentage will be left unallocated.

When configuring fixed-fraction headroom limits, a proportional amount of this unallocated capacity will be available to the pods. If the percentage of unallocated resources in the cluster is lower than the constant we use for setting fixed-fraction headroom limits (see the figure, line 2), all the pods together can theoretically use up all of the node’s resources; otherwise some resources will inevitably be wasted (see the figure, line 1). To avoid this waste, the percentage for fixed-fraction headroom limits should be configured so that it’s at least equal to the expected percentage of unallocated resources.
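As a rough, made-up illustration: if the pods on a 100-core node request 90 cores in total (so 10% of the node is unallocated), a 15% fixed-fraction headroom puts the summed limits at 103.5 cores and the pods can collectively burst into the whole node, whereas a 5% headroom caps the summed limits at 94.5 cores and the remaining 5.5 cores can never be used by anyone.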

{{<figure alt="Chart displaying various requests/limits configurations" width="40%" src="requests-limits-configurations.svg">}}

For requests = limits (see the figure, line 3), this does not hold: unless we’re able to allocate all of a node’s resources, some resources will inevitably be wasted. Without any knobs to turn on the requests/limits side, the only suitable approach here is to ensure efficient bin-packing on the nodes by configuring correct machine profiles. This can be done either manually or by using a variety of cloud service provider tooling – for example [Karpenter](https://karpenter.sh/) for EKS or [GKE Node auto provisioning](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning).

### Optimizing actual resource utilization
Free resources also come in the form of unused resources of other pods (reserved vs. actual CPU utilization, etc.), and their availability can’t be predicted in any reasonable way. Configuring limits makes it next to impossible to utilize these. Looking at this from a different perspective, if a workload wastes a significant amount of the resources it has requested, revisiting its resource requests might be a fair thing to do. Looking at past data and picking more fitting resource requests might help to make the packing tighter (although at the price of worsening its performance – for example increasing long-tail latencies).

## Conclusion
Optimizing resource requests and limits is hard. Although it’s much easier to break things when setting limits, those breakages might help prevent a catastrophe later by giving more insight into how the workload behaves under borderline conditions. There are cases where setting limits makes less sense: batch workloads (which are not latency-sensitive – for example non-live video encoding), best-effort services (which don’t need that level of availability and can be preempted), and clusters that have a lot of spare resources by design (various cases of specialty workloads – for example services that handle spikes by design).

On the other hand, setting limits shouldn’t be avoided at all costs – even though figuring out the “right” value for limits is harder and configuring a wrong value yields less forgiving situations. Configuring limits helps you learn about a workload’s behavior in corner cases, and there are simple strategies that can help when reasoning about the right value. It’s a tradeoff between efficient resource usage and performance predictability and should be considered as such.

There’s also an economic aspect of workloads with spiky resource usage. Having “freebie” resources always at hand does not give the product team an incentive to improve performance. Big enough spikes might easily trigger efficiency issues or even problems when trying to defend a product’s SLA – and thus might be a good candidate to mention when assessing any risks.