Start controller scalability
timebertt committed Dec 16, 2023
1 parent 211421e commit 4111e00
Showing 3 changed files with 80 additions and 15 deletions.
20 changes: 12 additions & 8 deletions content/10-motivation.md
@@ -1,12 +1,6 @@
# Motivation

- clarify relevance, need in community
- issues in many projects asking for sharding
- istio: <https://github.com/istio/istio/issues/22208>
- velero: <https://github.com/vmware-tanzu/velero/issues/487>
- controller-runtime: <https://github.com/kubernetes-sigs/controller-runtime/issues/2576>
- Operator SDK: <https://github.com/operator-framework/operator-sdk/issues/1540>
- Metacontroller: <https://github.com/GoogleCloudPlatform/metacontroller/issues/190>
- large-scale Kubernetes-based and controller-based deployments
- core Kubernetes components scale well
- sig-scalability cares about scalability of core components, but core components only [@k8scommunity]
@@ -17,5 +11,15 @@
- <https://twitter.com/ibuildthecloud/status/1717369625904848945>
- custom controllers/operators typically perform heavier reconciliation processes than core controllers [@kubevela]
- some projects with large-scale deployments have already implemented sharding on their own
- highly specific to individual projects, cannot be reused
- there is no common design or implementation that can be applied to any arbitrary controller
- highly specific to individual projects
- cannot be reused
- all implementations face similar challenges
- typically, not fully matured yet
- sharding is asked for/considered in many projects
- istio: <https://github.com/istio/istio/issues/22208>
- velero: <https://github.com/vmware-tanzu/velero/issues/487>
- controller-runtime: <https://github.com/kubernetes-sigs/controller-runtime/issues/2576>
- Operator SDK: <https://github.com/operator-framework/operator-sdk/issues/1540>
- Metacontroller: <https://github.com/GoogleCloudPlatform/metacontroller/issues/190>
- there is no common design or implementation that can be applied to arbitrary controllers
- reusable concept and implementation would benefit the controller ecosystem
38 changes: 31 additions & 7 deletions content/20-fundamentals.md
@@ -254,10 +254,29 @@ However, it also restricts reconciliations of all objects to be performed by a s

## Scalability of Controllers

Scalability describes the ability of a system to handle increased load with adequate performance given that more resources are added to the system [@herbst2013elasticity; @bondi2000characteristics].
Quantifying the load of a system reveals different dimensions depending on the system in question.
A commonly accepted approach for measuring the scalability of a system is to determine the scale at which it can operate without faults or degraded performance [@duboc2007framework].

The basis for evaluating the scalability of a system is to define the central performance indicators directly related to user experience.
In the context of reliability engineering, these are referred to as service level indicators (SLIs) and must be measurable in a running system [@beyer2016site].
Next, target values (service level objectives) for the chosen performance indicators must be defined.
As long as the measured performance meets the desired targets, the system can be considered to be performing adequately and without faults.
Based on that, experiments can determine the amount of load under which the system can still operate while fulfilling the objectives [@jogalekar2000evaluating; @sanders201578].
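
The procedure described in this paragraph (choose an SLI, set an SLO target, compare measurements against it) can be illustrated with a minimal sketch. It is not part of the commit: the helper names `percentile` and `meets_slo`, the nearest-rank percentile method, and the sample latencies are assumptions made for the example.

```python
import math


def percentile(samples, q):
    """Return the q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def meets_slo(samples, q, target):
    """The SLI is the q-th percentile of samples; the SLO holds if it is below target."""
    return percentile(samples, q) < target


# Hypothetical latency measurements in seconds (illustrative values only).
latencies = [0.2, 0.3, 0.25, 0.4, 0.9, 0.35, 0.3, 0.28, 0.5, 1.4]

print(meets_slo(latencies, 50, 1.0))  # median against a 1s target
print(meets_slo(latencies, 99, 1.0))  # p99 against a 1s target
```

In this sample, the median stays well below the target while a single slow measurement makes the p99 exceed it. This is one reason to formulate objectives on high percentiles rather than on averages.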

- to measure scalability of a system, one must measure performance of the system at a given scale

- no commonly accepted definition for scalability of controllers
- but there is a scalability definition for Kubernetes as a whole
- definition for scalability of controllers can be derived from it
- sig-scalability definition for Kubernetes scalability: <https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#how-we-define-scalability>
- load/scale of Kubernetes cluster has many dimensions (hard to test in every dimension)
- tests are performed to measure whether a cluster fulfills certain SLOs when load dimensions are kept within targeted thresholds
- guarantees certain SLOs if cluster is within thresholds
- if SLOs are not met while load is kept within thresholds, the scalability goals are not met
- if thresholds can be increased while still meeting SLOs, the system has greater scalability
- always measures/ensures scalability of a concrete setup
- e.g., based on the size of control plane machines, how far can the system be scaled without decreasing user experience
- thresholds: <https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md>
- mostly related to number of objects
- some related to query/change rates
@@ -273,9 +292,14 @@ However, it also restricts reconciliations of all objects to be performed by a s
- API-related latencies: watch, admission, webhook
- sig-scalability tests
- see <https://github.com/kubernetes/community/blob/master/contributors/devel/README.md#sig-scalability>
- define how scalability of controllers can be measured

- based on this definition, we define how scalability of controllers can be measured
- environment requirements/prerequisites
- reasonable API server latency
- measure performance of a concrete setup at a given scale
- important characteristics of the setup need to be captured: size of the controller
- resource limits, network bandwidth, actual usage thereof
- number of worker routines
- thresholds
- number of objects
- object churn: creation/update rate (reconciliation rate ~ "throughput")
@@ -284,15 +308,15 @@ However, it also restricts reconciliations of all objects to be performed by a s
- queue time: p99 < 1s
- webhook call latency (if controller has webhooks)
- same as for Kubernetes itself: if thresholds can be increased while keeping SLOs, greater scalability
- factors that influence scalability
- resource limits
- network bandwidth
- how to consider actual usage (better measurement for size/cost of controllers)
- number of worker routines
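
The rule noted above (greater scalability if thresholds can be increased while keeping SLOs) can be read as a search over measured load levels. The following sketch only illustrates that idea: the function `max_supported_load`, the two setups, and all numbers are hypothetical, and the 1s target merely mirrors the queue time objective listed in the notes.

```python
def max_supported_load(measurements, slo_target):
    """Return the highest load level up to which every measured SLI is below slo_target.

    measurements maps a load level (e.g., number of objects) to the measured
    SLI (e.g., p99 queue time in seconds). Returns None if even the smallest
    load level violates the SLO.
    """
    supported = None
    for load in sorted(measurements):
        if measurements[load] >= slo_target:
            break  # SLO violated here, so higher load levels are not counted
        supported = load
    return supported


# Hypothetical results for two setups (illustrative numbers, not measured data).
setup_a = {1_000: 0.1, 10_000: 0.4, 50_000: 1.3, 100_000: 2.0}
setup_b = {1_000: 0.1, 10_000: 0.2, 50_000: 0.7, 100_000: 1.6}

print(max_supported_load(setup_a, 1.0))
print(max_supported_load(setup_b, 1.0))
```

Under these assumed numbers, the second setup keeps the objective at a higher object count than the first and would therefore be considered the more scalable of the two.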

## Scalability Limitations

- core mechanisms of controllers cause the heavy resource usage
- one can increase the limits of the setup (e.g., higher limits, more worker routines) to increase performance at scale
- this is vertical scaling
- cannot be scaled infinitely
- scaling vertically to extremes can hit other limitations, e.g., machine size, network bandwidth

- which core mechanisms of controllers cause the heavy resource usage
- watch events: CPU for decoding, network transfer
- watch cache: memory
- no horizontal scalability, no distribution of work, no active-active setups
37 changes: 37 additions & 0 deletions content/bibliography.bib
@@ -125,6 +125,43 @@ @article{jogalekar2000evaluating
publisher = {IEEE}
}

@inproceedings{duboc2007framework,
title = {A framework for characterization and analysis of software system scalability},
author = {Duboc, Leticia and Rosenblum, David and Wicks, Tony},
booktitle = {Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering},
pages = {375--384},
year = {2007}
}

@inproceedings{herbst2013elasticity,
title = {Elasticity in cloud computing: What it is, and what it is not},
author = {Herbst, Nikolas Roman and Kounev, Samuel and Reussner, Ralf},
booktitle = {10th international conference on autonomic computing (ICAC 13)},
pages = {23--27},
year = {2013}
}

@article{sanders201578,
title = {CloudStore – Towards Scalability Benchmarking in Cloud Computing},
journal = {Procedia Computer Science},
volume = {68},
pages = {78--88},
year = {2015},
note = {1st International Conference on Cloud Forward: From Distributed to Complete Computing},
issn = {1877-0509},
doi = {10.1016/j.procs.2015.09.225},
url = {https://www.sciencedirect.com/science/article/pii/S1877050915030707},
author = {Richard Sanders and Gunnar Brataas and Mariano Cecowski and Kjetil Haslum and Simon Ivanšek and Jure Polutnik and Brynjar Viken},
keywords = {Cloud computing, Measurements, Scalability, Performance, Capacity, Elasticity, Efficiency, Amazon Web Services, AMS, TCP-W}
}

@book{beyer2016site,
title = {Site reliability engineering: How Google runs production systems},
author = {Beyer, Betsy and Jones, Chris and Petoff, Jennifer and Murphy, Niall Richard},
year = {2016},
publisher = {O'Reilly Media, Inc.}
}

@inproceedings{soltesz2007container,
title = {Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors},
author = {Soltesz, Stephen and P{\"o}tzl, Herbert and Fiuczynski, Marc E and Bavier, Andy and Peterson, Larry},