Start controller scalability

timebertt · Dec 16, 2023 · 4111e00
1 parent 211421e
commit 4111e00
Showing 3 changed files with 80 additions and 15 deletions.
diff --git a/content/10-motivation.md b/content/10-motivation.md
@@ -1,12 +1,6 @@
 # Motivation
 
 - clarify relevance, need in community
-  - issues in many projects asking for sharding
-    - istio: <https://github.com/istio/istio/issues/22208>
-    - velero: <https://github.com/vmware-tanzu/velero/issues/487>
-    - controller-runtime: <https://github.com/kubernetes-sigs/controller-runtime/issues/2576>
-    - Operator SDK: <https://github.com/operator-framework/operator-sdk/issues/1540>
-    - Metacontroller: <https://github.com/GoogleCloudPlatform/metacontroller/issues/190>
 - large-scale Kubernetes-based and controller-based deployments
 - core Kubernetes components scale well
   - sig-scalability cares about scalability of core components, but core components only [@k8scommunity]
@@ -17,5 +11,15 @@
   - <https://twitter.com/ibuildthecloud/status/1717369625904848945>
 - custom controllers/operators typically facilitate heavier reconciliation processes compared to core controllers [@kubevela]
 - some projects with large-scale deployments have already implemented sharding on their own
-- highly specific to individual projects, cannot be reused
-- there is no common design or implementation, that can be applied to any arbitrary controller
+  - highly specific to individual projects
+  - cannot be reused
+  - all implementations face similar challenges
+  - typically, not fully matured yet
+- sharding is asked for/considered in many projects
+  - istio: <https://github.com/istio/istio/issues/22208>
+  - velero: <https://github.com/vmware-tanzu/velero/issues/487>
+  - controller-runtime: <https://github.com/kubernetes-sigs/controller-runtime/issues/2576>
+  - Operator SDK: <https://github.com/operator-framework/operator-sdk/issues/1540>
+  - Metacontroller: <https://github.com/GoogleCloudPlatform/metacontroller/issues/190>
+- there is no common design or implementation that can be applied to arbitrary controllers
+- reusable concept and implementation would benefit the controller ecosystem
diff --git a/content/20-fundamentals.md b/content/20-fundamentals.md
@@ -254,10 +254,29 @@ However, it also restricts reconciliations of all objects to be performed by a s
 
 ## Scalability of Controllers
 
+Scalability describes the ability of a system to handle increased load with adequate performance given that more resources are added to the system [@herbst2013elasticity; @bondi2000characteristics].
+Quantifying the load of a system reveals different dimensions depending on the system in question.
+A commonly accepted approach for measuring the scalability of a system is to evaluate at which scale the system can operate without faults or decreased performance. [@duboc2007framework]
+
+The basis for evaluating the scalability of a system is to define the central performance indicators directly related to user experience.
+In the context of reliability engineering, these are referred to as service level indicators (SLIs) and must be measurable in a running system [@beyer2016site].
+Next, target values (service level objectives) for the chosen performance indicators must be defined.
+As long as the measured performance meets the desired targets, the system can be considered to be performing adequately and without faults.
+Based on that, experimentation can be performed to test under which amount of load the system can operate while fulfilling the objectives. [@jogalekar2000evaluating; @sanders201578]
+
+- to measure scalability of a system, one must measure performance of the system at a given scale
+
+- no commonly accepted definition for scalability of controllers
+- but there is a scalability definition for Kubernetes as a whole
+- definition for scalability of controllers can be derived from it
 - sig-scalability definition for Kubernetes scalability: <https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md#how-we-define-scalability>
+  - load/scale of Kubernetes cluster has many dimensions (hard to test in every dimension)
+  - tests are performed to measure whether a cluster fulfills certain SLOs when load dimensions are kept within targeted thresholds
   - guarantees certain SLOs if cluster is within thresholds
   - if SLOs are not met while keeping thresholds, means scalability goals are not met
   - if thresholds can be increased while keeping SLOs, means greater scalability of the system
+  - always measures/ensures scalability of a concrete setup
+    - e.g., based on the size of control plane machines, how far can the system be scaled without decreasing user experience
   - thresholds: <https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md>
     - mostly related to number of objects
     - some related to query/change rates
@@ -273,9 +292,14 @@ However, it also restricts reconciliations of all objects to be performed by a s
     - API-related latencies: watch, admission, webhook
   - sig-scalability tests
   - see <https://github.com/kubernetes/community/blob/master/contributors/devel/README.md#sig-scalability>
-- define how scalability of controllers can be measured
+
+- based on this definition, we define how scalability of controllers can be measured
   - environment requirements/prerequisites
     - reasonable API server latency
+  - measure performance of a concrete setup at a given scale
+    - important characteristics of the setup need to be captured: size of the controller
+    - resource limits, network bandwidth, actual usage thereof
+    - number of worker routines
   - thresholds
     - number of objects
     - object churn: creation/update rate (reconciliation rate ~ "throughput")
@@ -284,15 +308,15 @@ However, it also restricts reconciliations of all objects to be performed by a s
     - queue time: p99 < 1s
     - webhook call latency (if controller has webhooks)
   - same as for Kubernetes itself: if thresholds can be increased while keeping SLOs, greater scalability
-- factors that influence scalability
-  - resource limits
-  - network bandwidth
-  - how to consider actual usage (better measurement for size/cost of controllers)
-  - number of worker routines
 
 ## Scalability Limitations
 
-- core mechanisms of controllers cause the heavy resource usage
+- one can increase the resources of the setup (e.g., higher resource limits, more worker routines) to increase performance at scale
+- this is vertical scaling
+- cannot be scaled infinitely
+- scaling vertically to extremes can hit other limitations, e.g., machine size, network bandwidth
+
+- which core mechanisms of controllers cause the heavy resource usage
   - watch events: CPU for decoding, network transfer
   - watch cache: memory
 - no horizontal scalability, no distribution of work, no active-active setups
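
Note on the measurement approach in `content/20-fundamentals.md` above: the idea "if thresholds can be increased while keeping SLOs, greater scalability" could be evaluated roughly as in the following minimal sketch. It assumes queue-time samples (in seconds) have already been collected per load level. The function names `percentile`, `meets_slo`, and `max_supported_scale` are hypothetical, and the nearest-rank percentile method is an assumption. Only the `p99 < 1s` queue-time target comes from the notes.

```python
from typing import Dict, List

# SLO taken from the notes: "queue time: p99 < 1s".
QUEUE_TIME_P99_SLO_SECONDS = 1.0


def percentile(samples: List[float], q: int) -> float:
    """Return the q-th percentile (1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 1 <= q <= 100:
        raise ValueError("q must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100 gives the 1-based rank.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(queue_times: List[float]) -> bool:
    """Check the queue-time SLI of one experiment run against the SLO."""
    return percentile(queue_times, 99) < QUEUE_TIME_P99_SLO_SECONDS


def max_supported_scale(runs: Dict[int, List[float]]) -> int:
    """Return the largest object count up to which every run meets the SLO.

    `runs` maps a load threshold (number of objects) to the queue times
    measured at that scale. Returns 0 if even the smallest run fails.
    """
    supported = 0
    for object_count in sorted(runs):
        if not meets_slo(runs[object_count]):
            break
        supported = object_count
    return supported
```

Running the same experiment against two setups (e.g., before and after raising resource limits or the number of worker routines) and comparing the returned values would then indicate which setup scales further, always relative to the concrete setup that was measured.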

diff --git a/content/bibliography.bib b/content/bibliography.bib
@@ -125,6 +125,43 @@ @article{jogalekar2000evaluating
   publisher = {IEEE}
 }
 
+@inproceedings{duboc2007framework,
+  title     = {A framework for characterization and analysis of software system scalability},
+  author    = {Duboc, Leticia and Rosenblum, David and Wicks, Tony},
+  booktitle = {Proceedings of the 6th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering},
+  pages     = {375--384},
+  year      = {2007}
+}
+
+@inproceedings{herbst2013elasticity,
+  title     = {Elasticity in cloud computing: What it is, and what it is not},
+  author    = {Herbst, Nikolas Roman and Kounev, Samuel and Reussner, Ralf},
+  booktitle = {10th international conference on autonomic computing (ICAC 13)},
+  pages     = {23--27},
+  year      = {2013}
+}
+
+@article{sanders201578,
+  title    = {CloudStore – Towards Scalability Benchmarking in Cloud Computing},
+  journal  = {Procedia Computer Science},
+  volume   = {68},
+  pages    = {78--88},
+  year     = {2015},
+  note     = {1st International Conference on Cloud Forward: From Distributed to Complete Computing},
+  issn     = {1877-0509},
+  doi      = {10.1016/j.procs.2015.09.225},
+  url      = {https://www.sciencedirect.com/science/article/pii/S1877050915030707},
+  author   = {Richard Sanders and Gunnar Brataas and Mariano Cecowski and Kjetil Haslum and Simon Ivanšek and Jure Polutnik and Brynjar Viken},
+  keywords = {Cloud computing, Measurements, Scalability, Performance, Capacity, Elasticity, Efficiency, Amazon Web Services, AMS, TCP-W}
+}
+
+@book{beyer2016site,
+  title     = {Site reliability engineering: How Google runs production systems},
+  author    = {Beyer, Betsy and Jones, Chris and Petoff, Jennifer and Murphy, Niall Richard},
+  year      = {2016},
+  publisher = {O'Reilly Media, Inc.}
+}
+
 @inproceedings{soltesz2007container,
   title     = {Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors},
   author    = {Soltesz, Stephen and P{\"o}tzl, Herbert and Fiuczynski, Marc E and Bavier, Andy and Peterson, Larry},