Write study project
timebertt committed Jan 2, 2024
1 parent e6b3fe2 commit 2835e63
Showing 6 changed files with 76 additions and 19 deletions.
1 change: 1 addition & 0 deletions .vscode/cspell.json
"kubelet",
"kubernetes",
"pandoc",
"rebalancing",
"schedulable",
"sharded",
"sharder",
Binary file added assets/study-project-cpu.pdf
Binary file added assets/study-project-design.pdf
Binary file added assets/study-project-memory.pdf
63 changes: 44 additions & 19 deletions content/30-related-work.md
All of these drawbacks limit the horizontal scalability of such sharded setups.
Hence, these approaches cannot serve as a general solution for making Kubernetes controllers horizontally scalable.

## Study Project {#sec:related-study-project}

![Study project sharding architecture [@studyproject]](../assets/study-project-design.pdf)

A previous study project [@studyproject] presents a design and implementation for sharding Kubernetes controllers by leveraging established sharding approaches from distributed databases.
The design introduces a sharder that runs in one of the controller instances as determined by a lease-based leader election mechanism.
It is responsible for sharding-related tasks like failure detection, partitioning, assignments, and coordination.

To realize membership and failure detection, a lease-based mechanism inspired by Bigtable [@bigtable] is employed.
Each controller instance maintains a shard lease with its individual instance ID, serving as a heartbeat resource for announcing membership to the sharder.
The sharder watches these leases to react to state changes, and performs shard termination and object reassignment upon lease expiration or voluntary termination.
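
The following Go sketch illustrates this heartbeat mechanism with client-go. The lease name, the `sharding/clusterring` label used for discovery, and the 15-second lease duration are illustrative assumptions, not the study project's actual values.

```go
package sharding

import (
	"context"
	"time"

	coordinationv1 "k8s.io/api/coordination/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// announceMembership creates the shard lease for this instance. In a real
// controller, the instance would renew the lease periodically (update RenewTime)
// and handle an AlreadyExists error by updating the existing lease instead.
func announceMembership(ctx context.Context, c kubernetes.Interface, namespace, instanceID string) error {
	now := metav1.NewMicroTime(time.Now())
	leaseDuration := int32(15)

	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{
			Name:      instanceID,
			Namespace: namespace,
			// Hypothetical label that lets the sharder discover shard leases.
			Labels: map[string]string{"sharding/clusterring": "my-controller"},
		},
		Spec: coordinationv1.LeaseSpec{
			HolderIdentity:       &instanceID,
			LeaseDurationSeconds: &leaseDuration,
			RenewTime:            &now,
		},
	}

	_, err := c.CoordinationV1().Leases(namespace).Create(ctx, lease, metav1.CreateOptions{})
	return err
}
```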

For partitioning, a variant of consistent hashing inspired by Apache Cassandra [@cassandradocs] and Amazon Dynamo [@dynamo] is used to determine shard ownership based on the API object's metadata.
This minimizes object movement during instance additions or removals and provides a simple, deterministic algorithm for shard assignment.
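
As an illustration of this partitioning scheme, the following sketch implements a small consistent hash ring with virtual nodes. The hash function, number of virtual nodes, and key format are assumptions for the example; the study project's actual implementation differs in its details.

```go
package sharding

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	tokens []uint32          // sorted hash tokens on the ring
	owner  map[uint32]string // token -> instance ID
}

func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

// newRing places every instance on the ring multiple times (virtual nodes)
// so that adding or removing an instance only moves a small share of objects.
func newRing(instances []string, virtualNodes int) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, id := range instances {
		for v := 0; v < virtualNodes; v++ {
			t := hashKey(fmt.Sprintf("%s-%d", id, v))
			r.tokens = append(r.tokens, t)
			r.owner[t] = id
		}
	}
	sort.Slice(r.tokens, func(i, j int) bool { return r.tokens[i] < r.tokens[j] })
	return r
}

// assign returns the instance responsible for an object, keyed e.g. by "namespace/name".
// It picks the first token clockwise from the object's own hash.
func (r *ring) assign(objectKey string) string {
	if len(r.tokens) == 0 {
		return ""
	}
	h := hashKey(objectKey)
	i := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= h })
	if i == len(r.tokens) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.tokens[i]]
}
```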

In addition to watching shard leases, the sharder also watches the metadata of sharded API objects to react to object creations and assignment changes.
Objects are assigned to specific controller instances by the sharder, and this assignment is persisted in a `shard` label on the objects themselves.
The sharder's role is critical during initial assignments but is not in the critical path for ongoing reconciliations.
All shards use filtered watches with a label selector on their own instance ID so that they only watch and reconcile the API objects assigned to them.
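
A shard-side filtered watch could look like the following sketch, which uses client-go and assumes the `shard` label described above; the watched resource type (ConfigMaps) is only a placeholder.

```go
package sharding

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchAssignedObjects starts a watch that only delivers objects whose
// `shard` label matches this instance's ID.
func watchAssignedObjects(ctx context.Context, c kubernetes.Interface, namespace, instanceID string) error {
	w, err := c.CoreV1().ConfigMaps(namespace).Watch(ctx, metav1.ListOptions{
		LabelSelector: "shard=" + instanceID,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		// Only objects assigned to this shard arrive here; events for objects
		// assigned to other shards are filtered out by the API server.
		switch event.Type {
		case watch.Added, watch.Modified:
			// enqueue the object for reconciliation ...
		case watch.Deleted:
			// clean up local state ...
		}
	}
	return nil
}
```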

Concurrent reconciliations of a single object in multiple controller instances during object movements are prevented by following a dedicated handover protocol.
If the current shard is still available, it needs to acknowledge the movement and confirm that it stops reconciling the object.
For this, the sharder first adds a `drain` label to the object and waits until the current shard has removed both the `drain` and `shard` label.
Only then does the sharder set the `shard` label to the instance ID of the new, desired shard.
If the current shard is not available, objects are immediately assigned to the new shard without setting the `drain` label first.
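
The shard-side part of this handover protocol could be implemented roughly as in the following controller-runtime sketch. The `shard` and `drain` label keys and the ConfigMap resource are placeholders; the study project's actual implementation handles more edge cases.

```go
package sharding

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type Reconciler struct {
	client.Client
	InstanceID string
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &corev1.ConfigMap{} // placeholder for the sharded resource type
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	labels := obj.GetLabels()
	if _, draining := labels["drain"]; draining {
		// Acknowledge the handover: remove both labels and stop reconciling.
		// The sharder waits for this before assigning the object to the new shard.
		delete(labels, "drain")
		delete(labels, "shard")
		obj.SetLabels(labels)
		return ctrl.Result{}, r.Update(ctx, obj)
	}

	// ... regular reconciliation of the object ...
	return ctrl.Result{}, nil
}
```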

This design is a good step towards horizontally scalable Kubernetes controllers.
It distributes the reconciliation work well across the available controller instances.
The design also enables dynamic scaling of the controller replicas during runtime by rebalancing objects automatically.
Because watch requests use shard-specific label selectors, the CPU and memory usage for processing watch events and caching objects is distributed across the instances as well.

The design was implemented generically in the controller-runtime library.
With this, it can be reused in other controllers that are written in Go and based on controller-runtime.
Controllers that are not written in Go or do not use controller-runtime could adopt the same sharding mechanisms, but they cannot reuse the presented implementation.
Another drawback of the design is the extra CPU and memory usage of the sharder's watches and caches for sharded API objects.
The sharder needs to watch all sharded API objects in one of the controller instances, which causes CPU and memory usage proportional to the number of sharded objects.
Hence, the scalability limitation of controllers is not resolved but only shifted to the active sharder instance (see [@fig:study-project-memory]).
Additionally, the per-object assignments require many sharder reconciliations and API requests, especially during rolling updates, which increases the load on the control plane.

![Study project memory usage by pod [@studyproject]](../assets/study-project-memory.pdf){#fig:study-project-memory}

## knative

See <https://kubevela.io/docs/platform-engineers/system-operation/controller-sharding.html>.
- when assigned shard is down, objects are not moved
- static shard names?

## Sharding on Workload Level?

- ingress controller sharding by route: <https://docs.openshift.com/container-platform/4.14/networking/ingress-sharding.html>
- machine learning applications: <https://medium.com/workday-engineering/implementing-a-fully-automated-sharding-strategy-on-kubernetes-for-multi-tenanted-machine-learning-4371c48122ae>

### Prometheus

- not controller-based sharding, but uses API machinery for service discovery
- `modulus` in service discovery config: <https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config>
- scaling down shards does not reshard data onto remaining instances, it must be manually moved
- scaling up shards does not reshard data, but it will continue to be available from the same instances

### kube-state-metrics

See <https://github.com/kubernetes/kube-state-metrics#horizontal-sharding>.

31 changes: 31 additions & 0 deletions content/bibliography.bib
@misc{gardenerdocs,
url = {https://gardener.cloud/},
urldate = {2024-01-02}
}

@inproceedings{bigtable,
title = {Bigtable: A Distributed Storage System for Structured Data},
author = {Fay Chang and Jeffrey Dean and Sanjay Ghemawat and Wilson C. Hsieh and Deborah A. Wallach and Mike Burrows and Tushar Chandra and Andrew Fikes and Robert E. Gruber},
year = {2006},
booktitle = {7th {USENIX} Symposium on Operating Systems Design and Implementation (OSDI)},
pages = {205--218}
}

@misc{cassandradocs,
title = {Cassandra Documentation},
author = {{The Apache Software Foundation}},
  year = {2024},
url = {https://cassandra.apache.org/doc/4.1/},
urldate = {2024-01-02}
}

@inproceedings{dynamo,
author = {DeCandia, Giuseppe and Hastorun, Deniz and Jampani, Madan and Kakulapati, Gunavardhan and Lakshman, Avinash and Pilchin, Alex and Sivasubramanian, Swaminathan and Vosshall, Peter and Vogels, Werner},
title = {Dynamo: Amazon's Highly Available Key-Value Store},
year = {2007},
isbn = {9781595935915},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
booktitle = {Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles},
  pages = {205--220},
numpages = {16},
keywords = {scalability, reliability, performance},
location = {Stevenson, Washington, USA},
series = {SOSP '07}
}
