From ee801aa39907fc6868bc79d56cf1ffcb5334c4ed Mon Sep 17 00:00:00 2001 From: Ellis Tarn Date: Wed, 16 Jun 2021 14:59:45 -0700 Subject: [PATCH] Added authors to existing design doc (#459) --- ...ESIGN.md => metrics-driven-autoscaling.md} | 2 + docs/deprecated/designs/scheduled_capacity.md | 178 +++++++++--------- docs/designs/aws-launch-templates-options.md | 2 +- docs/{aws => designs}/bin-packing.md | 1 + docs/designs/termination.md | 22 +-- 5 files changed, 102 insertions(+), 103 deletions(-) rename docs/deprecated/designs/{DESIGN.md => metrics-driven-autoscaling.md} (99%) rename docs/{aws => designs}/bin-packing.md (98%) diff --git a/docs/deprecated/designs/DESIGN.md b/docs/deprecated/designs/metrics-driven-autoscaling.md similarity index 99% rename from docs/deprecated/designs/DESIGN.md rename to docs/deprecated/designs/metrics-driven-autoscaling.md index 7b5f34ff2ecb..d8c3676d8e02 100644 --- a/docs/deprecated/designs/DESIGN.md +++ b/docs/deprecated/designs/metrics-driven-autoscaling.md @@ -1,5 +1,7 @@ # Metrics Driven Autoscaling +*Authors: ellistarn@* + Node Autoscaling (a.k.a. Cluster Autoscaling) is the process of continually adding and removing a cluster’s nodes to meet the resource demands of its pods. As users scale to increasingly large clusters, autoscaling becomes necessary for both practicality and cost reasons. While overprovisioning is a viable approach at smaller scales, it becomes prohibitively expensive as organizations grow. In response to increasing infrastructure costs, some users create manual processes to scale node groups, but this approach yields inefficient resource utilization and is error prone. Node autoscaling replaces these manual processes with automation. ## Overview diff --git a/docs/deprecated/designs/scheduled_capacity.md b/docs/deprecated/designs/scheduled_capacity.md index 0871955bc5f4..5e07d53b0418 100644 --- a/docs/deprecated/designs/scheduled_capacity.md +++ b/docs/deprecated/designs/scheduled_capacity.md @@ -1,8 +1,9 @@ # Scheduled Capacity Design +*Authors: njtran@* ## Introduction -Today, some Kubernetes users handle their workloads by scaling up and down in a recurring pattern. These patterns are -often indicative of some change in operational load and can come in the form of anything from a series of complex -scaling decisions to a one-off scale decision. +Today, some Kubernetes users handle their workloads by scaling up and down in a recurring pattern. These patterns are +often indicative of some change in operational load and can come in the form of anything from a series of complex +scaling decisions to a one-off scale decision. ## User Stories * As a user I can periodically scale up and scale down my resources @@ -11,31 +12,31 @@ scaling decisions to a one-off scale decision. * As a user I can see the current and future recommended states of my resources ## Background -The important parts of Karpenter to take note of will be the HorizontalAutoscaler and the MetricsProducer. For any -user-specified resource, the MetricsProducer will be responsible for parsing the user input, calculating the metric -recommendation, and exposing it to the metrics endpoint. The HorizontalAutoscaler will be responsible for sending the +The important parts of Karpenter to take note of will be the HorizontalAutoscaler and the MetricsProducer. For any +user-specified resource, the MetricsProducer will be responsible for parsing the user input, calculating the metric +recommendation, and exposing it to the metrics endpoint. 
The HorizontalAutoscaler will be responsible for sending the signals to scale the resource by using a `promql` query to grab the metric that the MetricsProducer has created. -The core of each MetricsProducer is a reconcile loop, which runs at a pre-configured interval of time, and a record -function. The reconciliation ensures the metric is always being calculated, while the record function makes the data +The core of each MetricsProducer is a reconcile loop, which runs at a pre-configured interval of time, and a record +function. The reconciliation ensures the metric is always being calculated, while the record function makes the data available to the Prometheus server at every iteration of the loop. ![](../images/scheduled-capacity-dataflow-diagram.png) -While a HorizontalAutoscaler can only scale one resource, the metric that a MetricsProducer makes available can be used -by any amount of HorizontalAutoscalers. In addition, with a more complex `promql` -[query](https://prometheus.io/docs/prometheus/latest/querying/basics/), a user can also use a HorizontalAutoscaler to -scale based off multiple MetricsProducers. +While a HorizontalAutoscaler can only scale one resource, the metric that a MetricsProducer makes available can be used +by any amount of HorizontalAutoscalers. In addition, with a more complex `promql` +[query](https://prometheus.io/docs/prometheus/latest/querying/basics/), a user can also use a HorizontalAutoscaler to +scale based off multiple MetricsProducers. For more details, refer to [Karpenter’s design doc](DESIGN.md). ## Design -This design encompasses the `ScheduleSpec` and `ScheduledCapacityStatus` structs. The spec corresponds to the user -input specifying the scheduled behaviors. The status will be used as a way for the user to check the state of the -metric through `kubectl` commands. +This design encompasses the `ScheduleSpec` and `ScheduledCapacityStatus` structs. The spec corresponds to the user +input specifying the scheduled behaviors. The status will be used as a way for the user to check the state of the +metric through `kubectl` commands. ### Metrics Producer Spec -The `ScheduleSpec` is where the user will specify the times in which a schedule will activate and recommend what the +The `ScheduleSpec` is where the user will specify the times in which a schedule will activate and recommend what the value of the metric should be. ```go @@ -62,23 +63,23 @@ type ScheduledBehavior struct { // Pattern is a strongly-typed version of crontabs type Pattern struct { - // When minutes or hours are left out, they are assumed to match to 0 - Minutes *string `json:"minutes,omitempty"` - Hours *string `json:"hours,omitempty"` - // When Days, Months, or Weekdays are left out, - // they are represented by wildcards, meaning any time matches - Days *string `json:"days,omitempty"` + // When minutes or hours are left out, they are assumed to match to 0 + Minutes *string `json:"minutes,omitempty"` + Hours *string `json:"hours,omitempty"` + // When Days, Months, or Weekdays are left out, + // they are represented by wildcards, meaning any time matches + Days *string `json:"days,omitempty"` // List of 3-letter abbreviations i.e. Jan, Feb, Mar - Months *string `json:"months,omitempty"` - // List of 3-letter abbreviations i.e. "Mon, Tue, Wed" + Months *string `json:"months,omitempty"` + // List of 3-letter abbreviations i.e. "Mon, Tue, Wed" Weekdays *string `json:"weekdays,omitempty"` } ``` -The spec below details how a user might configure their scheduled behaviors. 
The picture to the right corresponds to +The spec below details how a user might configure their scheduled behaviors. The picture to the right corresponds to the configuration. -This configuration is scaling up for 9-5 on weekdays (red), scaling down a little at night (green), and then scaling +This configuration is scaling up for 9-5 on weekdays (red), scaling down a little at night (green), and then scaling down almost fully for the weekends (blue). ![](../images/scheduled-capacity-example-schedule-graphic.png) ```yaml @@ -101,7 +102,7 @@ spec: hours: 9 // Scale up on Weekdays for usual traffic - replicas: 3 - start: + start: weekdays: Mon,Tue,Wed,Thu,Fri hours: 9 end: @@ -109,7 +110,7 @@ spec: hours: 17 // Scale down on weekday evenings but not as much as on weekends - replicas: 2 - start: + start: weekdays: Mon,Tue,Wed,Thu,Fri hours: 17 end: @@ -118,30 +119,30 @@ spec: ``` ### Metrics Producer Status Struct -The `ScheduledCapacityStatus` can be used to monitor the MetricsProducer. The results of the algorithm will populate -this struct at every iteration of the reconcile loop. A user can see the values of this struct with +The `ScheduledCapacityStatus` can be used to monitor the MetricsProducer. The results of the algorithm will populate +this struct at every iteration of the reconcile loop. A user can see the values of this struct with `kubectl get metricsproducers -oyaml`. ```go type ScheduledCapacityStatus struct { // The current recommendation - the metric the MetricsProducer is emitting - CurrentValue *int32 `json:"currentValue,omitempty"` + CurrentValue *int32 `json:"currentValue,omitempty"` // The time where CurrentValue will switch to NextValue - NextValueTime *apis.VolatileTime `json:"nextValueTime,omitempty"` - + NextValueTime *apis.VolatileTime `json:"nextValueTime,omitempty"` + // The next recommendation for the metric - NextValue *int32 `json:"nextValue,omitempty"` + NextValue *int32 `json:"nextValue,omitempty"` } ``` ## Algorithm Design -The algorithm will parse all behaviors and the start and end schedule formats. We find the `nextStartTime` and -`nextEndTime` for each of the schedules. These will be the times they next match in the future. +The algorithm will parse all behaviors and the start and end schedule formats. We find the `nextStartTime` and +`nextEndTime` for each of the schedules. These will be the times they next match in the future. We say a schedule matches if the following are all true: -* The current time is before or equal to the `nextEndTime` -* The `nextStartTime` is after or equal to the `nextEndTime` +* The current time is before or equal to the `nextEndTime` +* The `nextStartTime` is after or equal to the `nextEndTime` Based on how many schedules match: @@ -152,34 +153,34 @@ Based on how many schedules match: This algorithm and API choice are very similar to [KEDA’s Cron Scaler](https://keda.sh/docs/2.0/scalers/cron/). ## Strongly-Typed vs Crontabs -Most other time-based schedulers use Crontabs as their API. This section discusses why we chose against Crontabs and +Most other time-based schedulers use Crontabs as their API. This section discusses why we chose against Crontabs and how the two choices are similar. -* The [Cron library](https://github.com/robfig/cron) captures too broad of a scope for our use-case. - * In planning critical scaling decisions, freedom can hurt more than help. One malformed scale signal can cost the - user a lot more money, or even scale down unexpectedly. 
- * While our implementation will use the Cron library, picking a strongly-typed API will allows us to decide which +* The [Cron library](https://github.com/robfig/cron) captures too broad of a scope for our use-case. + * In planning critical scaling decisions, freedom can hurt more than help. One malformed scale signal can cost the + user a lot more money, or even scale down unexpectedly. + * While our implementation will use the Cron library, picking a strongly-typed API will allows us to decide which portions of the library we want to allow the users to configure. -* The wide range of functionality Cron provides is sometimes misunderstood +* The wide range of functionality Cron provides is sometimes misunderstood (e.g. [Crontab Pitfalls](#crontab-pitfalls)). - * Adopting Crontab syntax adopts its pitfalls, which can be hard to fix in the future. - * If users have common problems involving Cron, it is more difficult to fix than if they were problems specific to + * Adopting Crontab syntax adopts its pitfalls, which can be hard to fix in the future. + * If users have common problems involving Cron, it is more difficult to fix than if they were problems specific to Karpenter. -* Karpenter’s metrics signals are best described as level-triggered. Crontabs were created to describe when to trigger -Cronjobs, which is best described as edge-triggered. - * If a user sees Crontabs, they may assume that Karpenter is edge-triggered behind the scenes, which - implies certain [problems](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d) - with availability. +* Karpenter’s metrics signals are best described as level-triggered. Crontabs were created to describe when to trigger +Cronjobs, which is best described as edge-triggered. + * If a user sees Crontabs, they may assume that Karpenter is edge-triggered behind the scenes, which + implies certain [problems](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d) + with availability. * We want our users to infer correctly what is happening behind the scenes. - + ## Field Plurality and Configuration Bloat -While Crontabs allow a user to specify **ranges** and **lists** of numbers/strings, we chose to **only** allow a **list** of -numbers/strings. Having a start and stop configuration in the form of Crontabs can confuse the user if they use overly -complex configurations. Reducing the scope of their choices to just a list of values can make it clearer. +While Crontabs allow a user to specify **ranges** and **lists** of numbers/strings, we chose to **only** allow a **list** of +numbers/strings. Having a start and stop configuration in the form of Crontabs can confuse the user if they use overly +complex configurations. Reducing the scope of their choices to just a list of values can make it clearer. -It is important to allow a user to specify multiple values to ease configuration load. While simpler cases like below -are easier to understand, adding more Crontab aspects like skip values and ranges can be much harder to mentally parse -at more complex levels of planning. We want to keep the tool intuitive, precise, and understandable, so that users who +It is important to allow a user to specify multiple values to ease configuration load. While simpler cases like below +are easier to understand, adding more Crontab aspects like skip values and ranges can be much harder to mentally parse +at more complex levels of planning. 
We want to keep the tool intuitive, precise, and understandable, so that users who understand their workloads can easily schedule them. ```yaml @@ -195,7 +196,7 @@ spec: // This spec WILL NOT work according to the design. // Scale up on Weekdays for usual traffic - replicas: 7 - start: + start: weekdays: Mon-Fri hours: 9 months: Jan-Mar @@ -206,7 +207,7 @@ spec: // This spec WILL work according to the design. // Scale down on weekday evenings - replicas: 7 - start: + start: weekdays: Mon,Tue,Wed,Thu,Fri hours: 9 months: Jan,Feb,Mar @@ -219,54 +220,54 @@ spec: ## FAQ ### How does this design handle collisions right now? -* In the MVP, if a schedule ends when another starts, it will select the schedule that is starting. If more than one +* In the MVP, if a schedule ends when another starts, it will select the schedule that is starting. If more than one are starting/valid, then it will use the schedule that comes first in the spec. -* Look at the Out of Scope https://quip-amazon.com/zQ7mAxg0wNDC/Karpenter-Periodic-Autoscaling#ANY9CAbqSLH below for +* Look at the Out of Scope https://quip-amazon.com/zQ7mAxg0wNDC/Karpenter-Periodic-Autoscaling#ANY9CAbqSLH below for more details. ### How would a priority system work for collisions? -* Essentially, every schedule would have some associated Priority. If multiple schedules match to the same time, the -one with the higher priority will win. In the event of a tie, we resort to position in the spec. Whichever schedule is +* Essentially, every schedule would have some associated Priority. If multiple schedules match to the same time, the +one with the higher priority will win. In the event of a tie, we resort to position in the spec. Whichever schedule is configured first will win. ### How can I leverage this tool to work with other metrics? * Using this metric in tandem with others is a part of the Karpenter HorizontalAutoscaler. There are many possibilities, and it’s possible to do so with all metrics in prometheus, as long as they return an instant vector (a singular value). -* Let’s say a user is scaling based-off a queue (a metric currently supported by Karpenter). If they’d like to keep a -healthy minimum value regardless of the size of the queue to stay ready for an abnormally large batch of jobs, they can +* Let’s say a user is scaling based-off a queue (a metric currently supported by Karpenter). If they’d like to keep a +healthy minimum value regardless of the size of the queue to stay ready for an abnormally large batch of jobs, they can configure their HorizontalAutoscaler’s Spec.Metrics.Prometheus.Query field to be the line below. -`max(karpenter_queue_length{name="ml-training-queue"},karpenter_scheduled_capacity{name="schedules"})` +`max(karpenter_queue_length{name="ml-training-queue"},karpenter_scheduled_capacity{name="schedules"})` ### Is it required to use Prometheus? -* Currently, Karpenter’s design has a dependency on Prometheus. We use Prometheus to store the data that the core design +* Currently, Karpenter’s design has a dependency on Prometheus. We use Prometheus to store the data that the core design components (MetricsProducer, HorizontalAutoscaler, ScalableNodeGroup) use to communicate with each other. ### Why Karpenter HorizontalAutoscaler and MetricsProducer? Why not use the HPA? -* Karpenter’s design details why we have a CRD called HorizontalAutoscaler, and how our MetricsProducers complement -them. 
While there are a lot of similarities, there are key differences as detailed in the design +* Karpenter’s design details why we have a CRD called HorizontalAutoscaler, and how our MetricsProducers complement +them. While there are a lot of similarities, there are key differences as detailed in the design [here](../designs/DESIGN.md#alignment-with-the-horizontal-pod-autoscaler-api). ## Out of Scope - Additional Future Features -Our current design currently does not have a robust way to handle collisions and help visualize how the metric will look -over time. While these are important issues, their implementations will not be included in the MVP. +Our current design currently does not have a robust way to handle collisions and help visualize how the metric will look +over time. While these are important issues, their implementations will not be included in the MVP. ### Collisions -Collisions occur when more than one schedule matches to the current time. When this happens, the MetricsProducer cannot +Collisions occur when more than one schedule matches to the current time. When this happens, the MetricsProducer cannot emit more than one value at a time, so it must choose one value. * When could collisions happen past user error? * When a user wants a special one-off scale up request - * e.g. MyCompany normally has `x` replicas on all Fridays at 06:00 and `y` replicas on Fridays at 20:00, but + * e.g. MyCompany normally has `x` replicas on all Fridays at 06:00 and `y` replicas on Fridays at 20:00, but wants `z` replicas on Black Friday at 06:00 * For only this Friday, the normal Friday schedule and this special one-off request will conflict * Solutions for collision handling * Create a warning with the first times that a collision could happen - * Doesn’t decide functionality for users + * Doesn’t decide functionality for users * Does not guarantee it will be resolved * Associate each schedule with a priority which will be used in comparison to other colliding schedules * Requires users to rank each of their schedules, which they may want to change based on the time they collide @@ -274,7 +275,7 @@ emit more than one value at a time, so it must choose one value. * Use the order in which they’re specified in the spec **OR** * Default to the defaultReplicas -The only change to the structs from the initial design would be to add a Priority field in the ScheduledBehavior struct +The only change to the structs from the initial design would be to add a Priority field in the ScheduledBehavior struct as below. ```go type ScheduledBehavior struct { @@ -286,9 +287,9 @@ type ScheduledBehavior struct { ``` ### Configuration Complexity -When a user is configuring their resources, it’s easy to lose track of how the metric will look over time, especially -if a user may want to plan far into the future with many complex behaviors. Creating a tool to visualize schedules will -not only help users understand how their schedules will match up, but can ease concerns during the configuration +When a user is configuring their resources, it’s easy to lose track of how the metric will look over time, especially +if a user may want to plan far into the future with many complex behaviors. Creating a tool to visualize schedules will +not only help users understand how their schedules will match up, but can ease concerns during the configuration process. This can empower users to create even more complex schedules to match their needs. 
Possible Designs: @@ -300,7 +301,7 @@ Possible Designs: * Cons * Requires users to manually use it * Requires a lot more work to implement a UI to help create the YAML as opposed to just a tool to validate -* An extra function to be included as part of the MetricsProducerStatus +* An extra function to be included as part of the MetricsProducerStatus * Pros * Always available to see the visualization with kubectl commands * Cons @@ -308,16 +309,16 @@ Possible Designs: * Cannot use to check-as-you-go for configuration purposes ## Crontab Pitfalls -This design includes the choice to use a strongly-typed API due to cases where Crontabs do not act as expected. Below +This design includes the choice to use a strongly-typed API due to cases where Crontabs do not act as expected. Below is the most common misunderstanding. -* Let's say I have a schedule to trigger on the following dates: +* Let's say I have a schedule to trigger on the following dates: * Schedule A: First 3 Thursdays of January * Schedule B: The Friday of the last week of January and the first 2 weeks of February * Schedule C: Tuesday for every week until the end of March after Schedule B * This is how someone might do it * Schedule A - "* * 1-21 1 4" for the first three Thursdays of January - * Schedule B - "* * 22-31 1 5" for the last week of January and "* * 1-14 2 5" for the first two weeks of February + * Schedule B - "* * 22-31 1 5" for the last week of January and "* * 1-14 2 5" for the first two weeks of February * Schedule C - "* * 15-31 2 2" for the last Tuesdays in February and "* * * 3 2" for the Tuesdays in March * Problems with the above approach * Schedule A will match to any day in January that is in 1/1 to 1/21 or is a Thursday @@ -325,11 +326,6 @@ is the most common misunderstanding. * Schedule B’s second crontab will match to any day in February that is in 2/1 to 2/14 or is a Friday * Schedule C’s first crontab will match to any day in February that is in 2/15 to 2/31 or is a Tuesday * Schedule C’s second crontab is the only one that works as intended. -* The way that crontabs are implemented is if both Dom and Dow are non-wildcards (as they are above in each of the -crontabs except for Schedule C’s second crontab), then the crontab is treated as a match if **either** the Dom **or** Dow -matches. - - - - - +* The way that crontabs are implemented is if both Dom and Dow are non-wildcards (as they are above in each of the +crontabs except for Schedule C’s second crontab), then the crontab is treated as a match if **either** the Dom **or** Dow +matches. diff --git a/docs/designs/aws-launch-templates-options.md b/docs/designs/aws-launch-templates-options.md index 64d461907b4c..d8080d137daf 100644 --- a/docs/designs/aws-launch-templates-options.md +++ b/docs/designs/aws-launch-templates-options.md @@ -1,5 +1,5 @@ # AWS Launch Template Options - +*Authors: JacobGabrielson@* ## Intro This document presents some options for how the AWS-specific (cloud diff --git a/docs/aws/bin-packing.md b/docs/designs/bin-packing.md similarity index 98% rename from docs/aws/bin-packing.md rename to docs/designs/bin-packing.md index 938fc31d502d..30643c5b2aca 100644 --- a/docs/aws/bin-packing.md +++ b/docs/designs/bin-packing.md @@ -1,4 +1,5 @@ # Bin Packing Design Considerations +*Authors: prateekgogia@* > Note: this is not a final design; this is still in POC stage and > some things might change. 
diff --git a/docs/designs/termination.md b/docs/designs/termination.md index 9d1337fd3594..183312cf3fe9 100644 --- a/docs/designs/termination.md +++ b/docs/designs/termination.md @@ -1,7 +1,7 @@ # Karpenter Graceful Node Termination - +*Authors: njtran@* ## Overview -Karpenter's scale down implementation is currently a proof of concept. The reallocation controller implements two actions. First, nodes are elected for termination when there aren't any pods scheduled to them. Second, nodes are cordoned and drained and deleted. Node termination follows cordon and drain [best practices](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/). +Karpenter's scale down implementation is currently a proof of concept. The reallocation controller implements two actions. First, nodes are elected for termination when there aren't any pods scheduled to them. Second, nodes are cordoned and drained and deleted. Node termination follows cordon and drain [best practices](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/). This design explores improvements to the termination process and proposes the separation of this logic into a new termination controller, installed as part of Karpenter. @@ -23,11 +23,11 @@ The new termination process will begin with a node that receives a delete reques ![](../images/termination-state-machine.png) ### Triggering Termination -The current termination process acts on a reconcile loop. We will change the termination controller to watch nodes and manage the Karpenter [finalizer](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#finalizers), making it responsible for all node termination and pod eviction logic. +The current termination process acts on a reconcile loop. We will change the termination controller to watch nodes and manage the Karpenter [finalizer](https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#finalizers), making it responsible for all node termination and pod eviction logic. -Finalizers allow controllers to implement asynchronous pre-deletion hooks and are commonly used with CRDs like Karpenter’s Provisioners. Today, a user can call `kubectl delete node` to delete a node object, but will end up leaking the underlying instance by only deleting the node object in the cluster. We will use finalizers to gracefully terminate underlying instances before Karpenter provisioned nodes are deleted, preventing instance leaking. Relying on `kubectl` for terminations gives the user more control over their cluster and a Kubernetes-native way of deleting nodes - as opposed to the status quo of doing it manually in a cloud provider's console. +Finalizers allow controllers to implement asynchronous pre-deletion hooks and are commonly used with CRDs like Karpenter’s Provisioners. Today, a user can call `kubectl delete node` to delete a node object, but will end up leaking the underlying instance by only deleting the node object in the cluster. We will use finalizers to gracefully terminate underlying instances before Karpenter provisioned nodes are deleted, preventing instance leaking. Relying on `kubectl` for terminations gives the user more control over their cluster and a Kubernetes-native way of deleting nodes - as opposed to the status quo of doing it manually in a cloud provider's console. 
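To make the finalizer-driven flow above concrete, here is a minimal, self-contained sketch of the order of operations such a termination loop could follow: cordon, drain, terminate the underlying instance, then remove the finalizer. The `Node`, `CloudProvider`, and `terminate` names below are illustrative stand-ins, not Karpenter or Kubernetes APIs.

```go
package main

import (
	"fmt"
	"time"
)

// karpenterFinalizer is an illustrative finalizer key; the real key used by
// Karpenter may differ.
const karpenterFinalizer = "karpenter.sh/termination"

// Node is a stripped-down stand-in for the Kubernetes Node object, keeping
// only the fields this sketch needs.
type Node struct {
	Name              string
	Finalizers        []string
	DeletionTimestamp *time.Time
	Unschedulable     bool
}

// CloudProvider is an assumed abstraction over the cloud API backing the node.
type CloudProvider interface {
	Terminate(nodeName string) error
}

// terminate walks the graceful-termination steps in order: cordon, drain,
// terminate the underlying instance, then remove the finalizer so the API
// server can finish deleting the node object.
func terminate(node *Node, cloud CloudProvider, drain func(*Node) error) error {
	if node.DeletionTimestamp == nil || !hasFinalizer(node, karpenterFinalizer) {
		return nil // no delete request, or this node is not managed by the finalizer
	}
	node.Unschedulable = true // cordon: stop new pods from scheduling here
	if err := drain(node); err != nil {
		return fmt.Errorf("draining %s: %w", node.Name, err) // retried on the next reconcile
	}
	if err := cloud.Terminate(node.Name); err != nil {
		return fmt.Errorf("terminating instance for %s: %w", node.Name, err)
	}
	removeFinalizer(node, karpenterFinalizer) // unblocks deletion of the node object
	return nil
}

func hasFinalizer(node *Node, finalizer string) bool {
	for _, f := range node.Finalizers {
		if f == finalizer {
			return true
		}
	}
	return false
}

func removeFinalizer(node *Node, finalizer string) {
	kept := node.Finalizers[:0]
	for _, f := range node.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	node.Finalizers = kept
}

type fakeCloud struct{}

func (fakeCloud) Terminate(string) error { return nil }

func main() {
	now := time.Now()
	node := &Node{Name: "node-1", Finalizers: []string{karpenterFinalizer}, DeletionTimestamp: &now}
	err := terminate(node, fakeCloud{}, func(*Node) error { return nil })
	fmt.Println(err, node.Finalizers) // <nil> []
}
```

Because the finalizer is only removed after the underlying instance is gone, a `kubectl delete node` on a managed node cannot leak the instance under this ordering.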
-We will additionally implement a Karpenter [Webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) to validate node deletion requests and add finalizers to nodes that have been cleared for deletion. If the request will not violate a Node Disruption Budget (discussed below) and Karpenter is installed, the webhook will add the Karpenter finalizer to nodes and then allow the deletion request to go through, triggering the workflow. +We will additionally implement a Karpenter [Webhook](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/) to validate node deletion requests and add finalizers to nodes that have been cleared for deletion. If the request will not violate a Node Disruption Budget (discussed below) and Karpenter is installed, the webhook will add the Karpenter finalizer to nodes and then allow the deletion request to go through, triggering the workflow. ### Eviction @@ -43,7 +43,7 @@ In the case where a user has enabled termination protection for their underlying ### User Configuration -The termination controller and associated webhooks will come installed with Karpenter, requiring no additional configuration on the user’s part. +The termination controller and associated webhooks will come installed with Karpenter, requiring no additional configuration on the user’s part. We will allow users to specify a `karpenter.sh/do-not-evict` label on their pods, guaranteeing that we will not evict certain pods. A node with `do-not-evict` pods will cordon but wait to drain until all `do-not-evict` pods are gone. This way, the cluster will continue to utilize its existing capacity until the `do-not-evict` pods terminate. Users can use Karpenter’s scheduling logic to colocate pods with this label onto similar nodes to load balance these pods. @@ -53,9 +53,9 @@ The termination controller will be able to drain and terminate multiple nodes in We introduce an optional cluster-scoped CRD, the Node Disruption Budget (NDB), a [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets) for nodes. A user can scope an NDB to Karpenter provisioned nodes through a label selector, since nodes are created with their Provisioner’s labels. `Unavailable` nodes will be `NotReady` or have `metadata.DeletionTimestamp` set. `Available` nodes will be `Ready`. -A termination is allowed if at least minAvailable nodes selected by a selector will still be available after the termination. For example, you can prevent all terminations by specifying “100%”. A termination is also allowed if at most maxUnavailable nodes selected by selector are unavailable after the termination. For example, one can prevent all terminations by specifying 0. The `minAvailable` and `maxUnavailable` fields are mutually exclusive. +A termination is allowed if at least minAvailable nodes selected by a selector will still be available after the termination. For example, you can prevent all terminations by specifying “100%”. A termination is also allowed if at most maxUnavailable nodes selected by selector are unavailable after the termination. For example, one can prevent all terminations by specifying 0. The `minAvailable` and `maxUnavailable` fields are mutually exclusive. -Note that this is an experimental idea, and will require robustness improvements for future features such as defragmentation, over-provisioning, and more. 
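To illustrate the `minAvailable`/`maxUnavailable` semantics above, the sketch below models a hypothetical `NodeDisruptionBudgetSpec` on the PDB spec and shows the check a webhook could run before admitting a node deletion. The type and field names are assumptions (a real CRD would also carry a label selector and accept percentages via `IntOrString`, as PDBs do); only whole-number counts are handled here.

```go
package main

import "fmt"

// NodeDisruptionBudgetSpec is a hypothetical spec modeled on PodDisruptionBudgetSpec.
// Exactly one of MinAvailable or MaxUnavailable would be set; a real CRD would
// likely also accept percentages, which this sketch omits.
type NodeDisruptionBudgetSpec struct {
	MinAvailable   *int // at least this many selected nodes must stay available
	MaxUnavailable *int // at most this many selected nodes may be unavailable
}

// terminationAllowed reports whether deleting one more selected node keeps the
// budget satisfied. available and unavailable are counts of nodes matched by
// the budget's selector before the proposed termination.
func terminationAllowed(available, unavailable int, spec NodeDisruptionBudgetSpec) bool {
	switch {
	case spec.MinAvailable != nil:
		return available-1 >= *spec.MinAvailable
	case spec.MaxUnavailable != nil:
		return unavailable+1 <= *spec.MaxUnavailable
	default:
		return true // no budget configured for these nodes
	}
}

func main() {
	zero := 0
	// maxUnavailable: 0 blocks every termination, matching the example above.
	fmt.Println(terminationAllowed(5, 0, NodeDisruptionBudgetSpec{MaxUnavailable: &zero})) // false

	three := 3
	// minAvailable: 3 allows a termination while 4 selected nodes are available.
	fmt.Println(terminationAllowed(4, 1, NodeDisruptionBudgetSpec{MinAvailable: &three})) // true
}
```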
+Note that this is an experimental idea, and will require robustness improvements for future features such as defragmentation, over-provisioning, and more. [PodDisruptionBudgetSpec](https://pkg.go.dev/k8s.io/api/policy/v1beta1#PodDisruptionBudgetSpec) for reference. @@ -99,9 +99,9 @@ Karpenter is a node autoscaler, so it does not take responsibility for maintaini If a user wants to manually delete a Karpenter provisioned node, this design allows the user to do it safely if Karpenter is installed. Otherwise, the user will need to clean up their resources themselves. -Kubernetes is unable to delete nodes that have finalizers on them. For this reason, we chose to add the Karpenter finalizer only after a delete request is validated. Yet, in the rare case that Karpenter is uninstalled while a node deletion request is processing, to finish terminating the node, the user must either: reinstall Karpenter to resume the termination logic or remove the Karpenter finalizer from the node, allowing the API Server to delete then node. +Kubernetes is unable to delete nodes that have finalizers on them. For this reason, we chose to add the Karpenter finalizer only after a delete request is validated. Yet, in the rare case that Karpenter is uninstalled while a node deletion request is processing, to finish terminating the node, the user must either: reinstall Karpenter to resume the termination logic or remove the Karpenter finalizer from the node, allowing the API Server to delete then node. -If a node is unable to become ready for `15 minutes`, we will terminate the node. As we don’t have the ability or responsibility to diagnose the problem, we would worst case terminate a soon-to-be-healthy node. In this case, the orphaned pod(s) would trigger creation of another node. +If a node is unable to become ready for `15 minutes`, we will terminate the node. As we don’t have the ability or responsibility to diagnose the problem, we would worst case terminate a soon-to-be-healthy node. In this case, the orphaned pod(s) would trigger creation of another node. ## Appendix @@ -127,6 +127,6 @@ In the future, we may implement the following to account for more scale down sit ### Asynchronous Termination Clarifications -When pods are requested to be evicted, they are put into an Eviction Queue specific to the PDB handling the pods. The controller will call evictions serially that run asynchronously and exponentially back off and retry if they fail. +When pods are requested to be evicted, they are put into an Eviction Queue specific to the PDB handling the pods. The controller will call evictions serially that run asynchronously and exponentially back off and retry if they fail. Finalizers are also handled asynchronously. Adding in a Karpenter finalizer doesn’t prevent or delay other controllers from executing finalizer logic on the same node.