diff --git a/multidimensional-pod-autoscaler/AEP.md b/multidimensional-pod-autoscaler/AEP.md new file mode 100644 index 000000000000..b06b2110c5ac --- /dev/null +++ b/multidimensional-pod-autoscaler/AEP.md @@ -0,0 +1,542 @@ +# AEP-5342: Multi-dimensional Pod Autoscaler + +AEP - Autoscaler Enhancement Proposal + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories-optional) + - [A New MPA Framework with Reinforcement Learning](#a-new-mpa-framework-with-reinforcement-learning) + - [Different Scaling Actions for Different Types of Resources](#different-scaling-actions-for-different-types-of-resources) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Unit Tests](#unit-tests) + - [Integration Tests](#integration-tests) + - [End-to-end Tests](#end-to-end-tests) + - [Graduation Criteria](#graduation-criteria) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) AEP approvers have approved the AEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +Currently, Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) control the scaling actions separately as independent controllers to determine the resource allocation for a containerized application. 
+Due to the independence of these two controllers, when they are configured to optimize the same target, e.g., CPU usage, they can lead to an awkward situation where HPA tries to spin up more pods based on the higher-than-threshold CPU usage while VPA tries to squeeze the size of each pod based on the lower CPU usage (after scaling out by HPA).
+The final outcome would be a large number of small pods created for the workloads.
+Manually fine-tuning the timing and prioritization of vertical and horizontal scaling is usually needed to synchronize HPA and VPA.
+
+We propose a Multi-dimensional Pod Autoscaling (MPA) framework that combines the actions of vertical and horizontal autoscaling in a single action but separates the actuation completely from the controlling algorithms.
+It consists of three controllers (i.e., a recommender, an updater, and an admission controller) and an MPA API (i.e., a CRD object or CR) that connects the autoscaling recommendations to actuation.
+The multidimensional scaling algorithm is implemented in the recommender.
+The scaling decisions derived from the recommender are stored in the MPA object.
+The updater and the admission controller retrieve those decisions from the MPA object and actuate the vertical and horizontal actions.
+Our proposed MPA (with the separation of recommendations from actuation) allows developers to replace the default recommender with a customized one that implements advanced algorithms controlling both scaling actions across different resource dimensions.
+
+## Motivation
+
+To scale application Deployments, Kubernetes supports both horizontal and vertical scaling with a Horizontal Pod Autoscaler (HPA) and a Vertical Pod Autoscaler (VPA), respectively.
+Currently, [HPA] and [VPA] work separately as independent controllers to determine the resource allocation of a containerized application.
+- HPA determines the number of replicas for each Deployment of an application with the aim of automatically scaling the workload to match demand. The HPA controller, running within the Kubernetes control plane, periodically adjusts the desired scale of its target (e.g., a Deployment) to match observed metrics such as average CPU utilization, average memory utilization, or any other custom metric the users specify (e.g., the rate of client requests per second or I/O writes per second). The autoscaling algorithm that the HPA controller uses is based on the equation `desired_replicas = current_replicas * (current_metric_value / desired_metric_value)`. +- VPA determines the size of containers, namely CPU and Memory Request and Limit. The primary goal of VPA is to reduce maintenance costs and improve the utilization of cluster resources. When configured, it will set the Request and Limit automatically based on historical usage and thus allow proper scheduling onto nodes so that the appropriate resource amount is available for each replica. It will also maintain ratios between limits and requests that were specified in the initial container configuration. + +When using HPA and VPA together to both reduce resource usage and guarantee application performance, VPA resizes pods based on their measured resource usage, and HPA scales in/out based on the customer application performance metric, and their logic is entirely ignorant of each other. +Due to the independence of these two controllers, they can lead to an awkward situation where VPA tries to squeeze the pods into smaller sizes based on their measured utilization. +Still, HPA tries to scale out the applications to improve the customized performance metrics. +It is also [not recommended] to use HPA together with VPA for CPU or memory metrics. 
+Therefore, there is a need to combine the two controllers so that horizontal and vertical scaling decisions are made jointly for an application to achieve both objectives: resource efficiency and the application's service-level objectives (SLOs) or performance goals.
+However, existing VPA/HPA designs cannot accommodate such requirements.
+Manually fine-tuning the timing, frequency, and prioritization of vertical and horizontal scaling is usually needed to synchronize HPA and VPA.
+
+[HPA]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+[VPA]: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
+[not recommended]: https://cloud.google.com/kubernetes-engine/docs/concepts/horizontalpodautoscaler
+
+### Goals
+
+- Design and implement a holistic framework with a set of controllers to achieve multi-dimensional pod autoscaling (MPA).
+- Separate the decision actuation from recommendations for both horizontal and vertical autoscaling, which enables users to replace the default recommender with their customized recommender.
+- Re-use existing HPA and VPA libraries as much as possible in MPA.
+
+### Non-Goals
+
+- Designing new multi-dimensional pod autoscaling algorithms. Although this proposal will enable alternate recommenders, no alternate recommenders will be created as part of this proposal.
+- Rewriting functionality that is already implemented in the existing HPA and VPA.
+- Running multiple recommenders for the same MPA object. Each MPA object is supposed to use only one recommender.
+
+## Proposal
+### User Stories
+#### A New MPA Framework with Reinforcement Learning
+
+Many research studies show that combined horizontal and vertical scaling can guarantee application performance with better resource efficiency using advanced algorithms such as reinforcement learning [1, 2]. These algorithms cannot be used with existing HPA and VPA frameworks.
A new framework (MPA) is needed to combine horizontal and vertical scaling actions and separate the actuation of scaling actions from the autoscaling algorithms. The new MPA framework will work for all workloads on Kubernetes. + +[1] Haoran Qiu, Subho S. Banerjee, Saurabh Jha, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer (2020). FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2020). + +[2] Haoran Qiu, Weichao Mao, Archit Patke, Chen Wang, Hubertus Franke, Zbigniew T. Kalbarczyk, Tamer Başar, Ravishankar K. Iyer (2022). SIMPPO: A Scalable and Incremental Online Learning Framework for Serverless Resource Management. In Proceedings of the 13th ACM Symposium on Cloud Computing (SoCC 2022). + +#### Different Scaling Actions for Different Types of Resources + +For certain workloads, to ensure a custom metric (e.g., throughput or request-serving latency), horizontal scaling typically controls the CPU resources effectively, and vertical scaling is typically effective in increasing or decreasing the allocated memory capacity per pod. Thus, there is a need to control different types of resources at the same time using different scaling actions. Existing VPA and HPA can control these separately. However, they cannot achieve the same objective, e.g., guarantee a custom metric within an SLO target, by controlling both dimensions with different resource types independently. For example, they can lead to an awkward situation where HPA tries to spin more pods based on the higher-than-threshold CPU usage while VPA tries to squeeze the size of each pod based on the lower memory usage (after scaling out by HPA). In the end, there will be a large number of small pods created for the workloads. 
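The runaway effect described above can be illustrated with a toy model. This is only a sketch under loud assumptions, not how either controller is implemented: total CPU demand is held constant, the HPA step uses the replica rule quoted in the Motivation section (rounded up), and the VPA step is assumed to set each pod's request to its observed usage plus a margin. The function names, the 15% margin, and the 50% utilization target are all hypothetical, and real HPA tolerances and stabilization windows are ignored.

```python
import math


def hpa_desired_replicas(current_replicas, current_util, target_util):
    """Replica rule from the Motivation section, rounded up to whole pods."""
    return math.ceil(current_replicas * current_util / target_util)


def simulate(total_cpu_demand, replicas, request, target_util, vpa_margin, rounds):
    """Alternate independent HPA and VPA steps; record (replicas, request)."""
    history = [(replicas, request)]
    for _ in range(rounds):
        # HPA step: average utilization = usage / requested CPU across pods.
        util = total_cpu_demand / (replicas * request)
        replicas = max(1, hpa_desired_replicas(replicas, util, target_util))
        # VPA step (assumed model): request = per-pod usage plus a margin.
        request = (total_cpu_demand / replicas) * (1 + vpa_margin)
        history.append((replicas, request))
    return history


history = simulate(total_cpu_demand=4.0, replicas=2, request=4.0,
                   target_util=0.5, vpa_margin=0.15, rounds=4)
for replicas, request in history:
    print(f"replicas={replicas:3d}  cpu_request_per_pod={request:.3f}")
```

In this model the VPA step always leaves utilization above the HPA target, so each round adds pods and shrinks them further, even though the load never changes. A single recommender that sees both dimensions can avoid this feedback.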
+
+## Design Details
+
+Our proposed MPA framework consists of three controllers (i.e., a recommender, an updater, and an admission controller) and an MPA API (i.e., a CRD object or CR) that connects the autoscaling recommendations to actuation. The figure below gives an architectural overview of the proposed MPA framework.
+
+![MPA Design Overview](./kep-imgs/mpa-design.png "MPA Design Overview")
+
+**MPA API.** Application owners specify the autoscaling configurations, which include:
+
+1. whether they only want to know the recommendations from MPA or they want MPA to directly actuate the autoscaling decisions;
+2. application SLOs (e.g., in terms of latency or throughput), if any;
+3. any custom metrics, if any; and
+4. other autoscaling configurations that exist in HPA and VPA (e.g., desired resource utilizations, container update policies, min and max number of replicas).
+
+MPA API is also responsible for connecting the autoscaling actions generated by the MPA Recommender to the MPA Admission Controller and Updater, which actually execute the scaling actions. MPA API is created based on the [multidimensional Pod scaling service] (not open-sourced) provided by Google. MPA API is a Custom Resource Definition (CRD) in Kubernetes and each MPA instance is a CR. The MPA CR keeps track of recommendations on target requests and target replica numbers.
+
+[multidimensional Pod scaling service]: https://cloud.google.com/kubernetes-engine/docs/how-to/multidimensional-pod-autoscaling
+
+**Metrics APIs.** The Metrics APIs serve both default and custom metrics associated with any Kubernetes object. Custom metrics could be application latency, throughput, or any other application-specific metric.
HPA already consumes metrics from [a variety of metric APIs] (e.g., the `metrics.k8s.io` API for resource metrics provided by metrics-server, the `custom.metrics.k8s.io` API for custom metrics provided by "adapter" API servers from metrics solution vendors, and the `external.metrics.k8s.io` API for external metrics, also provided by custom metrics adapters). A popular choice for the metrics collector is Prometheus. The metrics are then used by the MPA Recommender for making autoscaling decisions.
+
+[a variety of metric APIs]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-metrics-apis
+
+**MPA Recommender.** MPA Recommender retrieves the time-indexed measurement data from the Metrics APIs and generates the vertical and horizontal scaling actions. The actions from the MPA Recommender are then written to the MPA API object. The autoscaling behavior is based on user-defined configurations. Users can implement their own recommenders as well.
+
+**MPA Updater.** MPA Updater updates the number of replicas in the deployment and evicts the pods eligible for vertical scaling.
+
+**MPA Admission Controller.** If users intend to directly execute the autoscaling recommendations generated by the MPA Recommender, the MPA Admission Controller updates the deployment configuration (i.e., the size of each replica) and configures the rolling update of the application Deployment.
+
+### Action Actuation Implementation
+
+To actuate the decisions without losing availability, we plan to:
+
+1. evict pods with min-replicas configured and update Pod sizes with the web-hooked admission controller (for vertical scaling), and
+2. add or remove replicas (for horizontal scaling).
+
+We use a web-hooked admission controller to manage vertical scaling because if the actuator directly updated the vertical scaling configurations through the deployment, it could overload etcd (as vertical scaling might be quite frequent).
+MPA Admission Controller intercepts Pod creation requests and rewrites each request by applying the recommended resources to the Pod spec.
+We do not use the web-hooked admission controller to manage horizontal scaling, as it could slow down the pod creation process.
+In the future, once [in-place vertical resizing](https://github.com/kubernetes/enhancements/issues/1287) is enabled, we can offer it as an option while keeping the web-hooked admission controller for eviction-based vertical resizing as an option as well.
+
+![MPA Action Actuation](./kep-imgs/mpa-action-actuation.png "MPA Action Actuation")
+
+Pros:
+- Vertical scaling is handled by webhooks to avoid overloading etcd
+- Horizontal scaling is handled through deployment to avoid extra overhead by webhooks
+- Authentication and authorization for vertical scaling are handled by admission webhooks
+- Recommendation and the actuation are completely separated
+
+Cons:
+- Webhooks introduce extra overhead for vertical scaling operations (can be avoided after in-place resizing of pod is enabled without eviction)
+- Vertical and horizontal scaling executions are separated (can be avoided after in-place resizing of pod is enabled without eviction)
+- State changes in pod sizes are not persisted (too much to keep in etcd, could use Prometheus to store pod state changes)
+
+### Action Recommendation Implementation
+
+To generate the vertical scaling action recommendation, we reuse VPA libraries as much as possible to implement the scaling algorithm, integrated with the newly generated MPA API code.
+To do that, we need to update the code that reads and updates VPA objects so that it interacts with MPA objects instead.
+To generate the horizontal scaling action recommendation, we reuse HPA libraries, integrated with the MPA API code, to read and update the MPA objects.
+We integrate vertical and horizontal scaling in a single feedback cycle.
+As an initial solution, vertical scaling and horizontal scaling are performed alternately (vertical scaling first).
+Vertical scaling adjusts the CPU and memory allocations based on historical usage, and horizontal scaling adjusts the number of replicas based on either CPU utilization or a custom metric.
+In the future, we can consider more complex ways of prioritization and conflict resolution.
+The separation of recommendation and actuation allows a customized recommender to replace the default recommender.
+For example, users can plug in their RL-based controller to replace the MPA recommender, receiving measurements from the Metrics Server and modifying the MPA objects directly to give recommendations.
+
+The implementation of the MPA framework (the backend) is based on the existing HPA and VPA codebase so that it requires only minimal code maintenance.
+Reused Codebase References:
+- HPA: https://github.com/kubernetes/kubernetes/tree/master/pkg/controller/podautoscaler
+- VPA: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
+
+### MPA API Object
+
+We reuse the CR definitions from the [MultidimPodAutoscaler](https://cloud.google.com/kubernetes-engine/docs/how-to/multidimensional-pod-autoscaling) object developed by Google.
+`MultidimPodAutoscaler` is the configuration for multi-dimensional Pod autoscaling, which automatically manages Pod resources and their count based on historical and real-time resource utilization.
+`MultidimPodAutoscaler` has two main fields: `spec` and `status`.
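To make the `spec`/`status` split concrete, the following is a minimal sketch of how a recommender might read user constraints from `spec` and write a combined recommendation into `status` without actuating anything itself. The class and function names (`MPASpec`, `MPAStatus`, `recommend`) are hypothetical and are not the generated MPA API types. The horizontal part assumes the HPA replica rule clamped to the min/max replicas, and the vertical part is simply passed in as an already-computed per-container target.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MPASpec:
    # User-owned configuration (loosely mirrors goals and global constraints).
    target_cpu_utilization: float
    min_replicas: int
    max_replicas: int


@dataclass
class MPAStatus:
    # Recommender-owned output, later read by the updater and admission controller.
    current_replicas: int
    desired_replicas: int = 0
    container_targets: dict = field(default_factory=dict)


def recommend(spec, status, observed_cpu_utilization, cpu_target_by_container):
    """Write one combined recommendation into status; nothing is actuated here."""
    raw = math.ceil(status.current_replicas
                    * observed_cpu_utilization / spec.target_cpu_utilization)
    # Clamp the horizontal recommendation to the user's replica bounds.
    status.desired_replicas = min(spec.max_replicas, max(spec.min_replicas, raw))
    # Store the vertical recommendation alongside it in the same object.
    status.container_targets = dict(cpu_target_by_container)
    return status


spec = MPASpec(target_cpu_utilization=0.5, min_replicas=1, max_replicas=5)
status = MPAStatus(current_replicas=3)
recommend(spec, status, observed_cpu_utilization=0.9,
          cpu_target_by_container={"app": "250m"})
print(status.desired_replicas, status.container_targets)
```

Keeping both recommendations in one status object is what lets a custom recommender be swapped in: it only needs to write the same fields, and the actuation path stays unchanged.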
+
+#### MPA Object
+
+```
+apiVersion: autoscaling.gke.io/v1beta1
+kind: MultidimPodAutoscaler
+metadata:
+  name: my-autoscaler
+# MultidimPodAutoscalerSpec
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: my-target
+  policy:
+    updateMode: Auto
+  goals:
+    metrics:
+    - type: Resource
+      resource:
+        # Define the target CPU utilization request here
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: target-cpu-util
+  constraints:
+    global:
+      minReplicas: min-num-replicas
+      maxReplicas: max-num-replicas
+    containerControlledResources: [ memory, cpu ] # Added cpu here as well
+    container:
+    - name: '*' # either a literal name, or "*" to match all containers
+                # this is not a general wildcard match
+      # Define boundaries for the memory request here
+      requests:
+        minAllowed:
+          memory: min-allowed-memory
+        maxAllowed:
+          memory: max-allowed-memory
+  # Define the recommender to use here
+  recommenders:
+  - name: my-recommender
+
+# MultidimPodAutoscalerStatus
+status:
+  lastScaleTime: timestamp
+  currentReplicas: number-of-replicas
+  desiredReplicas: number-of-recommended-replicas
+  recommendation:
+    containerRecommendations:
+    - containerName: name
+      lowerBound: lower-bound
+      target: target-value
+      upperBound: upper-bound
+  conditions:
+  - lastTransitionTime: timestamp
+    message: message
+    reason: reason
+    status: status
+    type: condition-type
+  currentMetrics:
+  - type: metric-type
+    value: metric-value
+```
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+#### Unit Tests
+
+Unit tests are located in each controller package.
+
+#### Integration Tests
+
+Integration tests are to be added in the beta version.
+
+#### End-to-End Tests
+
+End-to-end tests are to be added in the beta version.
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+#### How can this feature be enabled / disabled in a live cluster?
+
+MPA can be enabled by checking the prerequisites and executing `./deploy/mpa-up.sh`.
+
+#### Does enabling the feature change any default behavior?
+
+No.
+
+#### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+MPA can be disabled by executing `./deploy/mpa-down.sh`.
+
+#### What happens if we reenable the feature if it was previously rolled back?
+
+There is no impact, because every time MPA is enabled it performs a full reset and restart of MPA.
+
+#### Are there any tests for feature enablement/disablement?
+
+End-to-end tests of MPA will be included in the beta version.
+
+### Dependencies
+
+#### Does this feature depend on any specific services running in the cluster?
+
+MPA relies on the cluster-level `metrics.k8s.io` API (for example, from [metrics-server](https://github.com/kubernetes-sigs/metrics-server)).
+For the evict-and-replace mechanism, the API server needs to support the MutatingAdmissionWebhook API.
+
+### Scalability
+
+#### Will enabling / using this feature result in any new API calls?
+No, replacing HPA/VPA with MPA only changes how recommendations are generated (separation of recommendation from actuation).
+The original API calls used by HPA/VPA are reused by MPA and no new API calls are used by MPA.
+
+#### Will enabling / using this feature result in introducing new API types?
+Yes, MPA introduces a new Custom Resource `MultidimPodAutoscaler`, similar to `VerticalPodAutoscaler`.
+
+#### Will enabling / using this feature result in any new calls to the cloud provider?
+No.
+
+#### Will enabling / using this feature result in increasing size or count of the existing API objects?
+No. It will not affect any existing API objects.
+
+#### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+No. To the best of our knowledge, it will not increase the time taken by any operations covered by [existing SLIs/SLOs](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md).
+
+#### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+No.
+
+#### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+No.
+
+### Troubleshooting
+
+#### How does this feature react if the API server and/or etcd is unavailable?
+
+#### What are other known failure modes?
+
+#### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Alternatives
+
+### MPA as a Recommender Only
+
+An alternative option is to have MPA act only as a recommender.
+For VPA, which already supports customized recommenders, MPA can be implemented as a recommender that writes to a VPA object. The VPA updater and admission controller will then actuate the recommendation.
+For HPA, additional support for alternative recommenders is needed so that MPA can write scaling recommendations to the HPA object as well.
+
+- Pros:
+  - Less work and easier maintenance in the future
+  - Simple, especially when vertical and horizontal scaling are two completely independent control loops
+- Cons:
+  - Additional support from HPA (enabling customized recommenders) is needed, which requires changes in upstream Kubernetes
+  - Hard to coordinate/synchronize when horizontal and vertical scaling states and decisions are kept in different places (i.e., HPA and VPA objects)
+
+### Google GKE's Approach of MPA
+
+In this [alternative approach](https://cloud.google.com/kubernetes-engine/docs/how-to/multidimensional-pod-autoscaling) (non-open-sourced), a `MultidimPodAutoscaler` object modifies memory and/or CPU requests and adds replicas so that the average utilization of each replica matches your target utilization.
+The MPA object is translated into VPA and HPA objects, so in the end two *independent* controllers manage the vertical and horizontal scaling of the application deployment.
diff --git a/multidimensional-pod-autoscaler/kep-imgs/mpa-action-actuation.png b/multidimensional-pod-autoscaler/kep-imgs/mpa-action-actuation.png
new file mode 100644
index 000000000000..95e2d687b6aa
Binary files /dev/null and b/multidimensional-pod-autoscaler/kep-imgs/mpa-action-actuation.png differ
diff --git a/multidimensional-pod-autoscaler/kep-imgs/mpa-design.png b/multidimensional-pod-autoscaler/kep-imgs/mpa-design.png
new file mode 100644
index 000000000000..2090f97b5dd3
Binary files /dev/null and b/multidimensional-pod-autoscaler/kep-imgs/mpa-design.png differ