From e17e8d0e06e1994f0e2db406e3683a1dd70e3edd Mon Sep 17 00:00:00 2001 From: Kensei Nakada Date: Mon, 18 Dec 2023 16:00:03 +0900 Subject: [PATCH] improve documentations for end users --- README.md | 57 +++++--- api/v1beta3/tortoise_types.go | 2 +- docs/{configuration.md => admin-guide.md} | 8 +- docs/concept.md | 50 ------- docs/emergency.md | 9 +- docs/horizontal.md | 21 ++- docs/user-guide.md | 155 ++++++++++++++++++++++ 7 files changed, 215 insertions(+), 87 deletions(-) rename docs/{configuration.md => admin-guide.md} (98%) delete mode 100644 docs/concept.md create mode 100644 docs/user-guide.md diff --git a/README.md b/README.md index 3b0a9e53..ca31ea38 100644 --- a/README.md +++ b/README.md @@ -1,20 +1,12 @@ -# tortoise +# Tortoise Tortoise -Tortoise, they are living in the Kubernetes cluster. - -Tortoise, you need to feed only very few parameters to them. - -Tortoise, they will soon start to eat historical usage data of Pods. - -Tortoise, once you start to live with them, you no longer need to configure autoscaling by yourself. +Get a cute Tortoise into your Kubernetes garden and say goodbye to the days of optimizing your rigid autoscalers. ## Install -Tortoise, you cannot get it from the breeder. - -Tortoise, you need to get it from GitHub instead. +You cannot get it from the breeder; you need to get it from GitHub instead. ```shell # Install CRDs into the K8s cluster specified in ~/.kube/config. @@ -23,41 +15,64 @@ make install make deploy ``` -Tortoise, you don't need a rearing cage, but need VPA in your Kubernetes cluster before installing it. +You don't need a rearing cage, but you do need VPA in your Kubernetes cluster before installing it. + +## Motivation + +Many developers work at Mercari, and not all of them are Kubernetes experts. 
+The platform has many tools and guides to simplify the task of optimizing resource requests, +but it still takes a lot of human effort, because the circumstances around the applications change frequently and we have to re-optimize them every time. +(e.g., an implementation change can alter the resource consumption, the amount of traffic can change, etc.) + +Also, there is another important component to optimize: HorizontalPodAutoscaler. +It’s not as simple as setting the target utilization as high as possible; +there are many scenarios where the actual resource utilization doesn’t reach the target resource utilization +(because of multiple containers, minReplicas, the container’s size, etc.). + +To reduce the human effort needed to keep optimizing the workloads, +the platform team introduced Tortoise, which is designed to simplify the interface of autoscaling. + +It aims to move the responsibility of optimizing the workloads from the application teams to tortoises. +Application teams just need to set up Tortoise, and the platform team will never bother them again about resource optimization: +all actual optimization is done by Tortoise automatically. ## Usage -Tortoise, they only need the deployment name. +Tortoise has a very simple interface: ```yaml -apiVersion: autoscaling.mercari.com/v1beta2 +apiVersion: autoscaling.mercari.com/v1beta3 kind: Tortoise metadata: name: lovely-tortoise namespace: zoo spec: - updateMode: Auto + updateMode: Auto targetRefs: scaleTargetRef: kind: Deployment name: sample ``` -Tortoise, then they'll prepare/keep adjusting HPA and VPA to achieve efficient autoscaling based on the past behavior of the workload. +Yet beneath its unassuming shell lies a wealth of historical resource usage data, cunningly harnessed +to deftly orchestrate HPA and VPA with finely tuned parameters. + +Please refer to the [User guide](./docs/user-guide.md) for other parameters. 
## Documentations -- [Concept](./docs/concept.md): describes a brief overview of tortoise. -- [Horizontal scaling](./docs/horizontal.md): describes how the Tortoise does the horizontal autoscaling. -- [Vertical scaling](./docs/vertical.md): describes how the Tortoise does the vertical autoscaling. +- [User guide](./docs/user-guide.md): describes the minimum knowledge that end users need, +and how they can configure Tortoise to autoscale their workloads. +- [Admin guide](./docs/admin-guide.md): describes how the cluster admin can configure the global behavior of Tortoise. - [Emergency mode](./docs/emergency.md): describes the emergency mode. -- [Configurations for admin](./docs/configuration.md): describes how the cluster admin can configure the global behavior via the configuration file. +- [Horizontal scaling](./docs/horizontal.md): describes how Tortoise does horizontal autoscaling internally. +- [Vertical scaling](./docs/vertical.md): describes how Tortoise does vertical autoscaling internally. - [Technically details](./docs/internal.md): describes the technically details of Tortoise. (mostly for the contributors) - [Contributor guide](./docs/contributor-guide.md): describes other stuff for the contributor. (testing etc) ## API definition -- [Tortoise](./api/v1beta2/tortoise_types.go) +- [Tortoise](./api/v1beta3/tortoise_types.go) ## Contribution diff --git a/api/v1beta3/tortoise_types.go b/api/v1beta3/tortoise_types.go index 5fa4b0d6..f16f0571 100644 --- a/api/v1beta3/tortoise_types.go +++ b/api/v1beta3/tortoise_types.go @@ -141,7 +141,7 @@ type TargetRefs struct { // HorizontalPodAutoscalerName is the name of the target HPA. // The target of this HPA should be the same as the ScaleTargetRef above. // The target HPA should have the ContainerResource type metric that refers to the container resource utilization. 
- // Please check out the document for more detail: https://github.com/mercari/tortoise/blob/master/docs/horizontal.md#supported-metrics-in-hpa + // Please check out the document for more detail: https://github.com/mercari/tortoise/blob/master/docs/horizontal.md#attach-your-hpa // Also, note that you must not edit the HPA directly after you attach the HPA to the tortoise of Auto mode. // Even if you edit your HPA in that case, tortoise will overwrite the HPA with the metrics/values. // diff --git a/docs/configuration.md b/docs/admin-guide.md similarity index 98% rename from docs/configuration.md rename to docs/admin-guide.md index 5216545c..ffbee339 100644 --- a/docs/configuration.md +++ b/docs/admin-guide.md @@ -1,9 +1,11 @@ -## Configuration for admin +## Admin guide Tortoise -The cluster admin can set the global configurations via the configuration file. -The configuration file is passed via `--config` flag. +Tortoise exposes many flags to configure the behavior of tortoises in the cluster. + +The cluster admin can set the global configurations via the configuration file, +which is passed via the `--config` flag. ``` RangeOfMinMaxReplicasRecommendationHours: The time (hours) range of minReplicas and maxReplicas recommendation (default: 1) 
-- use historical resource usage of target workloads to calculate the recommended values on parameters while ensuring the safety. -- expose only few configurations to users. - -### General design - -We only allow users to configure: -- The way to do autoscaling (vertical or horizontal) for each container. - - In most cases, it should be OK to leave this configuration empty. Tortoise will use `Horizontal` for CPU and `Vertical` for memory. -- The minimum amount of resources given to each container. (optional) - - In most cases, it should be OK to leave this configuration empty as well. Tortoise will ensure safety of the resource reduction based on the values suggested by VPA. - - But, the application developers may want to increase the resource request before they bring something big to workloads which will affect the resource usage very much. - -But, for the cluster admin, we allow some global configurations -so that the cluster admin can make Tortoises fit their general workloads characteristic. - -See [Flag configurations for admin](./flag-configuration.md). - -### How do workloads exactly get scaled? - -See each document: -- [Horizontal scaling](./horizontal.md) -- [Vertical scaling](./vertical.md) - -### Emergency mode - -We also have the concept "emergency mode" in Tortoise, -which can be used when the workloads need to get scaled up in an unusual case. - -See the document for more detail: [The emergency mode](./emergency.md) - -## Side Notes - -It's implemented based on our experience in mercari.com - -- Our workloads are mostly Golang HTTP/GRPC server. -- Our workloads mostly get traffic from people in the same timezone, and the demand of resources is usually very similar to the same time one week ago. - -Depending on how your workloads look like, tortoise may or may not fit your workloads. 
diff --git a/docs/emergency.md b/docs/emergency.md index 93e0c8f3..98cb2d2a 100644 --- a/docs/emergency.md +++ b/docs/emergency.md @@ -11,17 +11,16 @@ you can turn on the emergency mode by setting `Emergency` on `.spec.UpdateMode` ### How emergency mode works -When emergency mode is enabled, tortoise increases the `minReplicas` to the same value as `maxReplicas`. +When emergency mode is enabled, tortoise increases the `minReplicas` of the HPA to the same value as `maxReplicas`. As described in [Horizontal scaling](./horizontal.md), `maxReplicas` gets changed to be fairly higher value every hour. So, during emergency mode, the replicas will be kept fairly high value calculated from the past behavior for the safety. -### turning emergency mode off +### Turn off emergency mode Also, for the safety, after reverting `UpdateMode` from `Emergency` to `Auto`, - Tortoise tries to reduce the number of replicas to the original value gradually. -(A sudden decrease is mostly dangerous.) +(A sudden decrease in the number of replicas is often dangerous.) Specifically, the controller reduces `minReplicas` to the original value gradually by the following formula in one reconciliation: @@ -33,5 +32,5 @@ During gradually reducing the `minReplicas`, the Tortoise is in the `BackToNorma ### Note -Emergency mode is available for tortoises with `Running` or `BackToNormal` phase. +Emergency mode is only available for tortoises in the `Running` or `BackToNormal` phase. (because it requires enough historical data to work on) diff --git a/docs/horizontal.md b/docs/horizontal.md index c98c0032..1650ce3e 100644 --- a/docs/horizontal.md +++ b/docs/horizontal.md @@ -7,7 +7,18 @@ by setting `Horizontal` in `Spec.ResourcePolicy[*].AutoscalingPolicy` For `Horizontal` resources, Tortoise keeps changing the corresponding HPA's fields with the recommendation value calculated from the historical usage. -Let's get into detail how each field gets changed. 
+### Configure Horizontal scaling + +#### Attach your HPA + +You can attach your HPA via `.spec.targetRefs.HorizontalPodAutoscalerName`. + +Currently, Tortoise supports only the `type: ContainerResource` metric. + +If the HPA has `type: Resource` metrics, Tortoise removes them because they would conflict with the `type: ContainerResource` metrics managed by Tortoise. +If the HPA has metrics other than `Resource` or `ContainerResource`, Tortoise keeps them as they are. + +### How Tortoise ### MaxReplicas @@ -21,7 +32,7 @@ max{replica numbers at the same time on the same day of week} * MaxReplicasFacto max{replica numbers at the same time} * MaxReplicasFactor ``` -(refer to [configuration.md](./configuration.md) about each parameter) +(refer to [admin-guide.md](./admin-guide.md) for details on each parameter) It only takes the num of replicas of the last 4 weeks into consideration. @@ -37,7 +48,7 @@ max{replica numbers at the same time on the same day of week} * MinReplicasFacto max{replica numbers at the same time} * MinReplicasFactor ``` -(refer to [configuration.md](./configuration.md) about each parameter) +(refer to [admin-guide.md](./admin-guide.md) for details on each parameter) It only takes the num of replicas of the last 4 weeks into consideration. @@ -72,10 +83,6 @@ Looking back the above formula, - make all container's resource utilization below 100%. - Thus, finally `100 - (max{recommended resource usage from VPA}/{current resource request} - {current target utilization})` means the target utilization which only give the bare minimum additional resources. -#### Supported metrics in HPA - -Currently, Tortoise supports only `type: ContainerResource` metric. 
- ### The container right sizing Although it says "Horizontal", diff --git a/docs/user-guide.md b/docs/user-guide.md new file mode 100644 index 00000000..e268ec68 --- /dev/null +++ b/docs/user-guide.md @@ -0,0 +1,155 @@ +## User guide + +Tortoise + +This page describes the minimum knowledge that end users need, +and how they can configure Tortoise to autoscale their workloads. + +### How tortoise works + +Tortoise itself doesn't directly change your Pod's resource requests or the number of replicas. +It has HorizontalPodAutoscaler and VerticalPodAutoscaler under the hood, +and your tortoise keeps updating them so that they stay well optimized based on your workload's historical resource usage. + +### Configuration overview + +Tortoise is designed to have a very simple configuration: + +```yaml +apiVersion: autoscaling.mercari.com/v1beta3 +kind: Tortoise +metadata: + name: lovely-tortoise + namespace: zoo +spec: + updateMode: Auto # enable autoscaling. + targetRefs: # which workload this tortoise autoscales. + scaleTargetRef: + kind: Deployment + name: sample +``` + +This is an example of the minimum required configuration. + +### updateMode + +```yaml +apiVersion: autoscaling.mercari.com/v1beta3 +kind: Tortoise +spec: +... + updateMode: Auto +``` + +`.spec.updateMode` accepts one of three values: +- `Off` (default): DryRun mode. The tortoise doesn't change anything in your workload or autoscaler. +- `Auto`: The tortoise keeps updating your workload or autoscaler to be optimized. +- `Emergency`: The tortoise scales up/out your workload so that it is big enough to handle unexpectedly high traffic. + +#### updateMode: `Off` + +`Off` is the default value of `updateMode`. +It means a DryRun mode: the tortoise doesn't change anything in your workload or autoscaler. + +But even in `Off` mode, the tortoise still generates recommendations for your workload's resource requests and your HPA's target utilization. 
+ +You can observe the recommendation values with these metrics: +- `mercari.tortoise.proposed_cpu_request`: CPU request that a tortoise proposes. +- `mercari.tortoise.proposed_memory_request`: memory request that a tortoise proposes. +- `mercari.tortoise.proposed_hpa_minreplicas`: HPA `.spec.minReplicas` that a tortoise proposes. +- `mercari.tortoise.proposed_hpa_maxreplicas`: HPA `.spec.maxReplicas` that a tortoise proposes. +- `mercari.tortoise.proposed_hpa_utilization_target`: HPA `.spec.metrics[*].containerResource.target.averageUtilization` that a tortoise proposes. + +#### updateMode: `Auto` + +`Auto` is an update mode that lets the tortoise keep updating your workload or autoscaler to be optimized. + +#### updateMode: `Emergency` + +`Emergency` is an update mode that enables the emergency mode. +Please refer to [Emergency mode](./emergency.md) for more details. + +### `.spec.AutoscalingPolicy` + +There are two primary options for configuring resource scaling within containers: +1. Allow Tortoise to automatically determine the appropriate autoscaling policy for each resource. +2. Manually define the autoscaling policy for each resource. + +The AutoscalingPolicy field is mutable; you can modify it at any time, whether from an empty state to populated or vice versa. + +#### 1. Allow Tortoise to automatically determine the appropriate autoscaling policy for each resource + +To do this, you simply leave `.spec.AutoscalingPolicy` unset. + +In this case, Tortoise will adjust the autoscaling policies using the following logic: +- If `.spec.TargetRefs.HorizontalPodAutoscalerName` is not provided, the policies default to "Horizontal" for CPU and "Vertical" for memory across all containers. +- If `.spec.TargetRefs.HorizontalPodAutoscalerName` is specified, resources governed by the referenced Horizontal Pod Autoscaler will use a "Horizontal" policy, +while those not managed by the HPA will use a "Vertical" policy. 
+Note that Tortoise supports only the `ContainerResource` metric type for HPAs; other metric types will be disregarded. +Additionally, if a `ContainerResource` metric is later added to an HPA associated with Tortoise, +Tortoise will automatically update relevant resources to utilize a `Horizontal` policy in AutoscalingPolicy. + +#### 2. Manually define the autoscaling policy for each resource + +With the second option, you must manually specify the AutoscalingPolicy for the resources of each container within this field. + +```yaml +apiVersion: autoscaling.mercari.com/v1beta3 +kind: Tortoise +spec: +... + autoscalingPolicy: + - containerName: istio-proxy + policy: + cpu: Horizontal + memory: Vertical + - containerName: app + policy: + cpu: Horizontal + memory: Vertical +``` + +AutoscalingPolicy is an optional field for specifying the scaling approach for each resource within each container. +- `Horizontal`: Tortoise increases the number of replicas when the resource utilization goes up. +- `Vertical`: Tortoise scales up the resources given to the container when the resource utilization goes up. +- `Off` (default): Tortoise doesn't look at the resource of the container at all. + +If policies are defined for some but not all containers or resources, Tortoise will assign a default `Off` policy to unspecified resources. +Be aware that when new containers are introduced to the workload, the AutoscalingPolicy configuration must be manually updated +if you want to configure autoscaling for a new container, +as Tortoise will default to an `Off` policy for resources within the new container, preventing scaling. + +### `.spec.DeletionPolicy` + +```yaml +apiVersion: autoscaling.mercari.com/v1beta3 +kind: Tortoise +spec: +... + deletionPolicy: "DeleteAll" +``` + +DeletionPolicy specifies how the controller handles the associated HPA and VPAs when the tortoise is removed. + +- `DeleteAll`: tortoise deletes all associated HPA and VPAs that were created by tortoise. 
+However, if the associated HPA was not created by tortoise but attached via `spec.targetRefs.horizontalPodAutoscalerName`, +tortoise doesn't delete the HPA even with `DeleteAll`. +- `NoDelete` (default): tortoise doesn't delete any associated HPA or VPAs. + +### `.spec.ResourcePolicy` + +```yaml +apiVersion: autoscaling.mercari.com/v1beta3 +kind: Tortoise +spec: +... + resourcePolicy: + - containerName: istio-proxy + minAllocatedResources: + cpu: "4" +``` + +ResourcePolicy contains the policy for how each resource is updated. +It currently contains only `minAllocatedResources`, which indicates the minimum amount of resources given to the container. +For example, if `minAllocatedResources` is configured as in the example above, Tortoise won't set CPU below `4` in the `istio-proxy` container, +even if the autoscaling policy for `istio-proxy` CPU is `Vertical` and VPA suggests a CPU value smaller than `4`.
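
The `minAllocatedResources` rule in the last section of the user guide can be read as a lower bound on whatever VPA recommends. The following is a minimal Go sketch of that reading, not Tortoise's actual implementation. The helper name `applyMinAllocated` is hypothetical, and it uses plain millicore integers instead of Kubernetes resource quantities to stay self-contained.

```go
package main

import "fmt"

// applyMinAllocated is a hypothetical helper (not part of Tortoise's API).
// It returns the CPU value, in millicores, that would be applied to a
// container: the VPA recommendation, raised to the configured minimum when
// the recommendation falls below it.
func applyMinAllocated(recommendedMilli, minAllocatedMilli int64) int64 {
	if recommendedMilli < minAllocatedMilli {
		return minAllocatedMilli
	}
	return recommendedMilli
}

func main() {
	// cpu: "4" from the resourcePolicy example, expressed in millicores.
	const minCPU int64 = 4000

	// VPA suggests less than the minimum, so the minimum wins.
	fmt.Println(applyMinAllocated(2500, minCPU)) // prints 4000

	// VPA suggests more than the minimum, so the recommendation is kept.
	fmt.Println(applyMinAllocated(6000, minCPU)) // prints 6000
}
```

The sketch only covers the clamping step. How Tortoise combines this bound with the `Horizontal` or `Vertical` policy of each resource is described in the horizontal and vertical scaling documents.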