diff --git a/tools/stress-cluster/cluster/README.md b/tools/stress-cluster/cluster/README.md index 86afd4d2a48..deaf9b0af34 100644 --- a/tools/stress-cluster/cluster/README.md +++ b/tools/stress-cluster/cluster/README.md @@ -1,162 +1,149 @@ -This directory contains [Azure Bicep](https://docs.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview) +Table of Contents + +* [Layout](#layout) +* [Dependencies](#dependencies) +* [Deploying Cluster(s)](#deploying-clusters) + * [Dev Cluster](#dev-cluster) + * [Test Cluster](#test-cluster) + * [Prod Cluster](#prod-cluster) + * [Local Cluster](#local-cluster) +* [Development](#development) + * [Bicep templates](#bicep-templates) + * [Helm templates](#helm-templates) + + +# Layout + +This directory contains all configuration used for stress test cluster buildout (azure and kubernetes buildout), as well +as a set of common stress test config boilerplate (helm library). + +The `./azure` directory contains [Azure Bicep](https://docs.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview) files for deploying Azure resources (mainly [AKS clusters](https://azure.microsoft.com/en-us/services/kubernetes-service/) to support stress testing (for dev/test and/or production). Azure Bicep comes pre-installed with the Azure CLI, and is a DSL for generating ARM templates. +The `./kubernetes/stress-infrastructure` directory contains a helm chart for deploying the core services +that must be installed into any stress cluster: chaos-mesh (for chaos) and stress-watcher (for event handling like chaos +resource start and resource group cleanup). + +The `./kubernetes/stress-test-addons` directory contains a [library chart](https://helm.sh/docs/topics/library_charts/) +for use by stress test packages. This common set of config boilerplate simplifies stress test authoring, and makes it +easier to make and roll out config changes to tests across repos by using helm chart dependency versioning. 
+ + # Dependencies +- [Powershell Core](https://docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-linux?view=powershell-7.1#ubuntu-2004) (if using Linux) - [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) - - If using app insights, install the az extension: `az extension add --name application-insights` -- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (if accessing clusters) -- [helm](https://helm.sh) (if installing stress infrastructure) +- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) +- [helm](https://helm.sh) - [kind](https://github.com/kubernetes-sigs/kind/releases) (if testing locally) - [Docker](https://docs.docker.com/get-docker/) (if deploying/testing locally) -# Cluster Deployment Quick Start -## Deploying a Dev Cluster +# Deploying Cluster(s) -First, update the `./azure/parameters/dev.json` parameters file with the values marked `// add me`, then: +The cluster-specific configurations can be found at `./azure/parameters/.json`. -``` -az deployment sub create -o json -n -l westus -f ./azure/main.bicep --parameters ./azure/parameters/dev.json +Almost all stress test infrastructure is local to the cluster resource group, including storage accounts, keyvaults, +log workspaces and the AKS cluster. There is also a set of static resources, including a subscription service principal +and a keyvault containing the credential configuration. These are shared across clusters located in the same subscription +and are provisioned independently of the bicep templates. -# wait until resource group and AKS cluster are deployed -az aks get-credentials stress-azuresdk -g rg-stress-test-cluster- -``` +Cluster buildout and deployment involves three main steps which are automated in `./provision.ps1`: -## Deploying a Local Cluster +1. Provision static resources (service principal, role assignments, static keyvault). +1. 
Provision cluster resources (`main.bicep` entrypoint, standard ARM subscription deployment). +1. Provision stress infrastructure resources into the Azure Kubernetes Service cluster via helm + (`./kubernetes/stress-infrastructure` helm chart). -NOTE: Chaos-Mesh may not work on all local deployments (e.g. Docker Desktop on Windows via WSL). -It may be easier to test services, manifests and containers locally with KIND, and test chaos -in an Azure AKS cluster (shared or personal). +## Dev Cluster -``` -# Ensure docker is running -kind create cluster -``` - -## Deploying Stress Infrastructure into Cluster +First, update the `./azure/parameters/dev.json` parameters file with the values marked `// add me`, then run: ``` -helm repo add chaos-mesh https://charts.chaos-mesh.org -helm dependency update ./kubernetes/stress-infrastructure -helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure +./provision.ps1 -env dev ``` +## Test Cluster -# Development - -Examples detailing the Azure Bicep DSL can be found [here](https://github.com/Azure/bicep/tree/main/docs/examples). +The test cluster is the main ad-hoc cluster made available to SDK developers and partners. Changes to this cluster +should be made carefully and announced in advance in order not to disrupt people's work. -Bicep also has a [VSCode extension](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-bicep). 
- -To validate file changes/compilation: - -``` -az bicep build -f ./azure/main.bicep ``` - -To deploy and access resources: - +./provision.ps1 -env test ``` -# Edit ./azure/parameters/dev.json, replacing // add me values -# Add -c to dry run changes with a chance to confirm -az deployment sub create -o json -n -l westus -f ./azure/main.bicep --parameters ./azure/parameters/dev.json -# Copy the relevant outputs from the deployment to ./kubernetes/environments/ -# for deploying stress tests later on -az deployment sub show -o json -n --query properties.outputs +## Prod Cluster -az aks list -g rg-stress-test-cluster- -az aks get-credentials stress-test -g rg-stress-test-cluster- +The "prod" cluster is the main cluster used for auto-deployment of checked-in stress tests via the StressTestRelease pipeline. +Currently, new instances of all stress tests across the language repositories are deployed on a weekly cadence. +Changes to the prod cluster should ideally be made around the stress test deployment cycle so as to avoid disruption +of test metrics. -# Verify cluster access -kubectl get pods - -# Install stress infrastructure components -helm repo add chaos-mesh https://charts.chaos-mesh.org -helm dependency update ./kubernetes/stress-infrastructure -helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure -kubectl get pods --namespace stress-infra ``` - -To access the chaos-mesh dashboard, run the below command then navigate to `localhost:2333` in the browser: - -``` -kubectl port-forward -n stress-infra svc/chaos-dashboard 2333:2333 +./provision.ps1 -env prod ``` -To remove AKS cluster stress testing resources: +## Local Cluster -``` -helm uninstall stress-infra --namespace stress-infra -``` +For quick testing of various kubernetes configurations, it can be faster and cheaper to use a local cluster. +Not all components of stress testing work in local clusters, however. 
If testing these components is necessary, the +recommended action is to spin up a dev cluster. -To remove Azure resources: +NOTE: Chaos-Mesh may not work on all local deployments (e.g. Docker Desktop on Windows via WSL). +It may be easier to test services, manifests and containers locally with KIND, and test chaos +in an Azure AKS cluster (shared or personal). ``` -az group delete -az keyvault purge -n +# Ensure docker is running +kind create cluster ``` -# Building out the Main/Prod Testing Cluster - -If not already done, enable the relevant preview features in the subscription and CLI: -- [AKS-AzureKeyVaultSecretsProvider](https://docs.microsoft.com/en-us/azure/aks/csi-secrets-store-driver#register-the-aks-azurekeyvaultsecretsprovider-preview-feature) -## Initializing static identities - -The "official" stress testing clusters rely on a separately created keyvault containing secrets with subscription credentials for stress test resource deployments. -The identities/credentials in these keyvaults can't be created via ARM/Bicep, and should be managed independently of the individual environments. +# Development -To initialize these resources, if they don't exist: +## Bicep templates -``` -az group create rg-StressTestSecrets -az keyvault create -n StressTestSecrets -g rg-StressTestSecrets -az ad sp create-for-rbac -n 'stress-test-provisioner' --role Contributor --scopes '/subscriptions/' -``` - -Create an env file with the service principal values created above: +Examples detailing the Azure Bicep DSL can be found [here](https://github.com/Azure/bicep/tree/main/docs/examples). -``` -AZURE_CLIENT_OID= -AZURE_CLIENT_ID= -AZURE_CLIENT_SECRET= -AZURE_TENANT_ID= -AZURE_SUBSCRIPTION_ID= -``` +Bicep also has a [VSCode extension](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-bicep). 
-Upload it to the static keyvault: +To validate file changes/compilation: ``` -az keyvault secret set --vault-name StressTestSecrets -f ./ -n public +az bicep build -f ./azure/main.bicep ``` -## Building Out Stress Test Cluster Resources +## Helm templates -Various environment configurations are located in `./azure/parameters/.json` to be configured when deploying. +When making changes to `stress-test-addons`, it is easiest to validate them by building one of the [example projects +](https://github.com/Azure/azure-sdk-tools/tree/main/tools/stress-cluster/chaos/examples). -Deploy the cluster and related components (app insights, container registry, keyvault, access policies, etc.) +First, update the `dependencies` section of the example's `Chart.yaml` file to point to your local changes on disk: ``` -az deployment sub create -o json -n stress-test-deploy -l westus -f ./azure/main.bicep --parameters ./azure/parameters/test.json +dependencies: +- name: stress-test-addons + version: + repository: https://stresstestcharts.blob.core.windows.net/helm/ + repository: file:////tools/stress-cluster/cluster/kubernetes/stress-test-addons ``` -Gain access to the cluster and install the stress infrastructure components: +Then you can test out the template changes by running, in the example stress test package directory: ``` -az aks get-credentials stress-test -g rg-stress-test-cluster- - -helm repo add chaos-mesh https://charts.chaos-mesh.org -helm dependency update ./kubernetes/stress-infrastructure -helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure +helm template testrelease . ``` -Update the values in `./kubernetes/stress-test-addons/values.yaml` to match the deployment outputs and check in the changes. +If there are any issues, the helm command will print any errors. 
If there are no errors, the rendered yaml +may still be an invalid kubernetes manifest, so the example stress test should also be deployed to validate +the full set of changes: ``` -az deployment sub show -o json -n --query properties.outputs +# -Login only needs to be run once or if the azure container registry credentials have expired (~24 hours) +/eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -Login ``` + +For more helm debugging info, see [here](https://helm.sh/docs/chart_template_guide/debugging/).