Fully automate stress cluster buildout and add support for azure file share mounting #2106

Merged on Oct 22, 2021 (10 commits)
4 changes: 2 additions & 2 deletions eng/common/scripts/stress-testing/deploy-stress-tests.ps1
@@ -76,9 +76,9 @@ function DeployStressTests(
[string]$environment = 'test',
[string]$repository = 'images',
[boolean]$pushImages = $false,
-[string]$clusterGroup = 'rg-stress-test-cluster-',
+[string]$clusterGroup = 'rg-stress-cluster-test',
[string]$deployId = 'local',
-[string]$subscription = 'Azure SDK Test Resources'
+[string]$subscription = 'Azure SDK Developer Playground'
) {
if ($PSCmdlet.ParameterSetName -eq 'DoLogin') {
Login $subscription $clusterGroup $pushImages
21 changes: 20 additions & 1 deletion eng/containers/ci.yml
@@ -22,6 +22,12 @@ parameters:
dockerFile: 'tools/test-proxy/docker/dockerfile-win'
stableTags:
- 'latest'
- name: stress_watcher
pool: 'ubuntu-20.04'
dockerRepo: 'stress/watcher'
dockerFile: 'tools/stress-cluster/services/Stress.Watcher/Dockerfile'
stableTags:
- 'latest'

trigger:
branches:
@@ -32,8 +38,18 @@ trigger:
- eng/containers/
- tools/test-proxy/docker/
- tools/keyvault-mock-attestation/Dockerfile
- tools/stress-cluster/services/Stress.Watcher/Dockerfile

pr: none
pr:
branches:
include:
- main
paths:
include:
- eng/containers/
- tools/test-proxy/docker/
- tools/keyvault-mock-attestation/Dockerfile
- tools/stress-cluster/services/Stress.Watcher/Dockerfile

variables:
- name: containerRegistry
@@ -64,6 +80,7 @@ jobs:

- task: Docker@2
displayName: Push ${{ config.name }}:$(imageTag)
condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))
inputs:
containerRegistry: $(containerRegistry)
repository: ${{ config.dockerRepo }}
@@ -81,6 +98,8 @@ jobs:

- task: Docker@2
displayName: Push ${{ config.name }}:${{ stableTag }}
condition: and(succeeded(), ne(variables['Build.Reason'], 'PullRequest'))

inputs:
containerRegistry: $(containerRegistry)
repository: ${{ config.dockerRepo }}
@@ -0,0 +1,14 @@
apiVersion: v2
name: debug-share-example
description: An example stress test chart that uses a file share for debugging (e.g. for large log files, heap dumps)
version: 0.1.1
appVersion: v0.1
annotations:
stressTest: 'true' # enable auto-discovery of this test via `find-all-stress-packages.ps1`
example: 'true' # enable auto-discovery filtering `find-all-stress-packages.ps1 -filters @{example='true'}`
namespace: 'examples'

dependencies:
- name: stress-test-addons
version: 0.1.9
repository: https://stresstestcharts.blob.core.windows.net/helm/
@@ -0,0 +1,21 @@
{{- include "stress-test-addons.env-job-template.from-pod" (list . "stress.deploy-example") -}}
{{- define "stress.deploy-example" -}}
metadata:
labels:
testName: "debug-share-example"
spec:
containers:
- name: debug-share-example
image: busybox
command: ['sh', '-c']
args:
- |
cd $DEBUG_SHARE;
pwd;
mkdir example;
echo "debug share example success" > example/success;
ls; ls example; cat example/success;
# The file share is mounted by default at the path $DEBUG_SHARE
# when including the container-env template
{{- include "stress-test-addons.container-env" . | nindent 6 }}
{{- end -}}
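
The container script in this template can be tried outside the cluster. The following is a minimal sketch, assuming `DEBUG_SHARE` is pointed at a local scratch directory; in the cluster the variable is set by the `stress-test-addons.container-env` template, and the directory is the mounted file share. The sketch chains the commands with `&&` instead of `;` so that it stops at the first failure, which is a deviation from the template.

```shell
#!/bin/sh
# Local dry run of the debug-share-example container script.
# Assumption: DEBUG_SHARE is a temporary directory here, standing in for
# the Azure file share mount that the cluster provides.
DEBUG_SHARE="$(mktemp -d)"
export DEBUG_SHARE

# Same steps as the template: enter the share, write a marker file,
# then list and print it to confirm the write succeeded.
output="$(
  cd "$DEBUG_SHARE" &&
  pwd &&
  mkdir example &&
  echo "debug share example success" > example/success &&
  ls && ls example && cat example/success
)"
echo "$output"
```

If the script runs cleanly, the marker file `example/success` is left under `$DEBUG_SHARE`, which mirrors what should appear on the file share after the stress test pod completes.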
194 changes: 97 additions & 97 deletions tools/stress-cluster/cluster/README.md
@@ -1,160 +1,160 @@
This directory contains [Azure Bicep](https://docs.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview)
Table of Contents
* [Layout](#layout)
* [Dependencies](#dependencies)
* [Deploying Cluster(s)](#deploying-clusters)
* [Dev Cluster](#dev-cluster)
* [Test Cluster](#test-cluster)
* [Prod Cluster](#prod-cluster)
* [Local Cluster](#local-cluster)
* [Development](#development)
* [Bicep templates](#bicep-templates)
* [Helm templates](#helm-templates)


# Layout

This directory contains all configuration used for stress test cluster buildout (azure and kubernetes buildout), as well
as a set of common stress test config boilerplate (helm library).

The `./azure` directory contains [Azure Bicep](https://docs.microsoft.com/en-us/azure/azure-resource-manager/bicep/overview)
files for deploying Azure resources (mainly [AKS clusters](https://azure.microsoft.com/en-us/services/kubernetes-service/))
to support stress testing (for dev/test and/or production).

Azure Bicep comes pre-installed with the Azure CLI, and is a DSL for generating ARM templates.

The `./kubernetes/stress-infrastructure` directory contains a helm chart for deploying the core services
that must be installed into any stress cluster: chaos-mesh (for chaos) and stress-watcher (for event handling like chaos
resource start and resource group cleanup).

The `./kubernetes/stress-test-addons` directory contains a [library chart](https://helm.sh/docs/topics/library_charts/)
for use by stress test packages. This common set of config boilerplate simplifies stress test authoring, and makes it
easier to make and roll out config changes to tests across repos by using helm chart dependency versioning.


# Dependencies

- [Powershell Core](https://docs.microsoft.com/en-us/powershell/scripting/install/installing-powershell-core-on-linux?view=powershell-7.1#ubuntu-2004) (if using Linux)

Review comment (Member): I'd recommend PowerShell Core for all OSes, as there are a lot of networking improvements over Windows PowerShell as well.

- [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
- If using app insights, install the az extension: `az extension add --name application-insights`
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (if accessing clusters)
- [helm](https://helm.sh) (if installing stress infrastructure)
- [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl)
- [helm](https://helm.sh)
- [kind](https://github.com/kubernetes-sigs/kind/releases) (if testing locally)
- [Docker](https://docs.docker.com/get-docker/) (if deploying/testing locally)

# Cluster Deployment Quick Start

## Deploying a Dev Cluster
# Deploying Cluster(s)

First, update the `./azure/parameters/dev.json` parameters file with the values marked `// add me`, then:
The cluster-specific configurations can be found at `./azure/parameters/<environment>.json`.

```
az deployment sub create -o json -n <your name> -l westus -f ./azure/main.bicep --parameters ./azure/parameters/dev.json

# wait until resource group and AKS cluster are deployed
az aks get-credentials stress-azuresdk -g rg-stress-test-cluster-<group suffix parameter>
```
Almost all stress test infrastructure is local to the cluster resource group, including storage accounts, keyvaults,
log workspaces and the AKS cluster. There is also a set of static resources, including a subscription service principal
and a keyvault containing the credential configuration. These are shared across clusters located in the same subscription
and are provisioned independently of the bicep templates.

## Deploying a Local Cluster
Cluster buildout and deployment involves three main steps which are automated in `./provision.ps1`:

NOTE: Chaos-Mesh may not work on all local deployments (e.g. Docker Desktop on Windows via WSL).
It may be easier to test services, manifests and containers locally with KIND, and test chaos
in an Azure AKS cluster (shared or personal).
1. Provision static resources (service principal, role assignments, static keyvault).
1. Provision cluster resources (`main.bicep` entrypoint, standard ARM subscription deployment).
1. Provision stress infrastructure resources into the Azure Kubernetes Service cluster via helm
(`./kubernetes/stress-infrastructure` helm chart).
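
The three steps above can be sketched as a dry run that only prints commands. The commands are the manual ones listed in the previous revision of this README; whether `provision.ps1` runs exactly these is an assumption, and `print_provision_plan` is a hypothetical helper that does not exist in the repository. Placeholders such as `<subscription id>` are left unfilled on purpose.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands behind each provisioning step
# without executing them. print_provision_plan is a hypothetical helper,
# not part of provision.ps1.
print_provision_plan() {
  env_name="$1"
  cat <<EOF
# 1. static resources (service principal for stress test deployments)
az ad sp create-for-rbac -n 'stress-test-provisioner' --role Contributor --scopes '/subscriptions/<subscription id>'
# 2. cluster resources (subscription-scoped Bicep deployment)
az deployment sub create -o json -n <deployment name> -l westus -f ./azure/main.bicep --parameters ./azure/parameters/${env_name}.json
# 3. stress infrastructure (helm chart installed into the AKS cluster)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm dependency update ./kubernetes/stress-infrastructure
helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure
EOF
}

# Print the plan for the dev environment.
plan="$(print_provision_plan dev)"
echo "$plan"
```

Printing rather than executing keeps the sketch safe to run anywhere; the only per-environment difference it shows is the parameters file passed to the Bicep deployment.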

```
# Ensure docker is running
kind create cluster
```
## Dev Cluster

## Deploying Stress Infrastructure into Cluster
First, update the `./azure/parameters/dev.json` parameters file with the values marked `// add me`, then run:

```
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm dependency update ./kubernetes/stress-infrastructure
helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure
./provision.ps1 -env dev
```

To deploy stress test packages to the dev environment
(e.g. the [examples](https://github.com/Azure/bicep/tree/main/docs/examples)), pass in `-Environment dev` (see below).
The provision script will update the `./kubernetes/stress-test-addons/values.yaml` file with all the relevant
resource values from the newly provisioned dev environment that are required by the stress test common configuration.

# Development

Examples detailing the Azure Bicep DSL can be found [here](https://github.com/Azure/bicep/tree/main/docs/examples).
Avoid checking in the updated dev values; they are for local use only.

Bicep also has a [VSCode extension](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-bicep).

To validate file changes/compilation:

```
az bicep build -f ./azure/main.bicep
```

To deploy and access resources:

# -Login only needs to be run once or if the azure container registry credentials have expired (~24 hours)
<tools repo>/eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -Login -Environment dev
```
# Edit ./azure/parameters/dev.json, replacing // add me values
# Add -c to dry run changes with a chance to confirm
az deployment sub create -o json -n <your name> -l westus -f ./azure/main.bicep --parameters ./azure/parameters/dev.json

# Copy the relevant outputs from the deployment to ./kubernetes/environments/<environment yaml file>
# for deploying stress tests later on
az deployment sub show -o json -n <your name> --query properties.outputs
## Test Cluster

az aks list -g rg-stress-test-cluster-<group suffix parameter>
az aks get-credentials stress-test -g rg-stress-test-cluster-<group suffix parameter>
The test cluster is the main ad-hoc cluster made available to SDK developers and partners. Changes to this cluster
should be made carefully and announced in advance in order not to disrupt people's work.

# Verify cluster access
kubectl get pods

# Install stress infrastructure components
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm dependency update ./kubernetes/stress-infrastructure
helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure
kubectl get pods --namespace stress-infra
```
./provision.ps1 -env test
```

To access the chaos-mesh dashboard, run the below command then navigate to `localhost:2333` in the browser:
## Prod Cluster

The "prod" cluster is the main cluster used for auto-deployment of checked-in stress tests via the StressTestRelease pipeline.
Currently, new instances of all stress tests across the language repositories are deployed on a weekly cadence.
Changes to the prod cluster should ideally be made around the stress test deployment cycle so as to avoid disruption
of test metrics.

```
kubectl port-forward -n stress-infra svc/chaos-dashboard 2333:2333
./provision.ps1 -env prod
```

To remove AKS cluster stress testing resources:
## Local Cluster

```
helm uninstall stress-infra --namespace stress-infra
```
For quick testing of various kubernetes configurations, it can be faster and cheaper to use a local cluster.
Not all components of stress testing work in local clusters, however. If testing these components is necessary, the
recommended action is to spin up a dev cluster.

To remove Azure resources:
NOTE: Chaos-Mesh may not work on all local deployments (e.g. Docker Desktop on Windows via WSL).
It may be easier to test services, manifests and containers locally with KIND, and test chaos
in an Azure AKS cluster (shared or personal).

```
az group delete <resource group name>
az keyvault purge -n <keyvault name>
# Ensure docker is running
kind create cluster
```

# Building out the Main/Prod Testing Cluster

If not already done, enable the relevant preview features in the subscription and CLI:
- [AKS-AzureKeyVaultSecretsProvider](https://docs.microsoft.com/en-us/azure/aks/csi-secrets-store-driver#register-the-aks-azurekeyvaultsecretsprovider-preview-feature)

## Initializing static identities

The "official" stress testing clusters rely on a separately created keyvault containing secrets with subscription credentials for stress test resource deployments.
The identities/credentials in these keyvaults can't be created via ARM/Bicep, and should be managed independently of the individual environments.
# Development

To initialize these resources, if they don't exist:
## Bicep templates

```
az group create rg-StressTestSecrets
az keyvault create -n StressTestSecrets -g rg-StressTestSecrets
az ad sp create-for-rbac -n 'stress-test-provisioner' --role Contributor --scopes '/subscriptions/<subscription id>'
```

Create an env file with the service principal values created above:
Examples detailing the Azure Bicep DSL can be found [here](https://github.com/Azure/bicep/tree/main/docs/examples).

```
AZURE_CLIENT_ID=<app id>
AZURE_CLIENT_SECRET=<password/secret>
AZURE_TENANT_ID=<tenant id>
```
Bicep also has a [VSCode extension](https://marketplace.visualstudio.com/items?itemName=ms-azuretools.vscode-bicep).

Upload it to the static keyvault:
To validate file changes/compilation:

```
az keyvault secret set --vault-name StressTestSecrets -f ./<env file> -n public
az bicep build -f ./azure/main.bicep
```

## Building Out Stress Test Cluster Resources
## Helm templates

Various environment configurations are located in `./azure/parameters/<env>.json` to be configured when deploying.
When making changes to `stress-test-addons`, it is easiest to validate them by building one of the [example projects
](https://github.com/Azure/azure-sdk-tools/tree/main/tools/stress-cluster/chaos/examples).

Deploy the cluster and related components (app insights, container registry, keyvault, access policies, etc.)
First, update the `dependencies` section of the example's `Chart.yaml` file to point to your local changes on disk:

```
az deployment sub create -o json -n stress-test-deploy -l westus -f ./azure/main.bicep --parameters ./azure/parameters/test.json
dependencies:
- name: stress-test-addons
version: <latest version on disk in stress-test-addons Chart.yaml>
repository: https://stresstestcharts.blob.core.windows.net/helm/
repository: file:///<path to azure-sdk-tools repo>/tools/stress-cluster/cluster/kubernetes/stress-test-addons
```

Gain access to the cluster and install the stress infrastructure components:
Then you can test out the template changes by running, in the example stress test package directory:

```
az aks get-credentials stress-test -g rg-stress-test-cluster-<group suffix>

helm repo add chaos-mesh https://charts.chaos-mesh.org
helm dependency update ./kubernetes/stress-infrastructure
helm install stress-infra -n stress-infra --create-namespace ./kubernetes/stress-infrastructure
helm template testrelease .
```

Update the values in `./kubernetes/stress-test-addons/values.yaml` to match the deployment outputs and check in the changes.
If there are any issues, the helm command will print the errors. If there are no errors, the rendered yaml
may still be an invalid kubernetes manifest, so the example stress test should also be deployed to validate
the full set of changes:

```
az deployment sub show -o json -n <your name> --query properties.outputs
# -Login only needs to be run once or if the azure container registry credentials have expired (~24 hours)
<tools repo>/eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -Login
```

For more helm debugging info, see [here](https://helm.sh/docs/chart_template_guide/debugging/).