[stress] Use matrix for parallel tests. PDB improvements+docs.
benbp committed Sep 22, 2023
1 parent 7d50367 commit c16e1dd
Showing 13 changed files with 89 additions and 68 deletions.
2 changes: 1 addition & 1 deletion eng/common/scripts/stress-testing/deploy-stress-tests.ps1
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ param(
[Parameter(Mandatory=$False)][array]$MatrixNonSparseParameters,

# Prevent kubernetes from deleting nodes or rebalancing pods related to this test for N days
[Parameter(Mandatory=$False)][int]$LockDeletionForDays
[Parameter(Mandatory=$False)][ValidateRange(1, 14)][int]$LockDeletionForDays
)

. $PSScriptRoot/stress-test-deployment-lib.ps1
Expand Down
58 changes: 39 additions & 19 deletions tools/stress-cluster/chaos/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ The chaos environment is an AKS cluster (Azure Kubernetes Service) with several

* [Installation](#installation)
* [Deploying a Stress Test](#deploying-a-stress-test)
* [Locking a test to run for a minimum number of days](#locking-a-test-to-run-for-a-minimum-number-of-days)
* [Creating a Stress Test](#creating-a-stress-test)
* [Layout](#layout)
* [Stress Test Metadata](#stress-test-metadata)
Expand Down Expand Up @@ -113,6 +114,27 @@ you can quick check the local logs:
kubectl logs -n <stress test namespace> <stress test pod name>
```

### Locking a test to run for a minimum number of days

Occasionally the Kubernetes cluster can disrupt long-running tests. This shows up as a test pod disappearing from the
cluster (though all logs and other telemetry will still be available in Application Insights). This can happen when
nodes are auto-upgraded or scaled down to reduce resource usage.

A node reboot or shutdown can interrupt a test that must run for a long time. This can be prevented
by setting the `-LockDeletionForDays` parameter. When this parameter is set, the test pods will be deployed alongside a
[PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) that prevents nodes hosting the
pods from being removed. After the set number of days, this pod disruption budget will be deleted and the test will be
interruptible again. The test will not automatically shut down after this time, but it will no longer be locked.

```
<repo root>/eng/common/scripts/stress-testing/deploy-stress-tests.ps1 -LockDeletionForDays 7
```

To see when a pod's deletion lock will expire:

```
kubectl get pod -n <namespace> <pod name> -o jsonpath='{.metadata.annotations.deletionLockExpiry}'
```
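
The arithmetic behind the lock can be sketched as follows. This is a minimal illustration, not the deploy script's logic: the function name `lock_expiry`, the timestamp format, and the cron layout are all assumptions. Only the 1 to 14 day range comes from the `ValidateRange(1, 14)` attribute on `-LockDeletionForDays`.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# Mirrors ValidateRange(1, 14) on the -LockDeletionForDays parameter.
MIN_LOCK_DAYS, MAX_LOCK_DAYS = 1, 14


def lock_expiry(start: datetime, lock_days: int) -> Tuple[str, str]:
    """Return (UTC expiry timestamp, five-field cron schedule) for a lock.

    Hypothetical helper: the output formats are assumptions for illustration.
    """
    if not MIN_LOCK_DAYS <= lock_days <= MAX_LOCK_DAYS:
        raise ValueError(
            f"lock_days must be between {MIN_LOCK_DAYS} and {MAX_LOCK_DAYS}"
        )
    expiry = start.astimezone(timezone.utc) + timedelta(days=lock_days)
    # Standard cron fields: minute hour day-of-month month day-of-week.
    # Note that such a schedule recurs yearly unless the job is removed.
    cron = f"{expiry.minute} {expiry.hour} {expiry.day} {expiry.month} *"
    return expiry.strftime("%Y-%m-%dT%H:%M:%SZ"), cron


if __name__ == "__main__":
    deployed = datetime(2023, 9, 22, 10, 30, tzinfo=timezone.utc)
    print(lock_expiry(deployed, 7))  # ('2023-09-29T10:30:00Z', '30 10 29 9 *')
```

Comparing a value computed this way with the `deletionLockExpiry` annotation is one way to sanity check a deployment, assuming the annotation holds a comparable timestamp.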

## Creating a Stress Test

Expand Down Expand Up @@ -378,33 +400,31 @@ spec:
#### Run multiple pods in parallel within a test job

In some cases it may be necessary to run multiple instances of the same process/container in parallel as part of a test,
for example an eventhub test that needs to run 3 consumers, each in their own container. This can be achieved using
the `stress-test-addons.parallel-deploy-job-template.from-pod` template. The parallel feature leverages the
for example an eventhub test that needs to run 3 consumers, each in their own container. This can be achieved by adding
a `parallel` field in the matrix config. The parallel feature leverages the
[job completion mode](https://kubernetes.io/docs/concepts/workloads/controllers/job/#completion-mode) feature. Test
commands in the container can read the `JOB_COMPLETION_INDEX` environment variable to make decisions. For example,
a messaging test that needs to run a single producer and multiple consumers can have logic that runs the producer when
`JOB_COMPLETION_INDEX` is 0, and a consumer when it is not 0.
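
The producer/consumer split described above could look roughly like this inside a test's entry point. This is a sketch only: the function name `role_for_pod` and the role strings are hypothetical, and defaulting to index 0 when the variable is absent is an assumption so that a non-parallel run still starts a producer.

```python
import os


def role_for_pod(env=None) -> str:
    """Pick a role from JOB_COMPLETION_INDEX: index 0 produces, others consume."""
    env = os.environ if env is None else env
    # Kubernetes sets JOB_COMPLETION_INDEX for jobs in Indexed completion mode.
    # Falling back to "0" when it is unset is an assumption of this sketch.
    index = int(env.get("JOB_COMPLETION_INDEX", "0"))
    return "producer" if index == 0 else "consumer"


if __name__ == "__main__":
    print(f"Starting as {role_for_pod()}")
```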

See the below example to enable parallel pods. Note the `(list . "stress.parallel-pod-example 3)` segment. The final argument (shown as `3` in the example) sets how many parallel pods should be run.

See a full working example of parallel pods [here](https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/examples/parallel-pod-example).

See the below example to enable parallel pods via the matrix config (`scenarios-matrix.yaml`):

```
{{- include "stress-test-addons.parallel-deploy-job-template.from-pod" (list . "stress.parallel-pod-example" 3) -}}
{{- define "stress.parallel-pod-example" -}}
metadata:
labels:
testName: "parallel-pod-example"
spec:
containers:
- name: parallel-pod-example
image: busybox
command: ['bash', '-c']
args:
- |
echo "Completed pod instance $JOB_COMPLETION_INDEX"
{{- include "stress-test-addons.container-env" . | nindent 6 }}
{{- end -}}
# scenarios-matrix.yaml
matrix:
  scenarios:
    parallel-example-a:
      description: "Example for running multiple test containers in parallel"
      # Adding this field into a matrix entry determines
      # how many pods will run in parallel
      parallel: 3
    parallel-example-b:
      description: "Example for running multiple test containers in parallel"
      parallel: 2
    non-parallel-example:
      description: "This scenario does not run multiple pods in parallel"
```
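
As a rough model of what the `parallel` field does, the job templates in this change add `completions`, `parallelism`, and `completionMode: Indexed` to the job spec only when `.Stress.parallel` is set. The Python below mirrors that conditional for illustration. The function name `job_spec_overrides` is hypothetical, and the real rendering is done by Helm, not by this code.

```python
def job_spec_overrides(scenario: dict) -> dict:
    """Return the Job spec fields implied by a matrix scenario entry."""
    parallel = scenario.get("parallel")
    if not parallel:
        # No `parallel` field: the job spec is left unchanged.
        return {}
    return {
        "completions": parallel,
        "parallelism": parallel,
        "completionMode": "Indexed",
    }


# Scenario entries copied from the matrix example (descriptions omitted).
scenarios = {
    "parallel-example-a": {"parallel": 3},
    "parallel-example-b": {"parallel": 2},
    "non-parallel-example": {},
}

if __name__ == "__main__":
    for name, entry in scenarios.items():
        print(name, job_spec_overrides(entry))
```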

NOTE: when multiple pods are run, each pod will invoke its own azure deployment init container. When many of these containers
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,5 @@ annotations:

dependencies:
- name: stress-test-addons
version: ~0.2.0
version: ~0.3.0
repository: "@stress-test-charts"
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,5 @@ annotations:

dependencies:
- name: stress-test-addons
version: ~0.2.0
version: ~0.3.0
repository: "@stress-test-charts"
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,5 @@ annotations:

dependencies:
- name: stress-test-addons
version: ~0.2.0
version: ~0.3.0
repository: "@stress-test-charts"
Original file line number Diff line number Diff line change
@@ -1,4 +1,12 @@
matrix:
scenarios:
parallel:
parallel-example-a:
description: "Example for running multiple test containers in parallel"
# Adding this field into a matrix entry determines
# how many pods will run in parallel
parallel: 3
parallel-example-b:
description: "Example for running multiple test containers in parallel"
parallel: 2
non-parallel-example:
description: "This scenario does not run multiple pods in parallel"
Original file line number Diff line number Diff line change
@@ -1,7 +1,4 @@
{{- /*
The 3rd argument to this template (set as `3` below) is what determines the parallel pod count.
*/}}
{{- include "stress-test-addons.parallel-deploy-job-template.from-pod" (list . "stress.parallel-pod-example" 3) -}}
{{- include "stress-test-addons.deploy-job-template.from-pod" (list . "stress.parallel-pod-example") -}}
{{- define "stress.parallel-pod-example" -}}
metadata:
labels:
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,5 @@ annotations:

dependencies:
- name: stress-test-addons
version: ~0.2.0
version: ~0.3.0
repository: "@stress-test-charts"
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,5 @@ annotations:

dependencies:
- name: stress-test-addons
version: ~0.2.0
version: ~0.3.0
repository: "@stress-test-charts"
Original file line number Diff line number Diff line change
@@ -1,5 +1,17 @@
# Release History

## 0.3.0 (2023-09-22)

### Breaking Changes

Move parallel job configuration into the special matrix field `parallel` so that
parallelism can be set per scenario. Remove the `parallel-deploy-job-template` way
of setting parallelism that was added in the 0.2.1 release.

### Features Added

Adds support for pod disruption budgets when the Helm values `PodDisruptionBudgetExpiry` and `PodDisruptionBudgetExpiryCron` are set. When the expiry is set, a PDB is created matching all pods in a release, and a cron job is created to delete the PDB on the specified date. This allows users to mark a test as non-interruptible so that Kubernetes will not shut down the node for upgrades, rebalancing, etc.

## 0.2.2 (2023-09-21)

### Features Added
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,5 +2,5 @@ apiVersion: v2
name: stress-test-addons
description: Baseline resources and templates for stress testing clusters

version: 0.2.2
version: 0.3.0
appVersion: v0.1
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: disruption-budget-{{ .Release.Name }}
name: {{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
release: {{ .Release.Name }}
Expand All @@ -19,13 +19,13 @@ spec:
kind: ServiceAccount
apiVersion: v1
metadata:
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-read-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-read-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
rules:
- apiGroups: ["*"]
Expand All @@ -35,20 +35,20 @@ rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-read-{{ .Release.Name }}
namespace: {{ .Release.Namespace }}
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-read-{{ .Release.Name }}
subjects:
- kind: ServiceAccount
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-read-{{ .Release.Name }}
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: pod-disruption-budget-expiry-{{ .Release.Name }}
name: pdb-del-{{ substr 0 39 .Release.Name }}-{{ lower (randAlphaNum 3) }}
namespace: {{ .Release.Namespace }}
spec:
concurrencyPolicy: Forbid
Expand All @@ -59,7 +59,7 @@ spec:
activeDeadlineSeconds: 600
template:
spec:
serviceAccountName: pod-disruption-budget-expiry-{{ .Release.Name }}
serviceAccountName: pdb-read-{{ .Release.Name }}
restartPolicy: OnFailure
containers:
- name: kubectl
Expand Down
Original file line number Diff line number Diff line change
@@ -1,16 +1,9 @@
{{- define "stress-test-addons.job-wrapper.tpl" -}}
{{- $global := index . 0 -}}
{{- $definition := index . 1 -}}
spec:
template:
{{- include (index . 1) (index . 0) | nindent 4 -}}
{{- end -}}

{{- define "stress-test-addons.parallel-job-wrapper.tpl" -}}
spec:
completions: {{ index . 2 }}
parallelism: {{ index . 2 }}
completionMode: Indexed
template:
{{- include (index . 1) (index . 0) | nindent 4 -}}
{{- include $definition $global | nindent 4 -}}
{{- end -}}

{{- define "stress-test-addons.deploy-job-template.tpl" -}}
Expand All @@ -25,6 +18,11 @@ metadata:
resourceGroupName: {{ .Stress.ResourceGroupName }}
baseName: {{ .Stress.BaseName }}
spec:
{{- if .Stress.parallel }}
completions: {{ .Stress.parallel }}
parallelism: {{ .Stress.parallel }}
completionMode: Indexed
{{- end }}
backoffLimit: 0
template:
metadata:
Expand Down Expand Up @@ -74,25 +72,6 @@ spec:
{{- end }}
{{- end -}}

{{- define "stress-test-addons.parallel-deploy-job-template.from-pod" -}}
{{- $global := index . 0 -}}
{{- $podDefinition := index . 1 -}}
{{- $parallel := index . 2 -}}
# Configmap template that adds the stress test ARM template for mounting
{{- include "stress-test-addons.deploy-configmap" $global }}
{{- range (default (list "stress") $global.Values.scenarios) }}
---
{{ $jobCtx := fromYaml (include "stress-test-addons.util.mergeStressContext" (list $global . )) }}
{{- $jobOverride := fromYaml (include "stress-test-addons.parallel-job-wrapper.tpl" (list $jobCtx $podDefinition $parallel)) -}}
{{- $tpl := fromYaml (include "stress-test-addons.deploy-job-template.tpl" $jobCtx) -}}
{{- toYaml (merge $jobOverride $tpl) -}}
{{- end }}
{{- include "stress-test-addons.static-secrets" $global }}
{{- if $global.Values.PodDisruptionBudgetExpiry }}
{{- include "stress-test-addons.pod-disruption-budget" $global }}
{{- end }}
{{- end -}}

{{- define "stress-test-addons.env-job-template.tpl" -}}
apiVersion: batch/v1
kind: Job
Expand All @@ -105,6 +84,11 @@ metadata:
resourceGroupName: {{ .Stress.ResourceGroupName }}
baseName: {{ .Stress.BaseName }}
spec:
{{- if .Stress.parallel }}
completions: {{ .Stress.parallel }}
parallelism: {{ .Stress.parallel }}
completionMode: Indexed
{{- end }}
backoffLimit: 0
template:
metadata:
Expand Down
