[stress testing] Enable nodepool update via bicep and update node SKUs (#5309)

- Updating stress cluster node SKUs - use d4v4 everywhere, separate system from user pools, and add stress-watcher to the system pool
- Enable updating stress cluster nodepools via bicep deployment via an `-UpdateNodes` switch.
- Remove warnings from bicep build
- Update documentation for pool SKU label selectors
- Minor provision script fixes: missing object id and custom environment parameter validation

Fixes #4830
benbp authored Feb 2, 2023
1 parent b9b9072 commit 7f00c18
Showing 14 changed files with 114 additions and 80 deletions.
4 changes: 0 additions & 4 deletions eng/common/scripts/stress-testing/deploy-stress-tests.ps1
@@ -10,11 +10,7 @@ param(
[switch]$PushImages,
[string]$ClusterGroup,
[string]$DeployId,

[Parameter(ParameterSetName = 'DoLogin', Mandatory = $true)]
[switch]$Login,

[Parameter(ParameterSetName = 'DoLogin')]
[string]$Subscription,

# Default to true in Azure Pipelines environments
13 changes: 8 additions & 5 deletions eng/common/scripts/stress-testing/stress-test-deployment-lib.ps1
@@ -117,12 +117,11 @@ function DeployStressTests(
}
$clusterGroup = 'rg-stress-cluster-prod'
$subscription = 'Azure SDK Test Resources'
} elseif (!$clusterGroup -or !$subscription) {
throw "clusterGroup and subscription parameters must be specified when deploying to an environment that is not pg or prod."
}

if ($login) {
if (!$clusterGroup -or !$subscription) {
throw "clusterGroup and subscription parameters must be specified when logging into an environment that is not pg or prod."
}
Login -subscription $subscription -clusterGroup $clusterGroup -pushImages:$pushImages
}

@@ -160,7 +159,9 @@ function DeployStressTests(
-environment $environment `
-repositoryBase $repository `
-pushImages:$pushImages `
-login:$login
-login:$login `
-clusterGroup $clusterGroup `
-subscription $subscription
}

if ($FailedCommands.Count -lt $pkgs.Count) {
@@ -185,7 +186,9 @@ function DeployStressPackage(
[string]$environment,
[string]$repositoryBase,
[switch]$pushImages,
[switch]$login
[switch]$login,
[string]$clusterGroup,
[string]$subscription
) {
$registry = RunOrExitOnFailure az acr list -g $clusterGroup --subscription $subscription -o json
$registryName = ($registry | ConvertFrom-Json).name
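
Reading the hunks above, the `clusterGroup`/`subscription` check appears to move out of the `if ($login)` block so that it also runs for deployments that skip login. That matters now that `DeployStressPackage` passes both values to `az acr list`. A minimal sketch of that ordering follows, assuming a hypothetical helper `Resolve-ClusterTarget` that is not part of the commit. The `pg` values are not visible in this hunk, so the sketch leaves them out rather than guessing.

```powershell
# Hypothetical helper for illustration only; the commit keeps this logic inline
# in DeployStressTests.
function Resolve-ClusterTarget {
    param(
        [string]$Environment,
        [string]$ClusterGroup,
        [string]$Subscription
    )

    # Built-in environments. Only the 'prod' values appear in the hunk above;
    # the 'pg' entry is intentionally omitted because its values are not shown.
    $knownTargets = @{
        prod = @{ ClusterGroup = 'rg-stress-cluster-prod'; Subscription = 'Azure SDK Test Resources' }
    }

    if ($knownTargets.ContainsKey($Environment)) {
        return $knownTargets[$Environment]
    }

    # Custom environments must supply both values, whether or not -Login is used.
    if (!$ClusterGroup -or !$Subscription) {
        throw "clusterGroup and subscription parameters must be specified when deploying to a custom environment."
    }

    return @{ ClusterGroup = $ClusterGroup; Subscription = $Subscription }
}

# Usage with placeholder values (assumptions, not real resource names):
$target = Resolve-ClusterTarget -Environment 'dev' -ClusterGroup 'rg-stress-cluster-dev' -Subscription 'My Test Subscription'
```

Failing before any `az` call keeps a missing value from surfacing later as a less obvious registry lookup error.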
7 changes: 3 additions & 4 deletions tools/stress-cluster/chaos/README.md
@@ -484,23 +484,22 @@ The `stress-test-addons` helm library will handle a scenarios matrix automatical

### Node Size Requirements

The stress test cluster is deployed with several node SKUs (see [agentPoolProfiles declaration and
The stress test cluster may be deployed with several node SKUs (see [agentPoolProfiles declaration and
variables](https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/cluster/azure/cluster/cluster.bicep)), with tests defaulting to the SKU labeled 'default'.
By adding the `nodeSelector` field to the job spec, you can override which nodes the test container will
be provisioned to. For support adding a custom or dedicated node SKU, reach out to the EngSys team.

Available common SKUs in stress test clusters:

- 'default' - Standard\_D2\_v3
- 'highMem' - Standard\_D4ds\_v4
- 'default' - Standard\_D4ds\_v4

To deploy a stress test to a custom node (see also
[example](https://github.com/Azure/azure-sdk-tools/blob/main/tools/stress-cluster/chaos/examples/network-stress-example/templates/testjob.yaml)):

```
spec:
  nodeSelector:
    sku: 'highMem'
    sku: '<nodepool sku label>'
  containers:
    < container spec ... >
```
4 changes: 4 additions & 0 deletions tools/stress-cluster/cluster/README.md
@@ -55,9 +55,13 @@ Cluster buildout and deployment involves three main steps which are automated in

1. Provision static resources (service principal, role assignments, static keyvault).
1. Provision cluster resources (`main.bicep` entrypoint, standard ARM subscription deployment).
- NOTE: if the nodepool configuration for the AKS cluster needs to be updated, it cannot be done
alongside a deployment to the cluster itself. In order to update the nodepool configuration only, pass
the `-UpdateNodes` parameter to the provision script.
1. Provision stress infrastructures resources into the Azure Kubernetes Service cluster via helm
(`./kubernetes/stress-infrastructure` helm chart).
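
As a rough illustration of the note in step 2, a nodepool-only update could be expressed as a subscription deployment with the `updateNodes` parameter of `main.bicep` set to `true`. This is a sketch, assuming the provision script's `-UpdateNodes` switch forwards to that parameter. The deployment name, location, resource group, and cluster name below are placeholders, not values from the repository.

```powershell
# Placeholder values (assumptions); replace with your own.
$location = 'westus2'                          # deployment metadata location
$clusterGroup = '<cluster resource group>'
$clusterName = '<cluster name>'

# Deploy main.bicep at subscription scope with updateNodes overridden to true,
# so cluster.bicep deploys the agentPools child resources against the existing
# cluster instead of redeploying the managed cluster resource.
az deployment sub create `
    --name 'stress-nodepool-update' `
    --location $location `
    --template-file ./azure/main.bicep `
    --parameters '@./azure/parameters/dev.json' `
    --parameters updateNodes=true

# Check the resulting pools and VM sizes.
az aks nodepool list --resource-group $clusterGroup --cluster-name $clusterName --output table
```

The parameters file must already have its `// add me` values filled in before either command is run.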


## Dev Cluster

First, update the `./azure/parameters/dev.json` parameters file with the values marked `// add me`, then run:
8 changes: 4 additions & 4 deletions tools/stress-cluster/cluster/azure/cluster/acr.bicep
@@ -19,19 +19,19 @@ resource registry 'Microsoft.ContainerRegistry/registries@2019-12-01-preview' =

// Add AcrPush and AcrPull roles to access groups
resource acrPushRole 'Microsoft.Authorization/roleAssignments@2020-04-01-preview' = [for objectId in objectIds: {
name: '${guid('azureContainerRegistryPushRole', objectId, resourceGroup().id)}'
name: guid('azureContainerRegistryPushRole', objectId, resourceGroup().id)
scope: registry
properties: {
roleDefinitionId: '${subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '8311e382-0749-4cb8-b61a-304f252e45ec')}'
roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '8311e382-0749-4cb8-b61a-304f252e45ec')
principalId: objectId
}
}]

resource acrPullRole 'Microsoft.Authorization/roleAssignments@2020-04-01-preview' = [for objectId in objectIds: {
name: '${guid('azureContainerRegistryPullRole', objectId, resourceGroup().id)}'
name: guid('azureContainerRegistryPullRole', objectId, resourceGroup().id)
scope: registry
properties: {
roleDefinitionId: '${subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '7f951dda-4ed3-4680-a7ca-43fe172d538d')}'
roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '7f951dda-4ed3-4680-a7ca-43fe172d538d')
principalId: objectId
}
}]
80 changes: 53 additions & 27 deletions tools/stress-cluster/cluster/azure/cluster/cluster.bicep
@@ -4,53 +4,54 @@ param groupSuffix string
param dnsPrefix string = 's1'
param clusterName string
param location string = resourceGroup().location
param enableHighMemAgentPool bool = false
// AKS does not allow agentPool updates via existing managed cluster resources
param updateNodes bool = false

// monitoring parameters
param workspaceId string

var kubernetesVersion = '1.24.3'
var kubernetesVersion = '1.25.4'
var nodeResourceGroup = 'rg-nodes-${dnsPrefix}-${clusterName}-${groupSuffix}'

var defaultAgentPool = {
name: 'default'
count: 3
minCount: 3
maxCount: 9
var systemAgentPool = {
name: 'system'
count: 1
minCount: 1
maxCount: 4
mode: 'System'
vmSize: 'Standard_D2_v3'
vmSize: 'Standard_D4ds_v4'
type: 'VirtualMachineScaleSets'
osType: 'Linux'
enableAutoScaling: true
enableEncryptionAtHost: true
nodeLabels: {
'sku': 'default'
sku: 'system'
}
}

var highMemAgentPool = {
name: 'highmemory'
count: 1
minCount: 1
maxCount: 3
mode: 'System'
var defaultAgentPool = {
name: 'default'
count: 3
minCount: 5
maxCount: 24
mode: 'User'
vmSize: 'Standard_D4ds_v4'
type: 'VirtualMachineScaleSets'
osType: 'Linux'
osDiskType: 'Ephemeral'
enableAutoScaling: true
enableEncryptionAtHost: true
nodeLabels: {
'sku': 'highMem'
sku: 'default'
}
}

var agentPools = concat([
defaultAgentPool
], enableHighMemAgentPool ? [
highMemAgentPool
] : [])
var agentPools = [
systemAgentPool
defaultAgentPool
]

resource cluster 'Microsoft.ContainerService/managedClusters@2020-09-01' = {
resource newCluster 'Microsoft.ContainerService/managedClusters@2022-09-02-preview' = if (!updateNodes) {
name: clusterName
location: location
tags: tags
@@ -83,14 +84,39 @@ resource cluster 'Microsoft.ContainerService/managedClusters@2020-09-01' = {
}
}

resource existingCluster 'Microsoft.ContainerService/managedClusters@2022-09-02-preview' existing = if (updateNodes) {
name: clusterName
}

// Workaround for duplicate variable names when conditionals are in use
// See https://github.com/Azure/bicep/issues/1410
var cluster = updateNodes ? existingCluster : newCluster

resource pools 'Microsoft.ContainerService/managedClusters/agentPools@2022-09-02-preview' = [for pool in agentPools: if (updateNodes) {
parent: existingCluster
name: pool.name
properties: {
count: pool.count
minCount: pool.minCount
maxCount: pool.maxCount
mode: pool.mode
vmSize: pool.vmSize
type: pool.type
osType: pool.osType
enableAutoScaling: pool.enableAutoScaling
// enableEncryptionAtHost: pool.enableEncryptionAtHost
nodeLabels: pool.nodeLabels
}
}]

// Add Monitoring Metrics Publisher role to omsagent identity. Required to publish metrics data to
// cluster resource container insights.
// https://docs.microsoft.com/azure/azure-monitor/containers/container-insights-update-metrics
resource metricsPublisher 'Microsoft.Authorization/roleAssignments@2020-04-01-preview' = {
name: '${guid('monitoringMetricsPublisherRole', resourceGroup().id)}'
scope: cluster
resource metricsPublisher 'Microsoft.Authorization/roleAssignments@2020-04-01-preview' = if (!updateNodes) {
name: guid('monitoringMetricsPublisherRole', resourceGroup().id)
scope: newCluster
properties: {
roleDefinitionId: '${subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '3913510d-42f4-4e42-8a64-420c390055eb')}'
roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '3913510d-42f4-4e42-8a64-420c390055eb')
// NOTE: using objectId over clientId seems to handle cross-region propagation delays better for newly created identities
principalId: cluster.properties.addonProfiles.omsagent.identity.objectId
}
@@ -99,4 +125,4 @@ resource metricsPublisher 'Microsoft.Authorization/roleAssignments@2020-04-01-pr
output secretProviderObjectId string = cluster.properties.addonProfiles.azureKeyvaultSecretsProvider.identity.objectId
output secretProviderClientId string = cluster.properties.addonProfiles.azureKeyvaultSecretsProvider.identity.clientId
output kubeletIdentityObjectId string = cluster.properties.identityProfile.kubeletidentity.objectId
output clusterName string = cluster.name
output clusterName string = clusterName
36 changes: 20 additions & 16 deletions tools/stress-cluster/cluster/azure/main.bicep
@@ -4,11 +4,12 @@ param subscriptionId string = ''
param groupSuffix string
param clusterName string
param clusterLocation string = 'westus3'
param staticTestSecretsKeyvaultName string
param staticTestSecretsKeyvaultGroup string
param staticTestKeyvaultName string
param staticTestKeyvaultGroup string
param monitoringLocation string = 'centralus'
param tags object
param enableHighMemAgentPool bool = false
// AKS does not allow agentPool updates via existing managed cluster resources
param updateNodes bool = false

// Azure Developer Platform Team Group
// https://ms.portal.azure.com/#blade/Microsoft_AAD_IAM/GroupDetailsMenuBlade/Overview/groupId/56709ad9-8962-418a-ad0d-4b25fa962bae
@@ -52,6 +53,7 @@ module test_dashboard 'monitoring/stress-test-workbook.bicep' = {
scope: group
params: {
workbookDisplayName: 'Azure SDK Stress Testing - ${groupSuffix}'
location: clusterLocation
logAnalyticsResource: logWorkspace.outputs.id
}
}
@@ -61,6 +63,7 @@ module status_dashboard 'monitoring/stress-status-workbook.bicep' = {
scope: group
params: {
workbookDisplayName: 'Stress Status - ${groupSuffix}'
location: clusterLocation
logAnalyticsResource: logWorkspace.outputs.id
}
}
@@ -69,10 +72,11 @@ module cluster 'cluster/cluster.bicep' = {
name: 'cluster'
scope: group
params: {
updateNodes: updateNodes
location: clusterLocation
clusterName: clusterName
tags: tags
groupSuffix: groupSuffix
enableHighMemAgentPool: enableHighMemAgentPool
workspaceId: logWorkspace.outputs.id
}
}
@@ -88,13 +92,13 @@ module containerRegistry 'cluster/acr.bicep' = {
}

module storage 'cluster/storage.bicep' = {
name: 'storage'
scope: group
params: {
storageName: 'stressdebug${resourceSuffix}'
fileShareName: 'stressfiles${resourceSuffix}'
location: clusterLocation
}
name: 'storage'
scope: group
params: {
storageName: 'stressdebug${resourceSuffix}'
fileShareName: 'stressfiles${resourceSuffix}'
location: clusterLocation
}
}

var appInsightsInstrumentationKeySecretName = 'appInsightsInstrumentationKey-${resourceSuffix}'
@@ -109,9 +113,9 @@ var appInsightsConnectionStringSecretValue = 'APPLICATIONINSIGHTS_CONNECTION_STR
// See https://docs.microsoft.com/azure/aks/azure-files-volume#create-a-kubernetes-secret
// See https://docs.microsoft.com/azure/aks/azure-files-csi
var debugStorageKeySecretName = 'debugStorageKey-${resourceSuffix}'
var debugStorageKeySecretValue = '${storage.outputs.key}'
var debugStorageKeySecretValue = storage.outputs.key
var debugStorageAccountSecretName = 'debugStorageAccount-${resourceSuffix}'
var debugStorageAccountSecretValue = '${storage.outputs.name}'
var debugStorageAccountSecretValue = storage.outputs.name

module keyvault 'cluster/keyvault.bicep' = {
name: 'keyvault'
@@ -146,15 +150,15 @@ module keyvault 'cluster/keyvault.bicep' = {

module accessPolicy 'cluster/static-vault-access-policy.bicep' = {
name: 'accessPolicy'
scope: resourceGroup(staticTestSecretsKeyvaultGroup)
scope: resourceGroup(staticTestKeyvaultGroup)
params: {
vaultName: staticTestSecretsKeyvaultName
vaultName: staticTestKeyvaultName
tenantId: subscription().tenantId
objectId: cluster.outputs.secretProviderObjectId
}
}

output STATIC_TEST_SECRETS_KEYVAULT string = staticTestSecretsKeyvaultName
output STATIC_TEST_SECRETS_KEYVAULT string = staticTestKeyvaultName
output CLUSTER_TEST_SECRETS_KEYVAULT string = keyvault.outputs.keyvaultName
output SECRET_PROVIDER_CLIENT_ID string = cluster.outputs.secretProviderClientId
output CLUSTER_NAME string = cluster.outputs.clusterName
@@ -1,4 +1,5 @@
param logAnalyticsResource string
param location string = resourceGroup().location

@description('The friendly name for the workbook that is used in the Gallery or Saved List. This name must be unique within a resource group.')
param workbookDisplayName string
@@ -233,7 +234,7 @@ var workbookContent = {

resource workbookId_resource 'microsoft.insights/workbooks@2021-03-08' = {
name: workbookId
location: resourceGroup().location
location: location
kind: 'shared'
properties: {
displayName: workbookDisplayName
@@ -1,4 +1,5 @@
param logAnalyticsResource string
param location string = resourceGroup().location

@description('The friendly name for the workbook that is used in the Gallery or Saved List. This name must be unique within a resource group.')
param workbookDisplayName string
@@ -308,7 +309,7 @@ var workbookContent = {

resource workbookId_resource 'microsoft.insights/workbooks@2021-03-08' = {
name: workbookId
location: resourceGroup().location
location: location
kind: 'shared'
properties: {
displayName: workbookDisplayName
4 changes: 2 additions & 2 deletions tools/stress-cluster/cluster/azure/parameters/dev.json
@@ -14,10 +14,10 @@
"clusterLocation": {
"value": "westus2"
},
"staticTestSecretsKeyvaultName": {
"staticTestKeyvaultName": {
"value": // add me, e.g. stress-secrets-<your alias>
},
"staticTestSecretsKeyvaultGroup": {
"staticTestKeyvaultGroup": {
"value": // add me, e.g. rg-stress-secrets-<your alias>
},
"tags": {
7 changes: 2 additions & 5 deletions tools/stress-cluster/cluster/azure/parameters/pg.json
@@ -14,15 +14,12 @@
"clusterLocation": {
"value": "westus3"
},
"staticTestSecretsKeyvaultName": {
"staticTestKeyvaultName": {
"value": "stress-secrets-pg"
},
"staticTestSecretsKeyvaultGroup": {
"staticTestKeyvaultGroup": {
"value": "rg-stress-secrets-pg"
},
"enableHighMemAgentPool": {
"value": true
},
"tags": {
"value": {
"environment": "pg",