Production - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #13766

dotnet-eng-status · 2023-06-05T13:00:28Z

💔 Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

WaitTime {Queue=ubuntu.2204.amd64.open.rt} 53
WaitTime {Queue=windows.11.amd64.client.open} 51
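For readers triaging a similar alert, here is a minimal sketch of how the `WaitTime` lines above relate to the 45-minute threshold stated in the alert text. It is not the actual Grafana rule: the regular expression, the function name `queues_over_threshold`, and the assumption that the value is a whole number of minutes are all illustrative.

```python
import re

# Threshold taken from the alert text ("more than 45 minutes").
THRESHOLD_MINUTES = 45

# Hypothetical pattern matching the metric lines as they appear in this issue.
LINE_RE = re.compile(r"WaitTime \{Queue=(?P<queue>[^}]+)\}\s+(?P<minutes>\d+)")


def queues_over_threshold(text, threshold=THRESHOLD_MINUTES):
    """Return {queue: minutes} for every queue waiting longer than threshold."""
    result = {}
    for match in LINE_RE.finditer(text):
        minutes = int(match.group("minutes"))
        if minutes > threshold:
            result[match.group("queue")] = minutes
    return result


alert_body = """\
WaitTime {Queue=ubuntu.2204.amd64.open.rt} 53
WaitTime {Queue=windows.11.amd64.client.open} 51
"""

print(queues_over_threshold(alert_body))
```

Both queues in this alert exceed the threshold, so both would be reported.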

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99

oleksandr-didyk · 2023-06-05T14:01:31Z

For windows.11.amd64.client.open: at the time of checking (~13:00 UTC) the scale set for the queue had several instances in the Creating status, so the scale-up was most probably delayed.

For ubuntu.2204.amd64.open.rt: ~18 instances are available, but none are visible in the heartbeats table. Additionally, the metrics displayed in AzDO suggest that some of the instances were created after the alert fired.

Will monitor the alert and close it if no more scaling issues are encountered.

dotnet-eng-status · 2023-06-07T02:35:24Z

💚 Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes and there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Go to rule

riarenas · 2023-06-07T02:36:46Z

More #13774. We expect provisioning times to stay stable now.

dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, Grafana Alert, and Production labels on Jun 5, 2023

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Jun 7, 2023

riarenas closed this as completed Jun 7, 2023
