Production - [Alerting] Autoscale: Minutes to scale-up from zero machine alert #13766

Closed
dotnet-eng-status bot opened this issue Jun 5, 2023 · 3 comments
Labels
  • Critical
  • Grafana Alert: Issues opened by Grafana
  • Inactive Alert: Issues from Grafana alerts that are now "OK"
  • Ops - First Responder
  • Production: Tied to the Production environment (as opposed to Staging)

Comments

@dotnet-eng-status

💔 Metric state changed to alerting

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

  • WaitTime {Queue=ubuntu.2204.amd64.open.rt} 53
  • WaitTime {Queue=windows.11.amd64.client.open} 51
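
For illustration, the alert condition described above can be sketched as a simple threshold check. This is a minimal sketch, not the actual Grafana rule: the function name and the dict layout are hypothetical, and the WaitTime values are assumed to be in minutes.

```python
# Hypothetical sketch of the alert condition; this is NOT the real Grafana rule.
THRESHOLD_MINUTES = 45  # taken from the alert text: "more than 45 minutes"


def queues_over_threshold(wait_times, threshold=THRESHOLD_MINUTES):
    """Return (queue, wait) pairs whose wait exceeds the threshold, worst first."""
    over = {queue: wait for queue, wait in wait_times.items() if wait > threshold}
    return sorted(over.items(), key=lambda item: item[1], reverse=True)


# WaitTime values reported in this alert (assumed to be minutes).
reported = {
    "ubuntu.2204.amd64.open.rt": 53,
    "windows.11.amd64.client.open": 51,
}

for queue, minutes in queues_over_threshold(reported):
    print(f"{queue}: waiting {minutes} min with no machines")
```

Both reported queues exceed the 45-minute threshold, which is consistent with the alert firing for each of them.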

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Automation information below, do not change

Grafana-Automated-Alert-Id-54aa0d7e647e46ff9e880bf6ae532b99

dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, Grafana Alert, and Production labels on Jun 5, 2023
@oleksandr-didyk
Contributor

For windows.11.amd64.client.open - when I checked (~13:00 UTC), the scale set for the queue had several instances in the Creating status, so the scale-up was most likely just delayed.

For ubuntu.2204.amd64.open.rt - ~18 instances are available, but none are visible in the heartbeats table. Additionally, from the metrics displayed in AzDO it seems that some of the instances were created after the alert fired.
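
The mismatch described above (instances present in the scale set but absent from the heartbeats table) amounts to a set difference. A minimal sketch, assuming two hypothetical lists of machine names; how those names are actually fetched from the scale set or the heartbeats table is not shown, and the case-insensitive comparison is an assumption.

```python
def missing_heartbeats(scale_set_instances, heartbeat_machines):
    """Return instances that exist in the scale set but have no heartbeat entry.

    Names are compared case-insensitively (an assumption, not a known rule).
    """
    seen = {name.lower() for name in heartbeat_machines}
    return sorted(name for name in scale_set_instances if name.lower() not in seen)


# Hypothetical machine names, for illustration only.
instances = ["vm-000001", "vm-000002", "vm-000003"]
heartbeats = ["VM-000002"]

print(missing_heartbeats(instances, heartbeats))
```

An empty heartbeat list returns every instance, which matches the situation reported here, where none of the available instances were visible.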

Will monitor the alert and close it if no more scaling issues are encountered.

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Jun 7, 2023
@dotnet-eng-status
Author

💚 Metric state changed to ok

Scale up issue: A queue has been waiting for a machine to scale up for more than 45 minutes, there are no machines in this queue, which could cause a lot of work to get stuck.

Wiki link for investigation and mitigation steps here

Metric Graph

Go to rule

@riarenas
Member

riarenas commented Jun 7, 2023

This is more of #13774. We expect provisioning times to stay stable now.

riarenas closed this as completed on Jun 7, 2023