Helix agents not provisioning #13774

Closed
3 tasks
missymessa opened this issue Jun 5, 2023 · 6 comments
Labels

  • 1ES Team: Failure related to the VMs managed by the 1ES Team
  • Critical
  • Detected By - Customer: Issue was reported by a customer
  • FC - Infrastructure: A build failure caused by apparent infrastructure failures.
  • Ops - First Responder
  • RCA Requested: A Root Cause Analysis (RCA) should be completed once this issue has been resolved.

Comments

@missymessa
Member

Customer reported that it's taking a long time for build agents to provision. Investigation shows that it started suddenly.

image

IcM has been filed:

https://portal.microsofticm.com/imp/v3/incidents/details/395129564/home

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

@missymessa missymessa self-assigned this Jun 5, 2023
@missymessa missymessa added the "1ES Team", "Detected By - Customer", and "Ops - First Responder" labels Jun 5, 2023
@missymessa
Member Author

Likely related: #13767

@missymessa
Member Author

AzDO folks are saying we've hit our quota. Looking into why resources aren't being deallocated.

@missymessa missymessa added the "RCA Requested" label Jun 6, 2023
@missymessa
Member Author

(Be sure that an RCA issue is opened when this closes. The functionality for that may not have been deployed yet; if not, we'll need to create an RCA issue manually.)

@missymessa missymessa changed the title from "Tracking Issue: NetCore1ESPool-Internal agents taking a long time to provision" to "Helix agents not provisioning" Jun 6, 2023
@missymessa missymessa added the "Critical" and "FC - Infrastructure" labels Jun 6, 2023
@missymessa missymessa transferred this issue from dotnet/dnceng Jun 6, 2023
@missymessa
Member Author

(transferring to dotnet/arcade so that Known Issues can pick this up).

@missymessa
Member Author

Current status:

  • The issue was caused by a bug in Secret Manager.
  • We have redeployed Helix Machine to fix the issue with the secrets. We will need to do some manual mitigation so that VMs don't keep spinning up in a broken state.
  • Asked the AzDO team to close the IcM, as we have identified the problem on our side and are making progress.

@riarenas
Member

riarenas commented Jun 7, 2023

All Helix VMs are now provisioning normally. The on-prem machines we lost will need to be configured again with the updated configuration files. I'll add RCA details tomorrow.
