Cirrus: Failed to stop: java something #18065
Could also be on the google-cloud end. Not only does Cirrus-CI (the mother-ship) run there, but so do most of the VMs. I noticed off-hand there were some problems reported there the other day. There's an outage history page, but there are so many services, I have no idea what to look for 😞

And money: I'd be willing to bet we see a few VMs turn up on the orphan report in a day or so.

I'll pass this on to Cirrus support; they'll know better what the java traceback actually means.

No current orphans on GCP, but I remember we did have a handful on EC2. Unfortunately there's no easy way to associate EC2 VMs with the task that spawned them, so this is just speculation.

"EC2" means "aarch64", doesn't it? I don't see any aarch64 failures in my flake history. (I do have flake history, just no logs.)

EC2 is for the aarch64, podman-machine, and windows tests. IIRC, the orphaned VMs I killed the other day were for all three (one of each).
If the IAM role Cirrus is using has the EC2 tagging permission, you can enable the experimental flag on the task:

```yaml
task:
  experimental: true
  ec2_instance:
    # ...
```

This way you'll see pretty names for the VMs on AWS. As for the two reported tasks: these are on GCP, and it seems the error happens after Cirrus successfully requests deletion of an instance, so things should be cleaned up. I've added handling of this error so tasks won't end up in a failed state. The orphaned AWS instances are concerning, and I'd appreciate any info you can provide about those VMs; I might be able to find something in internal logs. Ideally, please add the permissions and enable the `experimental` flag.
@fkorotkov thanks. IIRC, I already set up the EC2 tagging permissions, so I think all that's needed is turning on the `experimental` flag. The other thing that occurred to me: I/we can examine the EC2 instance user-data (script). I think the task or build ID is passed on the Cirrus-agent command line, no? In any case, I'll keep an eye out for these and let you know next time it happens.
In GCP, user-specified VM names are required upon creation, and Cirrus-CI generates helpful names containing the task ID. Unfortunately, in EC2 the VM IDs are auto-generated, and special permissions are required to allow setting a `Name` tag after creation. Since this permission has been granted, enable the `experimental` flag on EC2 tasks so that Cirrus can update VM name-tags. This is especially useful in troubleshooting orphaned VMs.

Ref: containers#18065 (comment)

Signed-off-by: Chris Evich <[email protected]>
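For anyone auditing the AWS side by hand, here is a minimal sketch of that kind of check, assuming boto3 with working AWS credentials; the idea that the task ID ends up in the `Name` tag comes from the commit message above, and everything else is illustrative:

```python
# Minimal sketch: list running EC2 instances with their Name tag and launch
# time, so long-lived VMs whose Name tag carries a Cirrus task ID can be
# spotted as possible orphans. Assumes boto3 and configured AWS credentials.
import boto3

ec2 = boto3.client("ec2")
paginator = ec2.get_paginator("describe_instances")

for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            name = tags.get("Name", "<untagged>")
            print(instance["InstanceId"], instance["LaunchTime"], name)
```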
@cevich good catch! Yes, the user data contains a bootstrap script that has a task ID and credentials for reporting updates and logs.
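A minimal sketch of the user-data check discussed above, assuming boto3; the regex is only a guess at how the bootstrap script might embed the task ID, not the actual Cirrus format:

```python
# Minimal sketch: fetch a suspect instance's user-data (the Cirrus bootstrap
# script) and search it for something that looks like a task ID.
# The task-ID pattern below is an assumption, not the real Cirrus format.
import base64
import re
import sys

import boto3

ec2 = boto3.client("ec2")
instance_id = sys.argv[1]  # e.g. an instance ID like "i-0123456789abcdef0"

attr = ec2.describe_instance_attribute(InstanceId=instance_id, Attribute="userData")
encoded = attr.get("UserData", {}).get("Value", "")
script = base64.b64decode(encoded).decode("utf-8", errors="replace")

match = re.search(r"task[-_ ]?id\D{0,5}(\d+)", script, re.IGNORECASE)
if match:
    print("possible task ID:", match.group(1))
else:
    print("no obvious task ID; full user-data follows:")
    print(script)
```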
Thanks for confirming. I'll keep my eyes out and let you know.

A friendly reminder that this issue had no activity for 30 days.
I think this is fixed -- we have not seen "Failed to stop: java something" since April 5. "Failed to start" continues to hit us, but that's an old classic that's been with us forever and ever, and since it fails quickly it's not as big a deal:

> [Cirrus] Failed to start
> [Cirrus] Instance failed to start!
Seeing this a lot lately, multiple times per day:
e.g. here, here. Unfortunately my flake logger has a bug: it isn't preserving log links for these. I'll track that down when I have time. ITM this is still worth filing.
Obvious first thought is that it's a problem on the Cirrus end, but I see nothing reported there.
We're also seeing lots of `Failed to start`, but those aren't as bad, because those happen within a few minutes. The `stop` ones happen after the task has finished, and all tests have run (possibly even passed), so it's a big waste of energy and time.