Cirrus CI: agent stopped responding #8068
@edsantiago Ref: https://cirrus-ci.org/faq/#ci-agent-stopped-responding

Note: there's a difference between this error happening on a VM-backed task and on a container task. Otherwise, I did mail support the other week about this, and they had also noted a recent uptick in these errors. However, since the agent process is the most basic of all the Cirrus-CI moving parts, and the issue is intermittent, it's going to be really hard to pin down a precise cause. It could be anything from memory/CPU exhaustion to network hiccups to outright cloud-VM failures, or any combination of those 😕

Though it's possible that #8080 could help. I'd be curious to know if there's any change in the occurrence rate once that merges.
Oh, I'd also comment that I have played around on a live VM while tests were running, and deliberately …
FWIW: Google does have a way to monitor things like CPU/memory usage (through an agent of their own). However, having played a bit with the logging/reporting side of this, it's very much intended to support long-running VMs. It also has the added problem that OOM/CPU/network trouble could impact the logging agent itself 😖 Worse, if a VM dies (and therefore also the logging agent), we stand a good chance of losing the most important/valuable data (i.e. the set not yet sent to the logging service). [sigh]

Honestly, I'm really not sure what can be done here @edsantiago. I feel the automatic retry Cirrus-CI performs in these situations might be the best we can do, assuming the lack of some other obvious failure pattern. I know the Cirrus-CI service side of the equation is well monitored, so failures on their end will be dealt with (having met some threshold). It's great that you're collecting these statistics, and that may prove useful in the future...even if (as in this case) the results are seemingly random.
You keep saying "automatic-retry". I've never seen that: it has been my experience that we all have to manually restart these failures, even hours afterward. I think there is a disconnect somewhere.
Let me see if I can find an example for you...
...here: In PR #8021
While I'm looking at it...it's impossible for us to deduce any pattern from the …

Curiously, eyeballing your list above, the 'sys remote' entries appear to have the greatest number of incidents. This is likely the best, most actionable data set, given those tests run one at a time and in a fixed order. Is there any way you can include task "duration" with your reporting (for the above, and going forward)? That might help narrow down whether a particular test is causing trouble, given a (roughly) fixed execution speed.
I've spent quite some time pondering this issue today. I double-checked and confirmed we are using the "Premium" (highest) tier of networking in GCP. Cirrus-CI uses GKE in the same zone as our VMs, so although the packets are briefly routed over the internet, the latency (from the agent processes to their service) will be quite low. Beyond complaining about some specific outage or hiccup, there's not much we can control on the networking side of things.

Similarly, we can't control for problems within the agent process itself (bugs). Though for this class of issue, Cirrus-CI would notice widespread problems (beyond just our projects). Really, the only thing we have any direct control over is what we do on the VMs, and how they're set up. Unfortunately, for this class of agent-affecting problems, we completely lose all traces of the cause when Cirrus-CI clobbers the VM 😖

Focusing on what we can control, and following Occam's Razor, it seems to me the simplest theory would be:
@baude @mheon WDYT about disabling (even temporarily) the ginkgo test randomization? The idea being that tests would run in roughly the same order every time, so if there's a "bad" test (or podman operation), the failures would cluster around certain VM lifetimes. If there's a way for @edsantiago to log the "agent-failed" flake task IDs, I can pull the timing data via the Cirrus-CI GraphQL interface (see the sketch below). Despite losing the logs at the time of failure, pinning down a duration-before-failure would greatly reduce the number of haystacks we need to search through. For example, if jobs run in sequence and we graph the time-to-failure, we might be able to see failure clusters which can then be loosely correlated to a smaller number of tests. That would allow much more intense scrutiny of those tests only, narrowing down the set of possible causes.

Save for additional insights...this is about the best I can do to support resolution of this issue. In any case, it's going to be a LONG and slow slog 😞
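As a very rough sketch of what that GraphQL pull could look like (the task ID is a placeholder, and the `name`/`status`/`durationInSeconds` field names are assumptions about the Cirrus-CI schema, so they may need adjusting):

```bash
# Hypothetical sketch: ask the Cirrus-CI GraphQL endpoint for one task's
# timing data.  The task ID is a placeholder, and the queried field names
# are assumptions about the schema.
task_id="1234567890"

jq -n --arg q "{ task(id: \"$task_id\") { name status durationInSeconds } }" \
      '{query: $q}' |
  curl -s https://api.cirrus-ci.com/graphql \
       -H 'Content-Type: application/json' \
       --data @- |
  jq .
```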
@cevich the error is happening on …
Yes...hmmm, maybe this is an okay starting place then, to check whether it's a similar set of tests causing trouble?

Another thought I just had: it's possible to block destruction of a VM (by Cirrus) from the VM itself (using gcloud), so maybe it's time for a temporary PR with that change. Perhaps script it to allow VM destruction only after the system tests complete (a rough sketch follows). Then we simply re-run the tests repeatedly until we're left with a set of orphaned VMs representing only failure cases.
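One way this could be scripted (an assumption on my part, not an existing change: it relies on GCE deletion protection and on the VM's service account being allowed to update its own instance):

```bash
# Sketch: turn on GCE deletion protection from inside the VM so Cirrus-CI's
# cleanup call can't delete it, then clear the flag only if the tests pass.
meta="http://metadata.google.internal/computeMetadata/v1/instance"
name=$(curl -s -H 'Metadata-Flavor: Google' "$meta/name")
zone=$(curl -s -H 'Metadata-Flavor: Google' "$meta/zone" | awk -F/ '{print $NF}')

gcloud compute instances update "$name" --zone "$zone" --deletion-protection

# Placeholder for whatever actually runs the system tests:
if make localsystem; then
    # Tests passed: allow Cirrus-CI to clean up normally.
    gcloud compute instances update "$name" --zone "$zone" --no-deletion-protection
fi
# On failure (or a hang), the VM stays behind as an orphan for inspection.
```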
Here's a new one:
@edsantiago Thanks for reporting; this is a different, unrelated problem. IIRC it's networking related, something to do with the backend of Cirrus-CI. They're aware of the problem and have taken steps to lower the occurrence rate (extending timeouts / adding retry capability).
I managed to catch three Fedora VMs in one go that hit this problem. The VMs are completely hung: inaccessible by ssh and not responding to pings. However, since I blocked Cirrus-CI from removing them, I was able to examine their serial-console output. This is what appears (more or less) on every one:
The great news about this is that I can work around it by switching to a different I/O scheduler (a sketch of the switch is below).
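For reference, the switch itself is just a sysfs write; the device name (sda) and the mq-deadline choice below are illustrative assumptions, not necessarily what our images will end up using:

```bash
# Show the current scheduler for the disk; the active one is bracketed,
# e.g. "[bfq] mq-deadline none".
cat /sys/block/sda/queue/scheduler

# Switch to mq-deadline for this boot only (not persistent).
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# One way to make it stick across reboots is a udev rule, e.g. in
# /etc/udev/rules.d/60-io-scheduler.rules:
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="mq-deadline"
```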
Welp...it appears there are a TON of bugs open and re-opened on this: https://bugzilla.redhat.com/buglist.cgi?quicksearch=bfq So it's not only us seeing this problem re-emerge. One of the bugs contains a lively discussion re: resetting the default to 'deadline' for all types of storage, going forward. |
@cevich nice work! Thank you!
You're welcome...I'm really happy it wasn't some obscure cause as I feared, but rather something I've seen before. After reading through the various bugzillas, this is definitely a re-emergence in Fedora, and it's now affecting RHEL also. Previously, the podman integration tests were instrumental in helping upstream develop a fix. However, that took MONTHS, and having been twice burned by this, I'm shy about participating again 😕

My current (Friday) thinking: insert head into pile of finely granulated silicates.
This is now far and away the most common CI flake.
- make nixpkgs (PR #7961)
- podman pod inspect (PR #8021)
- make nixpkgs (PR #7961)
- podman pod inspect (PR #8021)
- podman pod inspect (PR #8021)