Mac test timeout #461
In #454, the builds also took an extremely long time. My first thought was that maybe these systems are heavily loaded. Maybe, now that we made the repo public last week and GitHub started paying for the CI resources, they use a lower class of storage? I can't find any public information confirming that they offer different classes of service here, just different limits. I looked at build times, and while they have increased notably in the last couple of weeks, I'm not convinced it's because of a different class of service.

@pfmooney had a good thought: maybe this is memory exhaustion? I've certainly seen a bunch of builds fail recently on my 64 GiB Helios machine due to running out of swap. That's when I dug into build times. Details are in this repo, but here's the summary of build times over time by OS:

Some observations:
It's far from a smoking gun, but this behavior seems consistent with gradually increasing memory usage that recently hit a tipping point. Also consistent with that: slowness in forking processes, since forking involves at least reserving a bunch more memory at once. Casting some doubt: the Mac runners get 14 GiB of memory, while the Linux ones get only half that. But the Linux runners also have only 2 cores, so they're probably using less concurrency. I've proposed two mitigations:
Here's another data point: maybe these are running out of disk space? Today @jgallagher observed similar timeouts on a Linux system:
In that case, CockroachDB did start after a while:
Running
A subsequent run failed with:
and then saw:
On this system there's plenty of free memory, but the disk is 98% full. When using a --store-dir on tmpfs, this reliably starts up quickly. The punchline is: we know some filesystems get very slow as the disk fills up. Maybe that's what's been going on with the Mac test runners?
Short followup: after freeing up space on my disk (utilization of ~80% instead of 98%), all my Cockroach timeout issues have gone away, which seems to confirm this reasoning. FWIW, my filesystem is
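Since the root cause here seems to be the filesystem filling up rather than CockroachDB itself, one cheap diagnostic is to check how full the filesystem holding the store directory is before starting the database. The following is a minimal standalone sketch of that check (not code from this issue or the test suite); it assumes the `libc` crate is available as a dependency, and the 90% threshold and command-line interface are purely illustrative.

```rust
// Sketch: report how full the filesystem containing a given path is, and
// warn when it's above a threshold. Requires `libc` in Cargo.toml.
use std::ffi::CString;

fn disk_utilization_percent(path: &str) -> Option<f64> {
    let c_path = CString::new(path).ok()?;
    // statvfs() reports block counts for the filesystem containing `path`.
    let mut st: libc::statvfs = unsafe { std::mem::zeroed() };
    if unsafe { libc::statvfs(c_path.as_ptr(), &mut st) } != 0 {
        return None;
    }
    let total = st.f_blocks as f64;
    let avail = st.f_bavail as f64;
    if total == 0.0 {
        return None;
    }
    Some(100.0 * (1.0 - avail / total))
}

fn main() {
    // Path defaults to the current directory; pass the CockroachDB store
    // directory (e.g. whatever --store-dir points at) instead.
    let store_dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    match disk_utilization_percent(&store_dir) {
        Some(pct) if pct > 90.0 => eprintln!(
            "warning: filesystem holding {store_dir} is {pct:.0}% full; \
             CockroachDB may be very slow to start"
        ),
        Some(pct) => println!("filesystem holding {store_dir} is {pct:.0}% full"),
        None => eprintln!("could not stat filesystem for {store_dir}"),
    }
}
```

Run against the store directory on the Linux box above, a check like this would presumably have flagged the 98% utilization immediately.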
I couldn't help writing a small test program to see how the Mac and Linux GitHub Actions runners behave when disk space runs low. Sadly (for the purpose of debugging this issue), the Mac runner behaved quite well:
It failed crisply (ENOSPC) with no degradation in latency. This might not be testing the same thing our test suite does, so it's possible it's still related... but this was not the confirmation I was hoping for. Interestingly, the Linux test runner behaves much worse:
The runner has been sitting there for 10 minutes without having failed or emitted any other output.
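The test program itself isn't shown in this issue, but the experiment described is simple enough to sketch: keep appending fixed-size chunks to a file, timing each write-plus-sync, until the filesystem either fails crisply with ENOSPC (as on the Mac runner) or latency degrades badly (as apparently happened on the Linux runner). This is a rough reconstruction under my own assumptions, not the actual program; the 64 MiB chunk size, file name, and output format are made up.

```rust
// Sketch: fill the disk while measuring per-write latency.
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let chunk = vec![0u8; 64 * 1024 * 1024]; // 64 MiB per write
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open("fill-disk.dat")?;

    for i in 0.. {
        let start = Instant::now();
        // Time the write plus a sync so the latency reflects the filesystem,
        // not just the page cache.
        match file.write_all(&chunk).and_then(|_| file.sync_all()) {
            Ok(()) => println!("write {i:5}: {:?}", start.elapsed()),
            Err(e) => {
                // Ideally this is a crisp ENOSPC; a runner that instead hangs
                // or slows to a crawl never reaches this branch promptly.
                println!("write {i:5}: error after {:?}: {e}", start.elapsed());
                break;
            }
        }
    }
    Ok(())
}
```

If run on a CI runner, the output file should live on the same filesystem the test suite uses, and it should be deleted afterwards so later steps aren't starved of space.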
We've had a number of similar test failures recently in the macOS GitHub Actions jobs. Examples:
They all look similar. Take this one from that last example:
The second attempt on #454 is even more interesting:
The integration test in question runs a couple of command-line tools from Nexus with a few cases: some provide no arguments, which should immediately generate a usage message. Others provide arguments that cause the command to generate an OpenAPI schema. Both of these operations should be extremely quick and not block on much. In all the cases we've looked at, some of these commands didn't complete within 10 seconds!
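To make the failure mode concrete, here is roughly the shape of such a check as a standalone sketch (this is not the actual omicron test): run one of the command-line tools with no arguments and fail unless it exits with a usage message well within the 10-second budget. The binary path and the usage-string check are placeholders.

```rust
// Sketch: run a CLI with no arguments and enforce a 10-second deadline.
use std::process::Command;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // "nexus-example-cli" is a placeholder; the real test exercises
        // several command-line tools built from Nexus.
        let result = Command::new("./target/debug/nexus-example-cli").output();
        let _ = tx.send(result);
    });

    match rx.recv_timeout(Duration::from_secs(10)) {
        Ok(Ok(output)) => {
            // With no arguments the tool is expected to print usage and exit
            // immediately; the exact text depends on the argument parser.
            let stderr = String::from_utf8_lossy(&output.stderr);
            assert!(
                stderr.contains("Usage") || stderr.contains("USAGE"),
                "expected a usage message, got: {stderr}"
            );
        }
        Ok(Err(e)) => panic!("failed to run command: {e}"),
        Err(_) => panic!("command did not produce output within 10 seconds"),
    }
}
```

In the failing runs above, it's this kind of deadline that fires: the child process simply hasn't finished within 10 seconds, even though printing a usage message should take milliseconds.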
(more details coming)