Mac test timeout #461

Closed
davepacheco opened this issue Dec 1, 2021 · 5 comments
Labels
Test Flake Tests that work. Wait, no. Actually yes. Hang on. Something is broken.

Comments

@davepacheco
Collaborator

We've had a number of similar test failures recently in the MacOS GitHub Action. Examples:

They all look similar. Take this one from that last example:

    Running `/Users/runner/work/omicron/omicron/target/debug/deps/test_commands-64e3cf8a3819ddcb`

running 5 tests
test test_nexus_bad_config ... FAILED
test test_nexus_invalid_config ... FAILED
test test_nexus_no_args ... FAILED
test test_nexus_openapi_internal ... FAILED
test test_nexus_openapi ... FAILED

failures:

---- test_nexus_bad_config stdout ----
thread 'test_nexus_bad_config' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/nexus nonexistent', test-utils/src/dev/test_cmds.rs:81:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test_nexus_invalid_config stdout ----
writing temp config: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.0
thread 'test_nexus_invalid_config' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/nexus /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.0', test-utils/src/dev/test_cmds.rs:81:13

---- test_nexus_no_args stdout ----
thread 'test_nexus_no_args' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/nexus', test-utils/src/dev/test_cmds.rs:81:13

---- test_nexus_openapi_internal stdout ----
writing temp config: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.8
thread 'test_nexus_openapi_internal' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/nexus /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.8 --openapi-internal', test-utils/src/dev/test_cmds.rs:81:13

---- test_nexus_openapi stdout ----
writing temp config: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.7
thread 'test_nexus_openapi' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/nexus /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.78112.7 --openapi', test-utils/src/dev/test_cmds.rs:81:13


failures:
    test_nexus_bad_config
    test_nexus_invalid_config
    test_nexus_no_args
    test_nexus_openapi
    test_nexus_openapi_internal

test result: FAILED. 0 passed; 5 failed; 0 ignored; 0 measured; 0 filtered out; finished in 20.97s

error: test failed, to rerun pass '-p omicron-nexus --test test_commands'
Error: Process completed with exit code 101.

The second attempt on #454 is even more interesting:

     Running `/Users/runner/work/omicron/omicron/target/debug/deps/test_commands-3f68dfabe4278858`
running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s


running 2 tests
test test_oximeter_openapi ... FAILED
test test_oximeter_no_args ... FAILED

failures:

---- test_oximeter_openapi stdout ----
writing temp config: /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.28869.0
thread 'test_oximeter_openapi' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/oximeter /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/test_commands_config.28869.0 --openapi', test-utils/src/dev/test_cmds.rs:85:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test_oximeter_no_args stdout ----
thread 'test_oximeter_no_args' panicked at 'timed out waiting for command for 10000 ms: /Users/runner/work/omicron/omicron/target/debug/oximeter', test-utils/src/dev/test_cmds.rs:85:13


failures:
    test_oximeter_no_args
    test_oximeter_openapi

test result: FAILED. 0 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 10.07s

error: test failed, to rerun pass '-p oximeter-collector --test test_commands'
Error: Process completed with exit code 101.

The integration test in question runs a couple of command-line tools from Nexus in a few different ways: some cases provide no arguments, which should immediately produce a usage message; others provide arguments that cause the command to emit an OpenAPI schema. Both of these operations should be extremely quick and shouldn't block on much. Yet in all the cases we've looked at, some of these commands didn't complete within 10 seconds!
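
For context, here's a minimal sketch of the kind of wait that produces the "timed out waiting for command" panic shown above. This is not the actual test-utils code; the names, constants, and polling approach are assumptions for illustration.

    use std::process::{Child, Command};
    use std::time::{Duration, Instant};

    // Hypothetical timeout matching the "10000 ms" in the panic messages above.
    const TIMEOUT: Duration = Duration::from_millis(10_000);
    const POLL_INTERVAL: Duration = Duration::from_millis(25);

    /// Spawn `cmd` and poll until it exits, panicking if it runs too long.
    fn run_command_with_timeout(mut cmd: Command) -> std::process::ExitStatus {
        let label = format!("{:?}", cmd);
        let mut child: Child = cmd.spawn().expect("failed to spawn command");
        let start = Instant::now();
        loop {
            // `try_wait()` returns Ok(Some(status)) once the child has exited.
            if let Some(status) = child.try_wait().expect("failed to wait on child") {
                return status;
            }
            if start.elapsed() > TIMEOUT {
                panic!(
                    "timed out waiting for command for {} ms: {}",
                    TIMEOUT.as_millis(),
                    label
                );
            }
            std::thread::sleep(POLL_INTERVAL);
        }
    }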

(more details coming)

@davepacheco
Collaborator Author

In #454, the builds also took an extremely long time. My first thought was that these systems might be heavily loaded. Maybe when we made the repo public last week and GitHub started paying for the CI resources, they moved us to a lower class of storage? I can't find any public information confirming that GitHub uses different classes of service here, just different limits. I did look at build times, and while they have increased notably in the last couple of weeks, I'm not convinced it's because of a different class of service.

@pfmooney had a good thought: maybe this is memory exhaustion? I've certainly seen a bunch of builds fail recently on my 64 GiB Helios machine due to running out of swap. This is when I dug into build times. Details are in this repo, but here's the summary of build times over time by OS:

[chart: summary of build times over time by OS]

Some observations:

  • Mac is generally faster than Ubuntu.
  • Mac has way more outliers.
  • Some of these outliers are truly horrific.
  • Builds have gotten notably slower this month.
  • The Mac outliers have gotten a lot worse in the last couple of weeks.

It's far from a smoking gun, but this behavior seems consistent with gradually increasing memory usage that recently hit a tipping point. Also consistent with that theory: slowness forking processes, since a fork involves at least reserving a bunch more memory at once.

Casting some doubt on that: the Mac runners get 14 GiB of memory, while the Linux ones get only half that. But the Linux runners also have only 2 cores, so they're probably running with less concurrency.

I've proposed two mitigations:

  • bump command timeout #462 should be innocuous and would give us more slack. It won't improve the slow CI times, but it might avoid failures, which is kind of a big deal when each run takes an hour, fails spuriously, and then you have to sync up again.
  • try limiting concurrency on MacOS #460 explicitly limits concurrency in the CI tests on MacOS in an attempt to reduce the number of Cockroach and Nexus instances running at once. We honestly don't have much data on whether or how much this will help, but I'm not sure how to collect it without trying. My experience is that memory and swap accounting is tricky, OS-dependent, workload-dependent, etc., and even then it doesn't always explain pathological behavior.

davepacheco added the Test Flake label Jul 15, 2022
@davepacheco
Copy link
Collaborator Author

davepacheco commented Jul 21, 2022

Here's another data point: maybe these are running out of disk space?

Today @jgallagher observed similar timeouts on a Linux system:

error: failed to run custom build command for `nexus-test-utils v0.1.0 (/home/john/github/omicron/nexus/test-utils)`

Caused by:
  process didn't exit successfully: `/home/john/github/omicron/target/debug/build/nexus-test-utils-5397c4c0fdce8329/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rerun-if-changed=../../common/src/sql/dbinit.sql
  cargo:rerun-if-changed=../../tools/cockroachdb_checksums
  cargo:rerun-if-changed=../../tools/cockroachdb_version

  --- stderr
  Jul 21 15:44:21.881 INFO cockroach temporary directory: /tmp/.tmpCxNNgo
  Jul 21 15:44:21.881 INFO cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/home/john/github/omicron/target/debug/build/nexus-test-utils-6ca055c80912e1f7/out/crdb-base,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /tmp/.tmpCxNNgo/listen-url
  thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: TimedOut { pid: 1058200, time_waited: 30.001202088s }', /home/john/github/omicron/test-utils/src/dev/mod.rs:142:42
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

In that case, CockroachDB did start after a while:

CockroachDB node starting at 2022-07-21 15:47:19.720975018 +0000 UTC (took 177.5s)

Running omicron-dev db-run sometimes seemed to work, but sometimes seemed to time out. This one got stuck populating:

omicron-dev: using temporary directory for database store (cleaned up on clean exit)
omicron-dev: will run this to start CockroachDB:
cockroach start-single-node --insecure --http-addr=:0 --store=path=/tmp/.tmpcYC00N/data,ballast-size=0 --listen-addr 127.0.0.1:32221 --listening-url-file /tmp/.tmpcYC00N/listen-url
omicron-dev: temporary directory: /tmp/.tmpcYC00N
*
* WARNING: ALL SECURITY CONTROLS HAVE BEEN DISABLED!
*
* This mode is intended for non-production testing only.
*
* In this mode:
* - Your cluster is open to any client that can access 127.0.0.1.
* - Intruders with access to your machine or network can observe client-server traffic.
* - Intruders can log in without password and read or write any data in the cluster.
* - Intruders can consume all your server's resources and cause unavailability.
*
*
* INFO: To start a secure server without mandating TLS for clients,
* consider --accept-sql-without-tls instead. For other options, see:
*
* - https://go.crdb.dev/issue-v/53404/v21.2
* - https://www.cockroachlabs.com/docs/v21.2/secure-a-cluster.html
*

omicron-dev: child process: pid 1086085
omicron-dev: CockroachDB listening at: postgresql://[email protected]:32221/omicron?sslmode=disable
omicron-dev: populating database
CockroachDB node starting at 2022-07-21 16:27:49.382182541 +0000 UTC (took 1.1s)
build:               CCL v21.2.9 @ 2022/04/12 17:05:18 (go1.16.6)
webui:               http://127.0.0.1:36031
sql:                 postgresql://[email protected]:32221/defaultdb?sslmode=disable
sql (JDBC):          jdbc:postgresql://127.0.0.1:32221/defaultdb?sslmode=disable&user=root
RPC client flags:    cockroach <client cmd> --host=127.0.0.1:32221 --insecure
logs:                /tmp/.tmpcYC00N/data/logs
temp dir:            /tmp/.tmpcYC00N/data/cockroach-temp096702671
external I/O path:   /tmp/.tmpcYC00N/data/extern
store[0]:            path=/tmp/.tmpcYC00N/data
storage engine:      pebble
status:              initialized new cluster
clusterID:           d089a5b9-5636-41de-ac3d-c1bca332e71c
nodeID:              1
^CNote: a second interrupt will skip graceful shutdown and terminate forcefully
initiating graceful shutdown of server
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /tmp/.tmpcYC00N
omicron-dev: populating database: populate: populating Omicron database: db error: ERROR: polling for queued jobs to complete: poll-show-jobs: context canceled: ERROR: polling for queued jobs to complete: poll-show-jobs: context canceled

A subsequent run failed with:

omicron-dev: cockroach did not report listen URL after 30.001262353s (may still be running as pid 1087402 and temporary directory may still exist)

and then saw:

CockroachDB node starting at 2022-07-21 16:33:37.763758339 +0000 UTC (took 48.4s)

On this system there's plenty of free memory but the disk is 98% full. When using a --store-dir on tmpfs, this reliably starts up quickly.

The punchline is: we know some filesystems get very slow as the disk fills up. Maybe that's what's been going on with the Mac test runners?
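
For context on where those 30-second waits come from: the step that keeps timing out waits for CockroachDB to write its listen URL to the file passed via --listening-url-file. Here's a minimal sketch of that kind of wait (hypothetical names and constants, not the actual test-utils code):

    use std::path::Path;
    use std::time::{Duration, Instant};

    // Hypothetical deadline matching the ~30s waits reported above.
    const STARTUP_TIMEOUT: Duration = Duration::from_secs(30);

    /// Wait for CockroachDB to write its listen URL to `listen_url_file`.
    fn wait_for_listen_url(listen_url_file: &Path) -> Result<String, String> {
        let start = Instant::now();
        loop {
            // CockroachDB writes the URL once it is actually listening; a
            // missing or empty file means "not ready yet".
            if let Ok(contents) = std::fs::read_to_string(listen_url_file) {
                let url = contents.trim();
                if !url.is_empty() {
                    return Ok(url.to_string());
                }
            }
            if start.elapsed() > STARTUP_TIMEOUT {
                return Err(format!(
                    "cockroach did not report listen URL after {:?}",
                    start.elapsed()
                ));
            }
            std::thread::sleep(Duration::from_millis(100));
        }
    }

If CockroachDB is slow to start because its store directory sits on a nearly full (and therefore slow) filesystem, a fixed deadline like this is exactly what would trip.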

@jgallagher
Contributor

> On this system there's plenty of free memory but the disk is 98% full. When using a --store-dir on tmpfs, this reliably starts up quickly.
>
> The punchline is: we know some filesystems get very slow as the disk fills up. Maybe that's what's been going on with the Mac test runners?

Short followup: After freeing up space on my disk (utilization of ~80% instead of 98%), all my Cockroach timeout issues have gone away, which seems to confirm this reasoning. FWIW my filesystem is ext4.

@davepacheco
Collaborator Author

I couldn't resist writing a small test program to see how the Mac and Linux GitHub Actions runners behave when disk space runs low. Sadly (for the purpose of debugging this issue), the Mac runner behaved quite well:

WARNING: THIS PROGRAM WILL EXHAUST ALL AVAILABLE DISK SPACE!
2022-07-21T21:07:36.180041+00:00 starting
2022-07-21T21:07:38.872964+00:00 wrote 1073741824 bytes to sink_hole/file-0 (took 2.693s)
2022-07-21T21:07:41.501845+00:00 wrote 1073741824 bytes to sink_hole/file-1 (took 2.629s)
2022-07-21T21:07:44.211972+00:00 wrote 1073741824 bytes to sink_hole/file-2 (took 2.710s)
2022-07-21T21:07:46.862295+00:00 wrote 1073741824 bytes to sink_hole/file-3 (took 2.650s)
...
2022-07-21T21:16:00.119435+00:00 wrote 1073741824 bytes to sink_hole/file-172 (took 1.894s)
2022-07-21T21:16:02.778308+00:00 wrote 1073741824 bytes to sink_hole/file-173 (took 2.659s)
2022-07-21T21:16:02.784827+00:00 FAILED writing 1073741824 bytes to sink_hole/file-174 (failed write took 0.007s): write "sink_hole/file-174": No space left on device (os error 28)
Error: write "sink_hole/file-174"
Caused by:
    No space left on device (os error 28)

It failed crisply (ENOSPC) with no degradation in latency. This might not be testing the same thing our test suite does, so it's possible disk exhaustion is still related... but this was not the confirmation I was hoping for.

Interestingly, the Linux test runner behaves much worse:

WARNING: THIS PROGRAM WILL EXHAUST ALL AVAILABLE DISK SPACE!
2022-07-21T21:07:19.935602283+00:00 starting
2022-07-21T21:07:21.429943293+00:00 wrote 1073741824 bytes to sink_hole/file-0 (took 1.494s)
2022-07-21T21:07:26.750829115+00:00 wrote 1073741824 bytes to sink_hole/file-1 (took 5.321s)
2022-07-21T21:07:32.029367358+00:00 wrote 1073741824 bytes to sink_hole/file-2 (took 5.279s)
...
2022-07-21T21:09:28.240502855+00:00 wrote 1073741824 bytes to sink_hole/file-24 (took 5.367s)
2022-07-21T21:09:33.541949629+00:00 wrote 1073741824 bytes to sink_hole/file-25 (took 5.301s)
2022-07-21T21:09:38.917331117+00:00 wrote 1073741824 bytes to sink_hole/file-26 (took 5.375s)

The runner has been sitting there for 10 minutes without having failed or emitted any other output.
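
The test program itself isn't attached to this thread; here's a minimal sketch of the kind of disk-filler described above (file naming and sizes are inferred from the output; chunking and buffering details are assumptions):

    use std::fs;
    use std::io::Write;
    use std::time::Instant;

    const CHUNK: usize = 1 << 20;     // write in 1 MiB chunks
    const FILE_SIZE: usize = 1 << 30; // 1 GiB per file

    fn main() -> std::io::Result<()> {
        eprintln!("WARNING: THIS PROGRAM WILL EXHAUST ALL AVAILABLE DISK SPACE!");
        fs::create_dir_all("sink_hole")?;
        let buf = vec![0u8; CHUNK];
        for i in 0.. {
            let path = format!("sink_hole/file-{}", i);
            let start = Instant::now();
            let mut file = fs::File::create(&path)?;
            for _ in 0..(FILE_SIZE / CHUNK) {
                // This is the write that eventually fails with ENOSPC
                // ("No space left on device").
                file.write_all(&buf)?;
            }
            file.sync_all()?;
            println!(
                "wrote {} bytes to {} (took {:.3}s)",
                FILE_SIZE,
                path,
                start.elapsed().as_secs_f64()
            );
        }
        Ok(())
    }

The interesting comparison is how each OS behaves as free space approaches zero: a crisp ENOSPC (as on the Mac runner) versus writes that stall indefinitely (as on the Linux runner).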

@iliana
Contributor

iliana commented Jul 29, 2022

#1508

iliana closed this as completed Jul 29, 2022