test hung after CockroachDB died during initialization #1224
Comments
The entire temp directory (including the test log, CockroachDB data directory, Clickhouse directory, etc.) is attached to #1223. Here are the processes running
pid 26042 is the CockroachDB process that exited on SIGSEGV. So why is the test program hung? The test we're running starts like this:
Now, the
After that, all we have are repeats of those last two entries: incoming metric collector requests followed by 503 errors. The log goes on until 2022-06-17T14:56:04.528359809Z with no other entries of note. How does this timing interact with the CockroachDB exit? CockroachDB reported that it was starting to dump stderr at 22:45:17.402:
but the time of the file is 22:45:20.716:
so it took over 3 seconds to do that. (Interestingly, the test log shows successful requests that would have hit the database that started as late as 22:45:19.69231185Z, well after CockroachDB identified the fatal failure. So it was still functioning for a few seconds after that.) Anyway, this is certainly around the time that the test stopped making forward progress. Now where exactly were we? The last entry that I can pin to a particular point in setup is this one:
That's the client in the simulated sled agent reporting a successful request to Nexus to register the sled agent. That would happen within:
which is about here: omicron/sled-agent/src/sim/server.rs Line 73 in 236f2ac
Now, in the config created by the caller, there are no zpools configured, so I think we might pop many frames back up and start the Oximeter server and producer. I don't see the "starting oximeter server" log message, but I think that's because the Oximeter config uses a different log config instead of the test suite one. I'll file a separate bug for this. At this point, I expect our stack to look like this:
Inside omicron/oximeter/collector/src/lib.rs Lines 422 to 454 in 236f2ac
and that's the request that we're seeing repeatedly in the test log failing with a 503 because the database is down. The only real fix here is to time out the test if it takes too long. (It's been 6 hours in this state.) I'm not sure how best to do that: we could have each of the
Note that this would be a lot faster to debug if we had the Oximeter log, because it would show very explicitly that it was trying to register itself, that it failed, and that it was planning to retry. I'll file a separate bug about that.
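One way to put an upper bound on a setup step is to run it on a separate thread and wait for its result with a timeout. The following is only a minimal sketch using the Rust standard library; `run_with_timeout` is a hypothetical helper, not something in omicron, and the real test setup is async, so an actual fix would likely look different.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: run `work` on its own thread and wait at most
/// `limit` for its result. Returns `None` if the limit passes first
/// (or if `work` panics, which drops the sender without sending).
pub fn run_with_timeout<T, F>(work: F, limit: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the receiver already gave up; ignore that.
        let _ = sender.send(work());
    });
    // `recv_timeout` returns an error on timeout or disconnect.
    receiver.recv_timeout(limit).ok()
}
```

Note that this does not stop the worker thread on timeout; it only lets the caller stop waiting and report a failure instead of hanging indefinitely.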
I've put some data (pfiles output and core file) from the test process on catacomb:
In #1223, I found the test hung after CockroachDB died on SIGSEGV.
The context was that I was running the try_repro.sh script to try to repro #1130. This runs (among other things) ./target/debug/deps/test_all-d586ea57740e3382 test_disk_create_disk_that_already_exists_fails in a loop. This was on commit 095dfb6, which is not from "main", but is pretty close to 236f2ac from "main".
When I found the problem, the test was hung, having emitted only this at the end:
The punchline is that we were stuck during test setup waiting for Oximeter to register with Nexus. This was in an infinite backoff (that is, infinite tries, not an infinite delay) because these requests were failing because the database was down.
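To illustrate the difference between "infinite tries" and a bounded retry, here is a rough sketch of a backoff loop that gives up after an overall deadline. It is synchronous and uses only the standard library; `retry_with_deadline` and `RetryError` are hypothetical names for illustration, not the retry code that Oximeter actually uses.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Why the retry loop gave up.
#[derive(Debug, PartialEq)]
pub enum RetryError<E> {
    /// The overall deadline passed; carries the last error and attempt count.
    DeadlineExceeded { last_error: E, attempts: u32 },
}

/// Hypothetical helper: call `op` until it succeeds, doubling the delay
/// between attempts (capped at `max_delay`), and stop once `deadline`
/// has elapsed since the first attempt.
pub fn retry_with_deadline<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    initial_delay: Duration,
    max_delay: Duration,
    deadline: Duration,
) -> Result<T, RetryError<E>> {
    let start = Instant::now();
    let mut delay = initial_delay;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        match op() {
            Ok(value) => return Ok(value),
            Err(error) => {
                let elapsed = start.elapsed();
                if elapsed >= deadline {
                    return Err(RetryError::DeadlineExceeded {
                        last_error: error,
                        attempts,
                    });
                }
                // Sleep for the current delay, but never past the deadline.
                thread::sleep(delay.min(deadline - elapsed));
                // Double the delay for next time, capped at `max_delay`.
                delay = delay.saturating_mul(2).min(max_delay);
            }
        }
    }
}
```

With a loop like this, a database that never comes back produces an error carrying the last failure (for example, the 503) rather than a test that waits forever.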