test hung after CockroachDB died during initialization #1224

Open
davepacheco opened this issue Jun 17, 2022 · 2 comments
Labels
database (Related to database access) · Test Flake (Tests that work. Wait, no. Actually yes. Hang on. Something is broken.)

davepacheco commented Jun 17, 2022

In #1223, I found the test hung after CockroachDB died on SIGSEGV.

For context, I was running the try_repro.sh script to try to reproduce #1130. That runs (among other things) ./target/debug/deps/test_all-d586ea57740e3382 test_disk_create_disk_that_already_exists_fails in a loop. This was on commit 095dfb6, which is not from "main" but is pretty close to 236f2ac from "main".

When I found the problem, the test was hung, having emitted only this at the end:

ATTEMPT 246

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails has been running for over 60 seconds

The punchline is that we were stuck during test setup waiting for Oximeter to register with Nexus. This was an infinite retry loop with backoff (that is, infinite tries, not an infinite delay), because the registration requests kept failing while the database was down.

davepacheco commented:

The entire temp directory (including the test log, CockroachDB data directory, Clickhouse directory, etc.) is attached to #1223. Here are the processes that were running:

21113  /bin/bash ./try_repro.sh
  26037  ../target/debug/deps/test_all-d586ea57740e3382 test_disk_create_disk_that_alrea
    26042  <defunct>
    26046  clickhouse server --log-file /dangerzone/omicron_tmp/try_repro.21113/.tmpUduhcD

pid 26042 is the CockroachDB process that exited on SIGSEGV. So why is the test program hung?

The test we're running starts like this:

#[nexus_test]
async fn test_disk_create_disk_that_already_exists_fails(
    cptestctx: &ControlPlaneTestContext,
) {
    let client = &cptestctx.external_client;
    DiskTest::new(&cptestctx).await;

Now, the #[nexus_test] macro causes us to invoke nexus_test_utils::test_setup, and I believe we're hung inside there. I'll get to that in a minute. The last entries leading up to the problem are these:

[2022-06-16T22:45:19.62853269Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/ServerContext/26037 on ivanova: created 1 silo user role assignments
[2022-06-16T22:45:19.629129481Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: setting up sled agent server (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.676782022Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: created simulated sled agent (server=b6d65341-167c-41df-9b5c-41cded99c229, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, sim_mode=Explicit)
[2022-06-16T22:45:19.681422368Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/disks/{disk_id}, method=PUT)
[2022-06-16T22:45:19.681904141Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/disks/{disk_id}/poke, method=POST)
[2022-06-16T22:45:19.682314018Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/instances/{instance_id}, method=PUT)
[2022-06-16T22:45:19.682701844Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/instances/{instance_id}/poke, method=POST)
[2022-06-16T22:45:19.683034514Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: registered endpoint (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229, path=/update, method=POST)
[2022-06-16T22:45:19.683373236Z]  INFO: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: listening (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.68366514Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: successfully registered DTrace USDT probes (local_addr=[::1]:48400, sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.683941303Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: contacting server nexus (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.684179965Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client request (sled_id=b6d65341-167c-41df-9b5c-41cded99c229, body=Some(Body), method=POST)
    uri: http://127.0.0.1:51560/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229
[2022-06-16T22:45:19.690898932Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: accepted connection (local_addr=127.0.0.1:51560, remote_addr=127.0.0.1:49049)
[2022-06-16T22:45:19.69231185Z] TRACE: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: incoming request (req_id=786bab2b-6b7e-4480-854d-fd403ed6283f, uri=/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229, method=POST, remote_addr=127.0.0.1:49049, local_addr=127.0.0.1:51560)
[2022-06-16T22:45:19.692820174Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/ServerContext/26037 on ivanova: registered sled agent (sled_uuid=b6d65341-167c-41df-9b5c-41cded99c229)
[2022-06-16T22:45:19.736664225Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: request completed (req_id=786bab2b-6b7e-4480-854d-fd403ed6283f, uri=/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229, method=POST, remote_addr=127.0.0.1:49049, local_addr=127.0.0.1:51560, response_code=204)
[2022-06-16T22:45:19.737492786Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(51560), path: "/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229", query: None, fragment: None }, status: 204, headers: {"x-request-id": "786bab2b-6b7e-4480-854d-fd403ed6283f", "date": "Thu, 16 Jun 2022 22:45:19 GMT"} })
[2022-06-16T22:45:19.942180224Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: accepted connection (local_addr=127.0.0.1:51560, remote_addr=127.0.0.1:39145)
[2022-06-16T22:45:19.942593397Z] TRACE: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: incoming request (req_id=a335d9db-2db1-4cfc-93de-31431b047b83, uri=/metrics/collectors, method=POST, remote_addr=127.0.0.1:39145, local_addr=127.0.0.1:51560)
[2022-06-16T22:45:50.727619201Z]  INFO: a0404ffe-187e-4aff-8fff-66edacfb6a29/dropshot_internal/26037 on ivanova: request completed (req_id=a335d9db-2db1-4cfc-93de-31431b047b83, uri=/metrics/collectors, method=POST, remote_addr=127.0.0.1:39145, local_addr=127.0.0.1:51560, error_message_external="Service Unavailable", error_message_internal="Timeout accessing connection pool", response_code=503)

After that, all we have are repeats of those last two entries: incoming metric collector requests followed by 503 errors. The log goes on until 2022-06-17T14:56:04.528359809Z with no other entries of note.

How does this timing interact with the CockroachDB exit? CockroachDB reported that it was starting to dump stderr at 22:45:17.402:

$ grep 'stderr capture' /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log 
I220616 22:45:17.402378 1 util/log/flags.go:201  [-] 1  stderr capture started

but the time of the file is 22:45:20.716:

$ TZ=utc ls -lErt /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log 
-rw-r-----   1 dap      staff     287541 2022-06-16 22:45:20.716442675 +0000 /dangerzone/omicron_tmp/try_repro.21113/.tmptw5KEL/data/logs/cockroach-stderr.ivanova.dap.2022-06-16T22_45_17Z.026042.log

so it took over 3 seconds to do that. (Interestingly, the test log shows successful requests that would have hit the database starting as late as 22:45:19.69231185Z, well after CockroachDB identified the fatal failure, so it was still functioning for a few seconds after that.)

Anyway, this is certainly around the time that the test stopped making forward progress. Now where exactly were we? The last entry that I can pin to a particular point in setup is this one:

[2022-06-16T22:45:19.737492786Z] DEBUG: test_disk_create_disk_that_already_exists_fails/omicron_sled_agent::sim::Server/26037 on ivanova: client response (sled_id=b6d65341-167c-41df-9b5c-41cded99c229)
    result: Ok(Response { url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Ipv4(127.0.0.1)), port: Some(51560), path: "/sled_agents/b6d65341-167c-41df-9b5c-41cded99c229", query: None, fragment: None }, status: 204, headers: {"x-request-id": "786bab2b-6b7e-4480-854d-fd403ed6283f", "date": "Thu, 16 Jun 2022 22:45:19 GMT"} })

That's the client in the simulated sled agent reporting a successful request to Nexus to register the sled agent. That would happen within:

omicron_sled_agent::sim::Server::start()
nexus_test_utils::start_sled_agent()
nexus_test_utils::test_setup_with_config()
nexus_test_utils::test_setup()
#[nexus_test] macro

which is about here:

.cpapi_sled_agents_post(

Now, in the config created by the caller, there are no zpools configured, so I think we might pop back up many frames and start the Oximeter server and producer. I don't see the "starting oximeter server" log message, but I think that's because the Oximeter config uses its own log config rather than the test suite's. I'll file a separate bug for this.

At this point, I expect our stack to look like this:

oximeter::collector::Oximeter::new()
nexus_test_utils::start_oximeter()
nexus_test_utils::test_setup_with_config()
nexus_test_utils::test_setup()
#[nexus_test] macro

Inside new(), we make the repeated attempts to register Oximeter with Nexus:

// Notify Nexus that this oximeter instance is available.
let client = reqwest::Client::new();
let notify_nexus = || async {
    debug!(log, "contacting nexus");
    client
        .post(format!(
            "http://{}/metrics/collectors",
            config.nexus_address
        ))
        .json(&nexus_client::types::OximeterInfo {
            address: server.local_addr().to_string(),
            collector_id: agent.id,
        })
        .send()
        .await
        .map_err(backoff::BackoffError::transient)?
        .error_for_status()
        .map_err(backoff::BackoffError::transient)
};
let log_notification_failure = |error, delay| {
    warn!(
        log,
        "failed to contact nexus, will retry in {:?}", delay;
        "error" => ?error
    );
};
backoff::retry_notify(
    backoff::internal_service_policy(),
    notify_nexus,
    log_notification_failure,
)
.await
.expect("Expected an infinite retry loop contacting Nexus");

and that's the request that we're seeing repeatedly in the test log failing with a 503 because the database is down.
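
For comparison, here's a rough sketch of what a bounded version of that loop could look like, using the upstream backoff crate directly rather than omicron's wrapper; the max_elapsed_time field is that crate's, and the ten-minute budget is made up:

// Sketch only, not omicron's current code: a retry policy that eventually
// gives up. When the budget is exhausted, retry_notify returns the last
// error and the expect() below turns it into a panic, failing the test
// instead of hanging forever.
let bounded_policy = backoff::ExponentialBackoff {
    max_elapsed_time: Some(std::time::Duration::from_secs(600)),
    ..Default::default()
};
backoff::future::retry_notify(
    bounded_policy,
    notify_nexus,
    log_notification_failure,
)
.await
.expect("could not register Oximeter with Nexus within 10 minutes");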

The only real fix here is to time out the test when it's taking this long. (It's been six hours in this state.) I'm not sure how best to do that: we could have each of the nexus_test_utils::start_... functions time out its attempt to start the respective server (Oximeter in this case), or we could have some top-level test timeout. On the other hand, for a problem like this, it's possible we'll want core files and other data (pfiles, pldd, pargs, and pargs -e output) for all the processes involved. Unless we're going to save all of that, it might be better to let the test hang so a person can come in and debug it as I just did.
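
If we went the per-function route, a minimal sketch might look like the following (assuming a tokio runtime; the helper name, the five-minute budget, and the idea of wrapping each start_... future are illustrative, not existing omicron code):

use std::{future::Future, time::Duration};

// Hypothetical helper: wrap one test-setup step in a timeout so the test
// fails loudly instead of hanging forever when a dependency (here,
// CockroachDB) has died underneath it.
async fn with_setup_timeout<T>(step: &str, fut: impl Future<Output = T>) -> T {
    match tokio::time::timeout(Duration::from_secs(300), fut).await {
        Ok(value) => value,
        Err(_) => panic!("test setup step {:?} timed out after 300s", step),
    }
}

// Usage inside test_setup_with_config() might then look like:
//   let oximeter =
//       with_setup_timeout("start_oximeter", start_oximeter(/* ... */)).await;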

Note that this would have been a lot faster to debug if we had the Oximeter log, because it would show very explicitly that it was trying to register itself, that it failed, and that it was planning to retry. I'll file a separate bug about that.

@davepacheco added the Test Flake label on Jul 15, 2022
davepacheco commented:

I've put some data (pfiles output and core file) from the test process on catacomb:

dap@catacomb /data/staff/core/issues $ ls -l omicron-1224
total 192851
-rw-rw-r--+  1 dap      staff    323863373 Jun 17 16:58 core.26037
-rw-rw-r--+  1 dap      staff       3013 Jun 17 16:58 pfiles.26037
-rw-rw-r--+  1 dap      staff       2229 Jun 17 16:59 pfiles.core
dap@catacomb /data/staff/core/issues $ 

@smklein added the database label on Oct 3, 2023
leftwo added a commit that referenced this issue on Mar 29, 2024