roachtest: c2c/shutdown/dest/worker failed #102111

cockroach-teamcity · 2023-04-24T09:20:06Z

roachtest.c2c/shutdown/dest/worker failed with artifacts on master @ 1f3419e178bdba544f74d9c9e14a4682efd18028:

test artifacts and logs in: /artifacts/c2c/shutdown/dest/worker/run_1
(sql_runner.go:218).Scan: error scanning '&{0xc0039a3200 <nil>}': pq: system-jobs-scan: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"
(1) pq: system-jobs-scan: rpc error: code = Unavailable desc = connection error: desc = "transport: error while dialing: connection interrupted (did the remote node shut down or are there networking issues?)"
Error types: (1) *pq.Error
(monitor.go:127).Wait: monitor failure: monitor task failed: context canceled while waiting for job to finish: context canceled
(test_runner.go:1087).func1: 1 dead node(s) detected

Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=8, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-27286


msbutler · 2023-04-27T19:35:45Z

This is a new infra flake which may point to an unrelated bug. A few seconds after we issue the cutover command, the test ungracefully shuts down node 8. On node 6, we loop and check that the ingestion job has succeeded. For reasons I don't understand, our query on crdb_internal.system_jobs fails, causing the whole roachtest to fail. I saw the following logs on node 7, suggesting that node 8's shutdown may have something to do with it.

W230424 09:18:58.422636 59663 sql/colflow/colrpc/outbox.go:189 ⋮ [T1,n3,f‹d3a67561›,distsql.stmt=‹WITH latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _), latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _) SELECT id, status, created, payload.value AS payload, progress.value AS progress, created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type FROM system.jobs AS j INNE›,distsql.gateway=‹2›,distsql.appname=‹$ internal-system-jobs-scan›,distsql.txn=‹916f0ba6-a394-46bd-98e6-8c667e18943c›,streamID=‹9›] 308 +Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString
I230424 09:18:58.422807 59610 rpc/context.go:2302 ⋮ [T1,n3,rnode=4,raddr=‹10.142.1.241:26257›,class=default,rpc] 309  connection heartbeat loop ended with err: initial connection heartbeat failed: grpc: ‹connection error: desc = "transport: error while dialing: dial tcp 10.142.1.241:26257: connect: connection refused"› [code 14/Unavailable]
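For illustration only, here is a minimal sketch of how a polling loop like the one described above could tolerate a query that fails while a node is going down. This is not the roachtest code: `waitForJobSuccess`, `jobStatusFn`, `scriptedQuery` and `errTransient` are hypothetical names, and the policy of retrying up to a fixed number of consecutive errors is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for the "rpc error: code = Unavailable" failure
// seen in the test output. It is a placeholder, not a real pq error.
var errTransient = errors.New("rpc error: code = Unavailable")

// jobStatusFn is a hypothetical hook that returns the current job status.
type jobStatusFn func() (string, error)

// waitForJobSuccess polls query up to maxPolls times. A query error is
// treated as transient and retried, unless more than maxConsecutiveErrs
// errors occur in a row.
func waitForJobSuccess(
	query jobStatusFn, maxConsecutiveErrs, maxPolls int, interval time.Duration,
) error {
	consecutiveErrs := 0
	for poll := 0; poll < maxPolls; poll++ {
		status, err := query()
		if err != nil {
			consecutiveErrs++
			if consecutiveErrs > maxConsecutiveErrs {
				return fmt.Errorf("giving up after %d consecutive query errors: %w",
					consecutiveErrs, err)
			}
			time.Sleep(interval)
			continue
		}
		consecutiveErrs = 0
		switch status {
		case "succeeded":
			return nil
		case "failed", "canceled":
			return fmt.Errorf("job ended in status %q", status)
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("job did not succeed within %d polls", maxPolls)
}

// scriptedQuery replays the given statuses in order. An empty string means
// "return errTransient"; once the script is exhausted it reports "running".
func scriptedQuery(steps ...string) jobStatusFn {
	i := 0
	return func() (string, error) {
		if i >= len(steps) {
			return "running", nil
		}
		s := steps[i]
		i++
		if s == "" {
			return "", errTransient
		}
		return s, nil
	}
}

func main() {
	// Two failed queries in the middle of the run do not fail the wait.
	q := scriptedQuery("running", "", "", "running", "succeeded")
	err := waitForJobSuccess(q, 3, 10, time.Millisecond)
	fmt.Println("wait error:", err)
}
```

The consecutive-error cap keeps a persistently broken connection from being mistaken for a slow job, while a brief outage during the shutdown is absorbed.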

This patch addresses two roachtest failure modes: - Prevents roachtest failure if a query fails during a node shutdown. - Prevents the src cluster from returning a single node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this. Fixes cockroachdb#101898 Fixes cockroachdb#102111 Release note: None

This patch addresses two roachtest failure modes: - Prevents roachtest failure if a query fails during a node shutdown. - Prevents the src cluster from returning a single node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this. Specifically, this patch changes the kv workload to split and scatter the kv table across the cluster before the c2c job begins. Fixes cockroachdb#101898 Fixes cockroachdb#102111 This patch also makes it easier to reproduce c2c roachtest failures by plumbing a random seed to several components of the roachtest driver. Release note: None
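As a rough illustration of why plumbing a random seed helps reproduction, the sketch below derives every random choice from one explicitly seeded generator, so rerunning with a logged seed replays the same choices. The names `newTestRNG` and `pickShutdownNode` and the node list are hypothetical and are not taken from the roachtest driver.

```go
package main

import (
	"fmt"
	"math/rand"
)

// newTestRNG returns a generator whose output depends only on seed.
func newTestRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// pickShutdownNode chooses one node from nodes using the supplied generator.
// nodes must be non-empty.
func pickShutdownNode(rng *rand.Rand, nodes []int) int {
	return nodes[rng.Intn(len(nodes))]
}

func main() {
	// In a real run the seed would be generated once and logged, so that a
	// failing run can be repeated with the same value.
	seed := int64(42)
	rng := newTestRNG(seed)
	nodes := []int{5, 6, 7, 8}
	fmt.Printf("seed=%d shutdown node=%d\n", seed, pickShutdownNode(rng, nodes))
}
```

Passing the `*rand.Rand` down to each component, rather than using the package-level generator, is what keeps the sequence of choices tied to the one seed.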

101786: workload: introduce timeout for pre-warming connection pool r=sean- a=sean-

Interrupting target instances during prewarming shouldn't cause workload to proceed: introduce a timeout to prewarming connections. Connections will have 15s to 5min to warmup before the context will expire.

Epic: none

101987: cli/sql: new option autocerts for TLS client cert auto-discovery r=rafiss a=knz

Fixes #101986.

See the release note below.

An additional benefit not mentioned in the release note is that it simplifies switching from one tenant to another when using shared-process multitenancy. For example, this becomes possible:

```
> CREATE TENANT foo;
> ALTER TENANT foo START SERVICE SHARED;
> \c cluster:foo root - - autocerts
```

Alternatively, this can also be used to quickly switch from a non-root user in an app tenant to the root user in the system tenant:

```
> \c cluster:system root - - autocerts
```

This works because (currently) all tenant servers running side-by-side use the same TLS CA to validate SQL client certs.

----

Release note (cli change): The `\connect` client-side command for the SQL shell (included in `cockroach sql`, `cockroach demo`, `cockroach-sql`) now recognizes an option `autocerts` as last argument. When provided, `\c` will now try to discover a TLS client certificate and key in the same directory(ies) as used by the previous connection URL. This feature makes it easier to switch usernames when TLS client/key files are available for both the previous and the new username.

102382: c2c: deflake c2c/shutdown roachtests r=stevendanna a=msbutler

c2c: deflake c2c/shutdown roachtests

This patch addresses two roachtest failure modes:
- Prevents roachtest failure if a query fails during a node shutdown.
- Prevents the src cluster from returning a single node topology, which could cause the stream ingestion job to hang if the participating src node gets shut down. Longer term, automatic replanning will prevent this.

Specifically, this patch changes the kv workload to split and scatter the kv table across the cluster before the c2c job begins.

Fixes #101898
Fixes #102111

This patch also makes it easier to reproduce c2c roachtest failures by plumbing a random seed to several components of the roachtest driver.

Release note: None

c2c: rename completeStreamIngestion to applyCutoverTime

Release note: none

workload: add --scatter flag to kv workload

The user can now run `./workload init kv --scatter ....` which scatters the kv table across the cluster after the initial data load. This flag is best used with `--splits`, `--max-block-bytes`, and `--insert-count`.

Epic: none

Release note: none

102819: admission: move CreateTime-sequencing below-raft r=irfansharif a=irfansharif

These are already reviewed commits from #98308. Part of #95563.

---

**admission: move CreateTime-sequencing below-raft**

We move kvflowsequencer.Sequencer and its use in kvflowhandle.Handle (above-raft) to admission.sequencer, now used by admission.StoreWorkQueue (below-raft). This variant appeared in an earlier revision of #97599 where we first introduced monotonically increasing CreateTimes for a given raft group.

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll observe that it's quite difficult to create sequencing CreateTimes[^1] above raft. This is because these sequence numbers are encoded as part of the raft proposal[^2], and at encode-time, we don't actually know what log position the proposal is going to end up in. It's hard to explicitly guarantee that a proposal with log-position P1 will get encoded before another with log position P2, where P1 < P2.

Naively sequencing CreateTimes at proposal-encode-time could result in over-admission. This is because of how we return flow tokens -- up to some log index[^3], and how we use these sequence numbers in below-raft WorkQueues. If P2 ends up with a lower sequence number/CreateTime, it would get admitted first, and when returning flow tokens by log position, in specifying up-to-P2, we'll early return P1's flow tokens despite it not being admitted. So we'd over-admit at the sender. This is all within a <tenant,priority> pair.

[^1]: We use CreateTimes as "sequence numbers" in replication admission control. We want to assign each AC-queued work below-raft a "sequence number" for FIFO ordering within a <tenant,priority>. We ensure these timestamps are roughly monotonic with respect to log positions of replicated work by sequencing work in log position order.
[^2]: In kvflowcontrolpb.RaftAdmissionMeta.
[^3]: See kvflowcontrolpb.AdmittedRaftLogEntries.

---

**admission: add intercept points for when replicated work gets admitted**

In a subsequent commit, when integrating kvflowcontrol into the critical path for replication traffic, we'll set up the return of flow tokens from the receiver node back to the sender once log entries get (asynchronously) admitted[^4]. So we need to intercept the exact points at which the virtually enqueued work items get admitted, since it all happens asynchronously[^5]. To that end we introduce the following interface:

```go
// OnLogEntryAdmitted is used to observe the specific entries
// (identified by rangeID + log position) that were admitted. Since
// admission control for log entries is asynchronous/non-blocking,
// this allows callers to do requisite post-admission
// bookkeeping.
type OnLogEntryAdmitted interface {
	AdmittedLogEntry(
		origin roachpb.NodeID, /* node where the entry originated */
		pri admissionpb.WorkPriority, /* admission priority of the entry */
		storeID roachpb.StoreID, /* store on which the entry was admitted */
		rangeID roachpb.RangeID, /* identifying range for the log entry */
		pos LogPosition, /* log position of the entry that was admitted */
	)
}
```

For now we pass in a no-op implementation in production code, but this will change shortly.

Seeing as how the asynchronous admit interface is going to be the primary one once we enable replication admission control by default, for IO control, we no longer need the storeWriteDone interfaces and corresponding types. It's being used by our current (and soon-to-be legacy) above-raft IO admission control to inform granters of when the write was actually done, post-admission. For above-raft IO control, at admit-time we do not have sizing info for the writes, so by intercepting these writes at write-done time we're able to make any outstanding token adjustments in the granter.

To reflect this new world, we:
- Rename setAdmittedDoneModels to setLinearModels.
- Introduce a storeReplicatedWorkAdmittedInfo[^6]. It provides information about the size of replicated work once it's admitted (which happens asynchronously from the work itself). This lets us use the underlying linear models for L0 {writes,ingests} to deduct an appropriate number of tokens from the granter, for the admitted work size[^7].
- Rename the granterWithStoreWriteDone interface to granterWithStoreReplicatedWorkAdmitted. We'll still intercept the actual point of admission for some token adjustments, through the storeReplicatedWorkAdmittedLocked API shown below. There are two callstacks through which this API gets invoked, one where the coord.mu is already held, and one where it isn't. We plumb this information through so the lock is acquired if not already held. The locking structure is unfortunate, but this was a minimally invasive diff.

```go
storeReplicatedWorkAdmittedLocked(
	originalTokens int64,
	admittedInfo storeReplicatedWorkAdmittedInfo,
) (additionalTokens int64)
```

While here, we also export an admission.TestingReverseWorkPriorityDict. There are at least three tests that have re-invented the wheel.

[^4]: This will happen through the kvflowcontrol.Dispatch interface introduced back in #97766, after integrating it with the RaftTransport layer.
[^5]: Introduced in #97599, for replicated write work.
[^6]: Identical to the previous StoreWorkDoneInfo.
[^7]: There's a peculiarity here in that at enqueuing-time we actually know the size of the write, so we could have deducted the right number of tokens upfront and avoid this post-admit granter token adjustment. We inherit this structure from earlier, and just leave a TODO for now.

103116: generate-logic-test: fix incorrect timeout in logictests template r=rickystewart a=healthy-pod

In #102719, we changed the way we set `-test.timeout` but didn't update the logictests template. This code change updates the template.

Release note: None

Epic: none

Co-authored-by: Sean Chittenden <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
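To make the split-and-scatter step mentioned in the #102382 message above more concrete, here is a small sketch that builds the SQL statements such a step might issue. `ALTER TABLE ... SPLIT AT VALUES` and `ALTER TABLE ... SCATTER` are CockroachDB statements; the helper `splitAndScatterStmts`, the evenly spaced split points and the non-negative key range are assumptions made for illustration and do not describe how the kv workload actually chooses its splits.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAndScatterStmts returns statements that split table at `splits`
// evenly spaced integer keys in (0, maxKey) and then scatter its ranges.
// The key layout is hypothetical; if maxKey is smaller than splits+1 the
// split points collapse to the same value.
func splitAndScatterStmts(table string, splits int, maxKey int64) []string {
	var stmts []string
	if splits > 0 {
		step := maxKey / int64(splits+1)
		vals := make([]string, 0, splits)
		for i := 1; i <= splits; i++ {
			vals = append(vals, fmt.Sprintf("(%d)", int64(i)*step))
		}
		stmts = append(stmts, fmt.Sprintf(
			"ALTER TABLE %s SPLIT AT VALUES %s", table, strings.Join(vals, ", ")))
	}
	// Scattering asks the cluster to redistribute the table's ranges, so
	// that more than one node ends up holding data for the table.
	stmts = append(stmts, fmt.Sprintf("ALTER TABLE %s SCATTER", table))
	return stmts
}

func main() {
	for _, s := range splitAndScatterStmts("kv", 3, 400) {
		fmt.Println(s + ";")
	}
}
```

With data spread over several nodes before the stream starts, the source cluster is less likely to report a single-node topology, which is the hang the patch description is guarding against.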

cockroach-teamcity added this to the 23.1 milestone Apr 24, 2023

msbutler removed the release-blocker label Apr 26, 2023

msbutler self-assigned this Apr 26, 2023

msbutler mentioned this issue Apr 27, 2023

c2c: deflake c2c/shutdown roachtests #102382

Merged

craig bot closed this as completed in 093e2dd May 11, 2023

github-project-automation bot added this to Disaster Recovery Backlog Aug 28, 2024

github-project-automation bot moved this to Done in Disaster Recovery Backlog Aug 28, 2024
