roachtest: splits/largerange/size=32GiB,nodes=6 failed [raft snaps; needs #106813] #104588
Comments
while querying
Could be an instance of #102936
First error mentioning
Exactly the same poisoned latch (key
Between 13:41:44-13:42:15, there are a few "poisoned latch" errors with the same key logged on nodes 1, 2, 4, 5 (from line
Salvaged the artifacts, https://drive.google.com/file/d/16bkGUEoTaTMg9XeAXOjX8pfsHebxFS7B/view?usp=share_link
@tbg says:
@erikgrinaker says:
The contention events log has something to say about the aborted transaction:
{
"count" : "1",
"cumulative_contention_time" : "00:00:00.250195",
"index_id" : "106",
"key" : "\\x03f289f9071d8bd5",
"num_contention_events" : "2",
"table_id" : "4294967163",
"txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c"
}
{
"blocking_txn_fingerprint_id" : "\\x0000000000000000",
"blocking_txn_id" : "97e8d7ec-2a38-4880-9df6-3b51abc9086c",
"collection_ts" : "2023-06-08 13:42:19.554013+00",
"contending_key" : "\\x03f289f9071d8bd5",
"contending_pretty_key" : "/Meta2/Table/106/1/119376853",
"contention_duration" : "00:00:00.052876",
"database_name" : "",
"index_name" : "",
"schema_name" : "",
"table_name" : "",
"waiting_stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
"waiting_stmt_id" : "1766b2d540926e3c0000000000000001",
"waiting_txn_fingerprint_id" : "\\x0000000000000000",
"waiting_txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb"
}
Execution insights on the contending transaction:
{
"app_name" : "",
"causes" : "{}",
"contention" : "00:00:00",
"cpu_sql_nanos" : "6072049",
"database_name" : "defaultdb",
"end_time" : "2023-06-08 13:42:19.556586",
"error_code" : "XXUUU",
"exec_node_ids" : "{1}",
"full_scan" : "f",
"implicit_txn" : "t",
"index_recommendations" : "{}",
"last_retry_reason" : "NULL",
"plan_gist" : "AgICABoMAwUIBgg=",
"priority" : "normal",
"problem" : "FailedExecution",
"query" : "SELECT t.range_id, t.start_key_pretty, t.status, t.detail FROM ROWS FROM (crdb_internal.check_consistency(_, '_', '_')) AS t WHERE t.status NOT IN ('_', '_')",
"retries" : "0",
"rows_read" : "0",
"rows_written" : "0",
"session_id" : "1766b2d53cfd91a80000000000000001",
"start_time" : "2023-06-08 13:42:19.480564",
"status" : "Failed",
"stmt_fingerprint_id" : "\\x66de6ed8999bbf0e",
"stmt_id" : "1766b2d540926e3c0000000000000001",
"txn_fingerprint_id" : "\\x3ff3476808ae6c64",
"txn_id" : "55eed750-2e00-4680-b85b-d045eb9d45fb",
"user_name" : "root"
}
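For reference, the rows above look like output from CockroachDB's crdb_internal virtual tables. Below is a minimal Go sketch for pulling the contention events recorded against the blocking transaction, assuming a locally reachable insecure node and the lib/pq driver; the column names are taken from the JSON above, while the specific table name and connection string are assumptions.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
	// Assumption: an insecure local node, as in a roachtest cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Pull contention events recorded against the blocking transaction above.
	rows, err := db.Query(`
SELECT collection_ts, contending_pretty_key, contention_duration, waiting_txn_id
FROM crdb_internal.transaction_contention_events
WHERE blocking_txn_id = $1`,
		"97e8d7ec-2a38-4880-9df6-3b51abc9086c")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var ts, key, dur, waitingTxn string
		if err := rows.Scan(&ts, &key, &dur, &waitingTxn); err != nil {
			log.Fatal(err)
		}
		fmt.Println(ts, key, dur, waitingTxn)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```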
"circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded"` More details on Details"span": {
"start_key": "/Table/106/1/119118742",
"end_key": "/Table/106/1/119376853"
},
"raft_state": {
"replica_id": 2,
"hard_state": {
"term": 7,
"vote": 2,
"commit": 29
},
"lead": 2,
"state": "StateLeader",
"applied": 29,
"progress": {
"1": {
"next": 1,
"state": "StateSnapshot",
"paused": true,
"pending_snapshot": 29
},
"2": {
"match": 161,
"next": 162,
"state": "StateReplicate"
},
"3": {
"match": 161,
"next": 162,
"state": "StateReplicate"
},
"4": {
"next": 1,
"state": "StateSnapshot",
"paused": true,
"pending_snapshot": 26
}
}
},
"state": {
"state": {
"raft_applied_index": 29,
"lease_applied_index": 9,
"desc": {
"range_id": 548,
"start_key": "8on5Bxmblg==",
"end_key": "8on5Bx2L1Q==",
"internal_replicas": [
{
"node_id": 1,
"store_id": 1,
"replica_id": 1,
"type": 0
},
{
"node_id": 3,
"store_id": 3,
"replica_id": 2,
"type": 4
},
{
"node_id": 4,
"store_id": 4,
"replica_id": 3,
"type": 0
},
{
"node_id": 2,
"store_id": 2,
"replica_id": 4,
"type": 2
}
],
"next_replica_id": 5,
"generation": 475,
"sticky_bit": {}
},
"lease": {
"start": {
"wall_time": 1686231642153434301
},
"expiration": {
"wall_time": 1686231648153352643
},
"replica": {
"node_id": 3,
"store_id": 3,
"replica_id": 2,
"type": 0
},
"proposed_ts": {
"wall_time": 1686231642153352643
},
"sequence": 5,
"acquisition_type": 1
},
"truncated_state": {
"index": 10,
"term": 5
},
"gc_threshold": {},
"stats": {
"contains_estimates": 0,
"last_update_nanos": 1686231615586276112,
"intent_age": 0,
"gc_bytes_age": 0,
"live_bytes": 33554430,
"live_count": 258111,
"key_bytes": 5420331,
"key_count": 258111,
"val_bytes": 28134099,
"val_count": 258111,
"intent_bytes": 0,
"intent_count": 0,
"separated_intent_count": 0,
"range_key_count": 0,
"range_key_bytes": 0,
"range_val_count": 0,
"range_val_bytes": 0,
"sys_bytes": 866,
"sys_count": 8,
"abort_span_bytes": 0
},
"version": {
"major": 23,
"minor": 1,
"patch": 0,
"internal": 8
},
"raft_closed_timestamp": {
"wall_time": 1686231639296513274
},
"raft_applied_index_term": 7,
"gc_hint": {
"latest_range_delete_timestamp": {}
}
},
"last_index": 161,
"num_pending": 4,
"raft_log_size": 43070,
"approximate_proposal_quota": 8388183,
"proposal_quota_base_index": 29,
"range_max_bytes": 67108864,
"active_closed_timestamp": {
"wall_time": 1686231639296513274
},
"tenant_id": 1,
"closed_timestamp_sidetransport_info": {
"replica_closed": {
"wall_time": 1686231639279698235
},
"replica_lai": 7,
"central_closed": {}
},
"circuit_breaker_error": "replica unavailable: (n3,s3):2VOTER_DEMOTING_LEARNER unable to serve request to r548:/Table/106/1/119{118742-376853} [(n1,s1):1, (n3,s3):2VOTER_DEMOTING_LEARNER, (n4,s4):3, (n2,s2):4VOTER_INCOMING, next=5, gen=475]: closed timestamp: 1686231639.296513274,0 (2023-06-08 13:40:39); raft status: {\"id\":\"2\",\"term\":7,\"vote\":\"2\",\"commit\":29,\"lead\":\"2\",\"raftState\":\"StateLeader\",\"applied\":29,\"progress\":{\"1\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"},\"2\":{\"match\":109,\"next\":110,\"state\":\"StateReplicate\"},\"3\":{\"match\":109,\"next\":113,\"state\":\"StateReplicate\"},\"4\":{\"match\":0,\"next\":1,\"state\":\"StateSnapshot\"}},\"leadtransferee\":\"0\"}: operation \"probe\" timed out after 1m0.002s (given timeout 1m0s): result is ambiguous: after 60.00s of attempting command: context deadline exceeded"
}
We see that the incoming configuration (replicas 1, 3, 4) has two replicas in StateSnapshot.
There is a large 3.6 GiB snapshot for
This was blocking all other snapshots incoming to node 1, which we see in the logs by filtering for "applied .* snapshot" messages:
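As an aside, that filter is just a regular-expression scan over the node logs. A minimal Go sketch, with a hypothetical artifacts path; the pattern is the one quoted above:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"regexp"
)

func main() {
	// Match the "applied ... snapshot" lines mentioned above.
	re := regexp.MustCompile(`applied .* snapshot`)

	// Hypothetical path to a node's log file from the test artifacts.
	f, err := os.Open("artifacts/1.logs/cockroach.log")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some log lines are long
	for sc.Scan() {
		if re.MatchString(sc.Text()) {
			fmt.Println(sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```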
We're likely seeing the cascading effect described in cockroach/pkg/kv/kvserver/split_delay_helper.go, lines 109 to 148 at 87d6547.
Notably, though, splits driven by the split queue are not "delayable" (see here and the other 2 invocations below it), i.e. this workaround is not active. We should probably enable these delays.
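To make "delayable" concrete, here is a rough, purely illustrative sketch of the mechanism; the function, types, and the roughly 20-second budget are hand-written stand-ins, not the actual split_delay_helper code:

```go
package main

import (
	"fmt"
	"time"
)

// followerState is a simplified stand-in for a raft follower's progress.
type followerState struct {
	replicaID     int
	needsSnapshot bool // e.g. tracked as StateSnapshot by the leader
}

// maybeDelaySplit illustrates the workaround: before executing a delayable
// split, wait (up to a small budget) for followers that would otherwise miss
// the split and then need a snapshot to catch up. Non-delayable splits, like
// the size-based ones from the split queue in this test, skip this entirely.
func maybeDelaySplit(followers func() []followerState, delayable bool, budget time.Duration) {
	if !delayable {
		return
	}
	deadline := time.Now().Add(budget)
	for time.Now().Before(deadline) {
		behind := 0
		for _, f := range followers() {
			if f.needsSnapshot {
				behind++
			}
		}
		if behind == 0 {
			return // all followers caught up; splitting now won't strand an RHS
		}
		time.Sleep(time.Second)
	}
	// Budget exhausted: the split proceeds anyway, so with snapshots that take
	// minutes this can only slow a snapshot cascade rather than prevent it.
}

func main() {
	// Toy usage: a single follower that catches up after a few seconds.
	start := time.Now()
	followers := func() []followerState {
		return []followerState{{replicaID: 4, needsSnapshot: time.Since(start) < 3*time.Second}}
	}
	maybeDelaySplit(followers, true /* delayable */, 20*time.Second)
	fmt.Printf("split proceeding after %s\n", time.Since(start).Round(time.Second))
}
```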
Synopsis:
cc @cockroachdb/replication
Going through the logs again just to button things up. The initial large range is r64. First, r64 gets hit by the randomized backup in the roachtest. Oops! The range is 16GiB at this point. We apply a 5min timeout to this export somewhere. So if you ever have a large range like this, backups are not going to work.[1]
First, r64 successfully adds a replica on n3 (the snapshot takes ~260s and streams at ~63MB/s, i.e. right at the max configured rate). Next, n1 adds a learner, but then that learner gets removed by n3, which apparently now holds the lease. From this we can infer that there's likely still a LEARNER snapshot inflight from n1->n3 in what follows below. n3 then tries to add its own LEARNER to n2 instead (replID 4):
We see in the logs that this overlaps with another attempt at exporting r64 because ~2 minutes later, we see this log line:
Less than a minute later, at around 13:34:43, we see n3 roll back the learner because sending the snapshot failed:
This is congruent with our earlier observation that there might still be a snapshot n1->n2 inflight - that snapshot would delay n3's. Finally, at 13:36:49, n1's snapshot fails with "Raft group deleted"; makes sense, since that Replica is long stale. There is some more upreplication, and it looks like the lease transfers (?) again from n3 to n1, and they step on each other's toes some more:
In particular, we hit this helpful logging:
Thanks to 2018 Tobias, we have a helpful comment.[2] Basically, if we split a range and then remove a replica before that replica has applied the split, the removal wipes out the would-be right-hand side of the split. In other words, there's now a replica in the RHS that needs a snapshot - in our case, r74. The problem is now that we're carrying out tons of splits due to range size, and they're not delayable: cockroach/pkg/kv/kvserver/split_queue.go, lines 261 to 274 at 2b91c38.
Details: The way splits work is that they carve chunks off the LHS. So r74 will be split again, its RHS will be split again, etc., and each such split causes another snapshot. It would have helped here if these splits had been delayable, but since they delay by at most ~20s and the snapshots take "minutes", it likely would've slowed the problem but not eliminated it.[3][4]

One might try to argue that aggressively replicaGC'ing r64/6 was the issue. But note that r64/6 had been removed from the range, so it wouldn't necessarily even be caught up to the split in the first place, but it will block snapshots to its keyspace. So it does need to get removed ASAP. What could have been delayed is removing r64/6 from the descriptor; but how was the replicate queue on another node supposed to know that that replica was still pending a split? It could in theory have known - the split and the removal of the learner are both driven by the leaseholder, so similarly to how we coordinate snapshots between the replicate and raft snapshot queues, the replicate queue could back off if there's an ongoing split attempt (and the split queue would have to wait until all followers have applied the split). I am not advocating for actually doing this, since all of these dependencies between queues are a real problem in and of themselves. But there is a compelling argument that the independence of these queues is a problem.

At this point we're flooding the raft snapshot queue with requests that it will be very slow to process. They are also following a pathological pattern due to the sequence of splits: each inflight snapshot will, by the time it arrives, make a replica that requires a ton more splits in order to be up to date, and which will block all but the snapshot for the very first split in that wide range. This means most of the time the raft snapshot queue will just cycle through replicas, getting rejections for all but ~1 replica, loop around, etc., each time wasting the 10min scanner interval. In other words, the backlog will clear out very slowly.

But we haven't lost quorum at this point. This happens only later, when we make this ill-fated conf change:
The config is
s2 is just being promoted, and this is what causes the loss of quorum. But the promotion was carried out by the replicate queue, and didn't it just add the learner and send it a snapshot? How is s2 snapshotless now (note the
The debug.zip also shows this replica present on s2. So this appears to be a failure in the replication layer - the data is there, but Raft is sticking to StateSnapshot. I've written elsewhere[5] about the oddities of raft snapshot handling. It's just generally a little brittle and especially assumes much closer coupling between raft and the snapshots than is present in CRDB. Pure conjecture, but it's possible the following happened here:
Footnotes
I think this might be fixed as of https://github.com/cockroachdb/cockroach/pull/106793/files#diff-1e9884dfbcc4eb6d90a152e1023fd2a84dcda68f04a877be8a01c16df0fe0331R2888. When the MsgAppResp is received by the leader, this ought to be almost enough, and that PR achieves that. The one remaining edge case is the one fixed by etcd-io/raft#84, so once we have that one in as well (tracked in #106813), I think we can consider closing this test failure issue.
Filed #107520 to track the general poor interplay between splits and snapshots as triggered by replica movement in this test. It's something we (collectively) know about and this has been "a thing" for years, but it's hard to track since it lives in the spaces in between components. That issue will have to do.
I don't follow this assertion. Individual ExportRequests, which are what have the 5min timeout, are paginated to a fixed 16MiB regardless of the underlying range size, so I don't think there is an automatic assumption that a bigger range should expect to hit this timeout. The only mechanism that comes to mind right now would be very few keys in the range above the requested timestamp for an incremental ExportRequest, so the range iterates extensively without finding keys that match the timestamp predicate to return, meaning it doesn't hit the 16MiB limit. We can discuss that more over in #107519 if we think that's what's up, but I certainly don't think we expect a large range to need a longer per-request timeout.
You just know more than I do; feel free to re-title #107519 or close it if this doesn't capture anything new! I only skimmed the surface in looking at the export and didn't dig deeper into how it works under the hood.
Closing. #104588 (comment) and #106813 on the assumption that #106793 practically removes the erroneous StateSnapshot-despite-having-received-snapshot.
roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ c7a7e423506c88f655c08d6ba6acd6e907e1dae5:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=gce
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-28616