roachtest: rebalance/by-load/replicas failed #79118

Closed
cockroach-teamcity opened this issue Mar 31, 2022 · 5 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 31, 2022

roachtest.rebalance/by-load/replicas failed with artifacts on master @ e4fcae1625a35917d08a2a569d24529b0cc089d2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/rebalance/by-load/replicas/run_1
	rebalance_load.go:128,rebalance_load.go:155,test_runner.go:875: timed out before leases were evenly spread
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-14583

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 31, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 31, 2022
@nvanbenschoten
Member

12:26:56 rebalance_load.go:191: not all nodes have a lease yet: [s4: 1, s5: 1, s6: 4]
12:27:01 rebalance_load.go:191: not all nodes have a lease yet: [s4: 1, s5: 1, s6: 4]
...
12:31:51 rebalance_load.go:191: not all nodes have a lease yet: [s2: 1, s3: 1, s4: 1, s5: 2, s6: 1]
12:31:56 rebalance_load.go:191: not all nodes have a lease yet: [s2: 1, s3: 1, s4: 1, s5: 2, s6: 1]
12:32:01 test_impl.go:312: test failure: 	rebalance_load.go:128,rebalance_load.go:155,test_runner.go:875: timed out before leases were evenly spread

Over the course of 5 minutes, the leases on this table fail to make their way to store 1.

We have high-verbosity logging enabled and can see why store 5 is not transferring leases away:

I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078  ShouldTransferLease (lease-holder=5):
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  candidate: avg-ranges=39.333333333333336 avg-leases=8.333333333333332 avg-disk-usage=10 MiB avg-queries-per-second=4180.5905469332265›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  3: ranges=39 leases=8 disk-usage=2.3 MiB queries-per-second=4171.06 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  1: ranges=38 leases=8 disk-usage=972 KiB queries-per-second=11.51 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  4: ranges=40 leases=7 disk-usage=19 MiB queries-per-second=4166.16 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  6: ranges=39 leases=8 disk-usage=18 MiB queries-per-second=3993.85 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  5: ranges=40 leases=10 disk-usage=18 MiB queries-per-second=8552.84 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  2: ranges=40 leases=9 disk-usage=2.6 MiB queries-per-second=4188.13 l0-sublevels=0›
I220331 12:31:59.544600 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1843 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1079  ShouldTransferLease decision (lease-holder=5): false
1: ranges=38 leases=8 disk-usage=972 KiB queries-per-second=11.51
...
5: ranges=40 leases=10 disk-usage=18 MiB queries-per-second=8552.84
...
ShouldTransferLease decision (lease-holder=5): false

This is surprising to me.

@aayushshah15
Contributor

This is surprising to me.

I think that's expected: that output comes from the replicate queue, which only considers lease counts when deciding whether to transfer a lease away.
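
To make that distinction concrete, here is a minimal, self-contained sketch (hypothetical types and an arbitrary slack threshold, not the actual allocator code) of a lease-count-only check applied to the store stats from the log above. It returns false even though s5 is serving roughly twice the mean QPS:

```go
package main

import "fmt"

// storeStats mirrors the per-store fields in the log lines above
// (a hypothetical type, not the actual allocator structs).
type storeStats struct {
	id     int
	leases int
	qps    float64
}

// shouldTransferLeaseByCount is a toy, count-only check: transfer only if the
// holder has meaningfully more leases than the mean. The slack of 2 leases is
// arbitrary; the point is that QPS never enters the decision.
func shouldTransferLeaseByCount(holder storeStats, all []storeStats) bool {
	total := 0
	for _, s := range all {
		total += s.leases
	}
	mean := float64(total) / float64(len(all))
	return float64(holder.leases) > mean+2
}

func main() {
	stores := []storeStats{
		{1, 8, 11.51}, {2, 9, 4188.13}, {3, 8, 4171.06},
		{4, 7, 4166.16}, {5, 10, 8552.84}, {6, 8, 3993.85},
	}
	fmt.Println(shouldTransferLeaseByCount(stores[4], stores)) // false: 10 leases vs mean ~8.3
}
```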


Looking at the logs, we see that the store-rebalancer does detect the expected rebalance opportunity.

kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:732 ⋮ [n5,s5,store-rebalancer] 812 rebalancing voter (qps=4105.78) for r47 on n5,s5 to n1,s1 in order to improve QPS balance

Then, this is what we see with that rebalance:

kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:395 ⋮ [n5,s5,store-rebalancer] 831 unable to relocate range to [n1,s1 n6,s6 n4,s4]: while carrying out changes [{ADD_VOTER n1,s1} {REMOVE_VOTER n5,s5}]: removing learners from r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n5,s5):7LEARNER, (n1,s1):8, next=9, gen=32, sticky=9223372036.854775807,2147483647]: change replicas of r47 failed: descriptor changed: [expected] r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n5,s5):7LEARNER, (n1,s1):8, next=9, gen=32, sticky=9223372036.854775807,2147483647] != [actual] r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n1,s1):8, next=9, gen=33, sticky=9223372036.854775807,2147483647]

The rebalance technically succeeded, but the AdminRelocateRange call failed with an error because the LEARNER replica it was trying to remove had already been cleaned up.

When this AdminRelocateRange call fails, the StoreRebalancer assumes that the replica did not actually move and does not update its in-memory accounting of its QPS. This means it will try to move away more replicas than it should. We see this in the test: s5 has two hot ranges, and it redundantly tries to rebalance both of them away. Both of these rebalances failed for the same reason.

rebalancing r46 (4140.25 qps) to better balance load: voters from (n4,s4):1,(n5,s5):7,(n6,s6):4 to [n1,s1 n4,s4 n6,s6];
...
rebalancing r47 (4105.78 qps) to better balance load: voters from (n4,s4):1,(n6,s6):4,(n5,s5):7 to [n1,s1 n6,s6 n4,s4]; 
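
For illustration, a toy model of the accounting problem (hypothetical names, not the actual StoreRebalancer code): because the failed relocate leaves the local QPS tally untouched, the loop goes on to move the second hot range as well:

```go
package main

import (
	"errors"
	"fmt"
)

// hotRange is a toy stand-in for a hot range tracked by the StoreRebalancer.
type hotRange struct {
	id  int
	qps float64
}

// relocate stands in for AdminRelocateRange: the replica actually moves, but
// the call still returns an error because the learner was already removed.
func relocate(r hotRange) error {
	return errors.New("change replicas failed: descriptor changed")
}

func main() {
	localQPS := 8552.84 // s5's measured QPS
	meanQPS := 4180.59  // the cluster mean from the log above
	hot := []hotRange{{46, 4140.25}, {47, 4105.78}}

	for _, r := range hot {
		if localQPS <= meanQPS {
			break // believed to be balanced; stop moving replicas
		}
		if err := relocate(r); err != nil {
			// The problem described above: on error we assume nothing moved
			// and leave localQPS unchanged, so the next hot range gets moved
			// away as well, even though the first move already succeeded.
			fmt.Printf("r%d: %v (QPS accounting unchanged)\n", r.id, err)
			continue
		}
		localQPS -= r.qps // only adjusted on success
	}
}
```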

I think we could:

  1. Not return an error from AdminChangeReplicas when, at the last step of removing the demoted learner replica, it finds that the learner has already been removed.
  2. In the store-rebalancer, when an AdminRelocateRange call errors out, we should re-read the range's descriptor afresh to determine whether we actually moved any replicas (see the sketch after this list).
  3. We should implement the approach mentioned in this comment, to make these races more unlikely: kvserver: merge queue racing against replicate queue #57129 (comment)
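
Here is a rough sketch of what the second suggestion could look like (all names are hypothetical stand-ins, not the real StoreRebalancer API): on error, re-read the descriptor afresh and only adjust the in-memory QPS accounting if the local replica is in fact gone:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// rangeDesc is a toy stand-in for roachpb.RangeDescriptor.
type rangeDesc struct{ voterStores []int }

// adminRelocateRange is a hypothetical wrapper: the change applied, but the
// call still returns an error.
func adminRelocateRange(ctx context.Context, rangeID int) error {
	return errors.New("descriptor changed")
}

// lookupRangeDescriptor is a hypothetical fresh read of the descriptor.
func lookupRangeDescriptor(ctx context.Context, rangeID int) (rangeDesc, error) {
	return rangeDesc{voterStores: []int{1, 4, 6}}, nil // s5 is already gone
}

// relocateAndReconcile follows suggestion 2: on error, re-read the descriptor
// to decide whether the local replica actually moved before touching the
// in-memory QPS accounting.
func relocateAndReconcile(ctx context.Context, rangeID, localStore int, rangeQPS float64, localQPS *float64) error {
	err := adminRelocateRange(ctx, rangeID)
	if err == nil {
		*localQPS -= rangeQPS
		return nil
	}
	desc, lookupErr := lookupRangeDescriptor(ctx, rangeID)
	if lookupErr != nil {
		return err // can't tell; fall back to the old (pessimistic) behavior
	}
	for _, s := range desc.voterStores {
		if s == localStore {
			return err // still a voter here; leave the accounting alone
		}
	}
	*localQPS -= rangeQPS // the replica did move away; account for it
	return err
}

func main() {
	localQPS := 8552.84
	_ = relocateAndReconcile(context.Background(), 47, 5, 4105.78, &localQPS)
	fmt.Printf("adjusted local QPS: %.2f\n", localQPS)
}
```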

@nvanbenschoten
Member

All three of your suggestions sound beneficial. Of them, the first sounds the most straightforward and least risky. It may be small enough to slip into a 22.1 point release.

How do you feel about removing the release blocker label from this issue and opening two new "quick win" issues for the first two items?

@nvanbenschoten
Member

We should implement the approach mentioned in this comment, to make these races more unlikely

Was there enough in the logs to determine who was racing with the StoreRebalancer?

@aayushshah15 aayushshah15 removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.1.0-beta.2 labels Apr 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 4, 2022
This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes cockroachdb#57129
Relates to cockroachdb#79118

Release note: none
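
Roughly, the safeguard described in this commit message amounts to something like the following (a sketch with hypothetical names; the real change lives in Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners and the log-truncation machinery):

```go
package main

import "fmt"

// replica is a toy stand-in for a replica descriptor.
type replica struct {
	storeID   int
	isLearner bool
}

// snapshotInFlight stands in for the in-memory lock on log truncations that
// is held while a learner's initial snapshot is being sent.
var snapshotInFlight = map[int]bool{5: true}

// learnersSafeToRemove filters out learners whose initial snapshot is still
// in flight, so a concurrent queue does not interrupt that replication change.
func learnersSafeToRemove(replicas []replica) []replica {
	var out []replica
	for _, r := range replicas {
		if !r.isLearner {
			continue
		}
		if snapshotInFlight[r.storeID] {
			continue // another change is mid-flight; leave this learner alone
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rs := []replica{{4, false}, {6, false}, {5, true}, {1, false}}
	fmt.Println(learnersSafeToRemove(rs)) // []: s5's learner is protected
}
```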
craig bot pushed a commit that referenced this issue Apr 12, 2022
79405: kvserver: gracefully handle races during learner removals r=aayushshah15 a=aayushshah15

Previously, at the end of a replication change, if the `ChangeReplicas` request
found that the (demoted) learner replica it was trying to remove from the range
had already been removed (presumably because it raced with the mergeQueue, the
`StoreRebalancer`, or something else), it would error out. This was
unfortunate, because, for all practical purposes, the replication change _had
succeeded_.

We can now gracefully handle this instead by no-oping if we detect that the
replica we were trying to remove has already been removed.
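
In rough terms, the graceful handling amounts to something like this (hypothetical names, not the actual ChangeReplicas code): if the descriptor no longer lists the learner we meant to remove, treat the change as done instead of returning a "descriptor changed" error:

```go
package main

import "fmt"

// descriptor is a toy stand-in for a range descriptor: storeID -> replica type.
type descriptor struct{ replicas map[int]string }

// removeDemotedLearner no-ops if the learner is already gone, instead of
// returning an error for a change that, for all practical purposes, succeeded.
func removeDemotedLearner(desc descriptor, storeID int) error {
	typ, ok := desc.replicas[storeID]
	if !ok {
		// Someone else (mergeQueue, StoreRebalancer, ...) already removed it;
		// treat the replication change as having succeeded.
		return nil
	}
	if typ != "LEARNER" {
		return fmt.Errorf("expected s%d to be a LEARNER, found %s", storeID, typ)
	}
	delete(desc.replicas, storeID)
	return nil
}

func main() {
	desc := descriptor{replicas: map[int]string{4: "VOTER", 6: "VOTER", 1: "VOTER"}} // s5 already gone
	fmt.Println(removeDemotedLearner(desc, 5)) // <nil>: graceful no-op
}
```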

Relates to #79118

Release note: None

Jira issue: CRDB-14809

Co-authored-by: Aayush Shah <[email protected]>
@AlexTalks AlexTalks added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Apr 19, 2022
craig bot pushed a commit that referenced this issue Apr 26, 2022
79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds. Only works with initial_scan='only', and
does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif

Fixes #75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This patch simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in #80339);
- observing a reconciler checkpoint in the restore job (work since we
  would have flushed out RESTORE's descriptor deletions and separately
  handle the RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for #73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

```
   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id
```

Release note: None

80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed
sessions and did not support the ability to filter by username or
session status. This commit adds the "Closed" session status to closed
sessions and adds the ability to filter by username and session status.

Session Status:
https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames:
https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now
display closed sessions; sessions overview page now has username and session
status filters

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Sherman Grewal <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
@aayushshah15
Contributor

Closed by 466f42c
