roachtest: rebalance/by-load/replicas failed #79118

Closed
cockroach-teamcity opened this issue Mar 31, 2022 · 5 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-kv KV Team

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Mar 31, 2022

roachtest.rebalance/by-load/replicas failed with artifacts on master @ e4fcae1625a35917d08a2a569d24529b0cc089d2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /artifacts/rebalance/by-load/replicas/run_1
	rebalance_load.go:128,rebalance_load.go:155,test_runner.go:875: timed out before leases were evenly spread
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-14583

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 31, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 31, 2022
@nvanbenschoten
Member

12:26:56 rebalance_load.go:191: not all nodes have a lease yet: [s4: 1, s5: 1, s6: 4]
12:27:01 rebalance_load.go:191: not all nodes have a lease yet: [s4: 1, s5: 1, s6: 4]
...
12:31:51 rebalance_load.go:191: not all nodes have a lease yet: [s2: 1, s3: 1, s4: 1, s5: 2, s6: 1]
12:31:56 rebalance_load.go:191: not all nodes have a lease yet: [s2: 1, s3: 1, s4: 1, s5: 2, s6: 1]
12:32:01 test_impl.go:312: test failure: 	rebalance_load.go:128,rebalance_load.go:155,test_runner.go:875: timed out before leases were evenly spread

Over the course of 5 minutes, the leases on this table fail to make their way to store 1.

We have high-verbosity logging enabled and can see why store 5 is not transferring leases away:

I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078  ShouldTransferLease (lease-holder=5):
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  candidate: avg-ranges=39.333333333333336 avg-leases=8.333333333333332 avg-disk-usage=10 MiB avg-queries-per-second=4180.5905469332265›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  3: ranges=39 leases=8 disk-usage=2.3 MiB queries-per-second=4171.06 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  1: ranges=38 leases=8 disk-usage=972 KiB queries-per-second=11.51 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  4: ranges=40 leases=7 disk-usage=19 MiB queries-per-second=4166.16 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  6: ranges=39 leases=8 disk-usage=18 MiB queries-per-second=3993.85 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  5: ranges=40 leases=10 disk-usage=18 MiB queries-per-second=8552.84 l0-sublevels=0›
I220331 12:31:59.544492 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1813 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1078 +‹  2: ranges=40 leases=9 disk-usage=2.6 MiB queries-per-second=4188.13 l0-sublevels=0›
I220331 12:31:59.544600 1272396 kv/kvserver/pkg/kv/kvserver/allocator.go:1843 ⋮ [n5,replicate,s5,r37/2:‹/Table/4{1-2}›] 1079  ShouldTransferLease decision (lease-holder=5): false
1: ranges=38 leases=8 disk-usage=972 KiB queries-per-second=11.51
...
5: ranges=40 leases=10 disk-usage=18 MiB queries-per-second=8552.84
...
ShouldTransferLease decision (lease-holder=5): false

This is surprising to me.

@aayushshah15
Contributor

This is surprising to me.

I think that's expected: that output comes from the replicate queue, which only considers lease counts when deciding whether to transfer a lease away.
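
To make that distinction concrete, here is a minimal, self-contained sketch (hypothetical types and an arbitrary slack threshold, not the actual allocator code) of a lease-count-only check applied to the store stats from the log above. It returns false even though s5 is serving roughly twice the mean QPS:

```go
package main

import "fmt"

// storeStats mirrors the per-store fields in the log lines above
// (a hypothetical type, not the actual allocator structs).
type storeStats struct {
	id     int
	leases int
	qps    float64
}

// shouldTransferLeaseByCount is a toy, count-only check: transfer only if the
// holder has meaningfully more leases than the mean. The slack of 2 leases is
// arbitrary; the point is that QPS never enters the decision.
func shouldTransferLeaseByCount(holder storeStats, all []storeStats) bool {
	total := 0
	for _, s := range all {
		total += s.leases
	}
	mean := float64(total) / float64(len(all))
	return float64(holder.leases) > mean+2
}

func main() {
	stores := []storeStats{
		{1, 8, 11.51}, {2, 9, 4188.13}, {3, 8, 4171.06},
		{4, 7, 4166.16}, {5, 10, 8552.84}, {6, 8, 3993.85},
	}
	fmt.Println(shouldTransferLeaseByCount(stores[4], stores)) // false: 10 leases vs mean ~8.3
}
```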


Looking at the logs, we see that the store-rebalancer does detect the expected rebalance opportunity.

kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:732 ⋮ [n5,s5,store-rebalancer] 812 rebalancing voter (qps=4105.78) for r47 on n5,s5 to n1,s1 in order to improve QPS balance

Then, this is what we see with that rebalance:

kv/kvserver/pkg/kv/kvserver/store_rebalancer.go:395 ⋮ [n5,s5,store-rebalancer] 831 unable to relocate range to [n1,s1 n6,s6 n4,s4]: while carrying out changes [{ADD_VOTER n1,s1} {REMOVE_VOTER n5,s5}]: removing learners from r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n5,s5):7LEARNER, (n1,s1):8, next=9, gen=32, sticky=9223372036.854775807,2147483647]: change replicas of r47 failed: descriptor changed: [expected] r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n5,s5):7LEARNER, (n1,s1):8, next=9, gen=32, sticky=9223372036.854775807,2147483647] != [actual] r47:‹/Table/106/1/-{6148914691236517888-3074457345618258944}› [(n4,s4):1, (n6,s6):4, (n1,s1):8, next=9, gen=33, sticky=9223372036.854775807,2147483647]

The rebalance technically succeeded, but the AdminRelocateRange call failed with an error because the LEARNER replica it was trying to remove had already been cleaned up.

When this AdminRelocateRange call fails, the StoreRebalancer assumes that the replica did not actually move and does not update its in-memory accounting of its QPS. This means it will try to move away more replicas than it should. We see this in the test: s5 has two hot ranges, and it redundantly tries to rebalance both of them away. Both of these rebalances failed for the same reason.

rebalancing r46 (4140.25 qps) to better balance load: voters from (n4,s4):1,(n5,s5):7,(n6,s6):4 to [n1,s1 n4,s4 n6,s6];
...
rebalancing r47 (4105.78 qps) to better balance load: voters from (n4,s4):1,(n6,s6):4,(n5,s5):7 to [n1,s1 n6,s6 n4,s4]; 
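
For illustration, a toy model of the accounting problem (hypothetical names, not the actual StoreRebalancer code): because the failed relocate leaves the local QPS tally untouched, the loop goes on to move the second hot range as well:

```go
package main

import (
	"errors"
	"fmt"
)

// hotRange is a toy stand-in for a hot range tracked by the StoreRebalancer.
type hotRange struct {
	id  int
	qps float64
}

// relocate stands in for AdminRelocateRange: the replica actually moves, but
// the call still returns an error because the learner was already removed.
func relocate(r hotRange) error {
	return errors.New("change replicas failed: descriptor changed")
}

func main() {
	localQPS := 8552.84 // s5's measured QPS
	meanQPS := 4180.59  // the cluster mean from the log above
	hot := []hotRange{{46, 4140.25}, {47, 4105.78}}

	for _, r := range hot {
		if localQPS <= meanQPS {
			break // believed to be balanced; stop moving replicas
		}
		if err := relocate(r); err != nil {
			// The problem described above: on error we assume nothing moved
			// and leave localQPS unchanged, so the next hot range gets moved
			// away as well, even though the first move already succeeded.
			fmt.Printf("r%d: %v (QPS accounting unchanged)\n", r.id, err)
			continue
		}
		localQPS -= r.qps // only adjusted on success
	}
}
```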

I think we could:

  1. Not return an error from AdminChangeReplicas when, at the last step of removing the demoted learner replica, it finds that the learner has already been removed.
  2. In the store-rebalancer, when an AdminRelocateRange call errors out, we should re-read the range's descriptor afresh to determine whether we actually moved any replicas (see the sketch after this list).
  3. We should implement the approach mentioned in this comment, to make these races more unlikely: kvserver: merge queue racing against replicate queue #57129 (comment)
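
Here is a rough sketch of what the second suggestion could look like (all names are hypothetical stand-ins, not the real StoreRebalancer API): on error, re-read the descriptor afresh and only adjust the in-memory QPS accounting if the local replica is in fact gone:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// rangeDesc is a toy stand-in for roachpb.RangeDescriptor.
type rangeDesc struct{ voterStores []int }

// adminRelocateRange is a hypothetical wrapper: the change applied, but the
// call still returns an error.
func adminRelocateRange(ctx context.Context, rangeID int) error {
	return errors.New("descriptor changed")
}

// lookupRangeDescriptor is a hypothetical fresh read of the descriptor.
func lookupRangeDescriptor(ctx context.Context, rangeID int) (rangeDesc, error) {
	return rangeDesc{voterStores: []int{1, 4, 6}}, nil // s5 is already gone
}

// relocateAndReconcile follows suggestion 2: on error, re-read the descriptor
// to decide whether the local replica actually moved before touching the
// in-memory QPS accounting.
func relocateAndReconcile(ctx context.Context, rangeID, localStore int, rangeQPS float64, localQPS *float64) error {
	err := adminRelocateRange(ctx, rangeID)
	if err == nil {
		*localQPS -= rangeQPS
		return nil
	}
	desc, lookupErr := lookupRangeDescriptor(ctx, rangeID)
	if lookupErr != nil {
		return err // can't tell; fall back to the old (pessimistic) behavior
	}
	for _, s := range desc.voterStores {
		if s == localStore {
			return err // still a voter here; leave the accounting alone
		}
	}
	*localQPS -= rangeQPS // the replica did move away; account for it
	return err
}

func main() {
	localQPS := 8552.84
	_ = relocateAndReconcile(context.Background(), 47, 5, 4105.78, &localQPS)
	fmt.Printf("adjusted local QPS: %.2f\n", localQPS)
}
```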

@nvanbenschoten
Member

All three of your suggestions sound beneficial. Of them, the first sounds the most straightforward and least risky. It may be small enough to slip into a 22.1 point release.

How do you feel about removing the release blocker label from this issue and opening two new "quick win" issues for the first two items?

@nvanbenschoten
Member

We should implement the approach mentioned in this comment, to make these races more unlikely

Was there enough in the logs to determine who was racing with the StoreRebalancer?

@aayushshah15 aayushshah15 removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. blocks-22.1.0-beta.2 labels Apr 4, 2022
aayushshah15 added a commit to aayushshah15/cockroach that referenced this issue Apr 4, 2022
This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes cockroachdb#57129
Relates to cockroachdb#79118

Release note: none
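
Roughly, the safeguard described in this commit message amounts to something like the following (a sketch with hypothetical names; the real change lives in Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners and the log-truncation machinery):

```go
package main

import "fmt"

// replica is a toy stand-in for a replica descriptor.
type replica struct {
	storeID   int
	isLearner bool
}

// snapshotInFlight stands in for the in-memory lock on log truncations that
// is held while a learner's initial snapshot is being sent.
var snapshotInFlight = map[int]bool{5: true}

// learnersSafeToRemove filters out learners whose initial snapshot is still
// in flight, so a concurrent queue does not interrupt that replication change.
func learnersSafeToRemove(replicas []replica) []replica {
	var out []replica
	for _, r := range replicas {
		if !r.isLearner {
			continue
		}
		if snapshotInFlight[r.storeID] {
			continue // another change is mid-flight; leave this learner alone
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rs := []replica{{4, false}, {6, false}, {5, true}, {1, false}}
	fmt.Println(learnersSafeToRemove(rs)) // []: s5's learner is protected
}
```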
craig bot pushed a commit that referenced this issue Apr 12, 2022
79405: kvserver: gracefully handle races during learner removals r=aayushshah15 a=aayushshah15

Previously, at the end of a replication change, if the `ChangeReplicas` request
found that the (demoted) learner replica it was trying to remove from the range
had already been removed (presumably because it raced with the mergeQueue, the
`StoreRebalancer`, or something else), it would error out. This was
unfortunate, because, for all practical purposes, the replication change _had
succeeded_.

We can now gracefully handle this instead by no-oping if we detect that the
replica we were trying to remove has already been removed.
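
In rough terms, the graceful handling amounts to something like this (hypothetical names, not the actual ChangeReplicas code): if the descriptor no longer lists the learner we meant to remove, treat the change as done instead of returning a "descriptor changed" error:

```go
package main

import "fmt"

// descriptor is a toy stand-in for a range descriptor: storeID -> replica type.
type descriptor struct{ replicas map[int]string }

// removeDemotedLearner no-ops if the learner is already gone, instead of
// returning an error for a change that, for all practical purposes, succeeded.
func removeDemotedLearner(desc descriptor, storeID int) error {
	typ, ok := desc.replicas[storeID]
	if !ok {
		// Someone else (mergeQueue, StoreRebalancer, ...) already removed it;
		// treat the replication change as having succeeded.
		return nil
	}
	if typ != "LEARNER" {
		return fmt.Errorf("expected s%d to be a LEARNER, found %s", storeID, typ)
	}
	delete(desc.replicas, storeID)
	return nil
}

func main() {
	desc := descriptor{replicas: map[int]string{4: "VOTER", 6: "VOTER", 1: "VOTER"}} // s5 already gone
	fmt.Println(removeDemotedLearner(desc, 5)) // <nil>: graceful no-op
}
```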

Relates to #79118

Release note: None

Jira issue: CRDB-14809

Co-authored-by: Aayush Shah <[email protected]>
@AlexTalks AlexTalks added the S-1 High impact: many users impacted, serious risk of high unavailability or data loss label Apr 19, 2022
craig bot pushed a commit that referenced this issue Apr 26, 2022
79379: kvserver: avoid races where replication changes can get interrupted r=aayushshah15 a=aayushshah15

This commit adds a safeguard inside
`Replica.maybeLeaveAtomicChangeReplicasAndRemoveLearners()` to avoid removing
learner replicas _when we know_ that that learner replica is in the process of
receiving its initial snapshot (as indicated by an in-memory lock on log
truncations that we place while the snapshot is in-flight).

This change should considerably reduce the instances where `AdminRelocateRange`
calls are interrupted by the mergeQueue or the replicateQueue (and vice versa).

Fixes #57129
Relates to #79118

Release note: none

Jira issue: CRDB-14769

79853: changefeedccl: support a CSV format for changefeeds r=sherman-grewal a=sherman-grewal

In this PR, we introduce a new CSV format for changefeeds.
Note that this format is only supported with the
initial_scan='only' option. For instance, one can now
execute:

CREATE CHANGEFEED FOR foo WITH format=csv, initial_scan='only';

Release note (enterprise change): Support a CSV format for
changefeeds. Only works with initial_scan='only', and
does not work with diff/resolved options.

80397: spanconfig: handle mismatched desc types post-restore r=irfansharif a=irfansharif

Fixes #75831, an annoying bug in the intersection between the span
configs infrastructure + backup/restore.

It's possible to observe mismatched descriptor types for the same ID
post-RESTORE, an invariant the span configs infrastructure relies on.
This patch simply papers over this mismatch, kicking off a full
reconciliation process to recover if it occurs. Doing something "better"
is a lot more invasive, the options being:
- pausing the reconciliation job during restore (prototyped in #80339);
- observing a reconciler checkpoint in the restore job (work since we
  would have flushed out RESTORE's descriptor deletions and separately
  handle the RESTORE's descriptor additions -- them having different
  types would not fire the assertion);
- re-keying restored descriptors to not re-use the same IDs as existing
  schema objects.

While here, we add a bit of plumbing/testing to make the future
work/testing for #73694 (using reconciler checkpoints on retries)
easier. This PR also sets the stage for the following pattern around use
of checkpoints:
1. We'll use checkpoints and incrementally reconcile during job-internal
   retries (added in #78117);
2. We'll always fully reconcile (i.e. ignore checkpoints) when the job
   itself is bounced around.

We do this because we need to fully reconcile across job restarts if the
reason for the restart is due to RESTORE-induced errors. This is a bit
unfortunate, and if we want to improve on (2), we'd have to persist job
state (think "poison pill") that ensures that we ignore the persisted
checkpoint. As of this PR, the only use of job-persisted checkpoints are
the migrations rolling out this infrastructure. That said, now we'll
have a mechanism to force a full reconciliation attempt -- we can:

```
   -- get $job_id
   SELECT job_id FROM [SHOW AUTOMATIC JOBS]
   WHERE job_type = 'AUTO SPAN CONFIG RECONCILIATION'

   PAUSE JOB $job_id
   RESUME JOB $job_id
```

Release note: None

80410: ui: display closed sessions, add username and session status filter r=gtr a=gtr

Fixes #67888, #79914.

Previously, the sessions page UI did not support displaying closed
sessions and did not support the ability to filter by username or
session status. This commit adds the "Closed" session status to closed
sessions and adds the ability to filter by username and session status.

Session Status:
https://user-images.githubusercontent.com/35943354/164794955-5a48d6c2-589d-4f05-b476-b30b114662ee.mov

Usernames:
https://user-images.githubusercontent.com/35943354/164797165-f00f9760-7127-4f2a-96bd-88f691395693.mov

Release note (ui change): sessions overview and session details pages now
display closed sessions; sessions overview page now has username and session
status filters

Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Sherman Grewal <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Gerardo Torres <[email protected]>
@aayushshah15
Contributor

Closed by 466f42c
