kvserver: TestReplicateQueueDeadNonVoters timing out #76996

amygao9 · 2022-02-24T18:55:08Z

Describe the problem

TestReplicateQueueDeadNonVoters can time out because of a race between the replicateQueue's snapshots and the
raftSnapshotQueue's snapshots. When the raftSnapshotQueue wins this race, the replicateQueue's snapshot fails and
we mark the store as throttled for 5s, which makes the store an invalid rebalance target. When this race occurs 9 times, rebalancing stalls and the test fails, because the call to forceScanOnAllReplicationQueues inside checkReplicaCount takes longer than 45 seconds.

See comments in: #75251

Jira issue: CRDB-13363

Previously, the wrong object was being cast, which caused the type assertion for NodeLivenessKnobs to fail. As a result, the testing knob was never attached in some test cases. This patch ensures the correct object is cast. This uncovered a bug where a stale range descriptor was returned in setup fns, so this commit also fixes that by having `checkReplicaCounts` take a pointer to a range descriptor. Lastly, due to these fixes we found that a race between the `replicateQueue`'s snapshots and the `raftSnapshotQueue`'s causes the `TestReplicateQueueDeadNonVoters` test to time out. A new issue, cockroachdb#76996, was created to harden this test, but the test will be skipped for the time being.

Release note: None
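As a rough illustration of how asserting on the wrong object can silently drop a testing knob, here is a minimal sketch. The types `livenessKnobs` and `testingKnobs` and the function `attach` are hypothetical stand-ins and do not mirror the real kvserver definitions.

```go
package main

import "fmt"

// livenessKnobs is a hypothetical stand-in for a node liveness testing knob.
type livenessKnobs struct {
	disableHeartbeats bool
}

// testingKnobs is a hypothetical container that holds the liveness knobs.
type testingKnobs struct {
	nodeLiveness interface{} // holds a *livenessKnobs when set
}

// attach returns the liveness knobs carried by obj, or nil when the type
// assertion fails. The comma-ok form avoids a panic, so passing the wrong
// object fails silently and the knob is simply never attached.
func attach(obj interface{}) *livenessKnobs {
	k, ok := obj.(*livenessKnobs)
	if !ok {
		return nil
	}
	return k
}

func main() {
	knobs := testingKnobs{nodeLiveness: &livenessKnobs{disableHeartbeats: true}}

	// Asserting on the container: the assertion fails and nothing is attached.
	fmt.Println("container attached:", attach(knobs) != nil)

	// Asserting on the field that actually holds the knobs: succeeds.
	fmt.Println("field attached:", attach(knobs.nodeLiveness) != nil)
}
```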

ajwerner · 2022-02-28T16:22:01Z

Another flake: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Test/4473979?showRootCauses=false&expandBuildChangesSection=true&expandBuildDeploymentsSection=true&expandBuildProblemsSection=true&expandBuildTestsSection=true

rafiss · 2023-08-18T16:56:00Z

This test is skipped here:

cockroach/pkg/kv/kvserver/replicate_queue_test.go

Line 1034 in 60fa6c0

skip.WithIssue(t, 76996)

`TestReplicateQueueDeadNonVoters` would flake periodically due to racing between the snapshot and replicate queue. Add a filter to queue processing in order to only process the scratch range. This speeds up the test and prevents timing conditions between the two queues under stress. The test remains skipped under stress-race due to its size (5 nodes) causing timeouts.

Resolves: cockroachdb#76996
Epic: none
Release note: None
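The idea of filtering queue processing down to a single range can be sketched as below. This is not the actual change from the PR: `rangeID`, `queueFilter`, `onlyRange`, `process`, and the range ID 42 are all hypothetical names and values chosen for illustration.

```go
package main

import "fmt"

// rangeID is a hypothetical identifier for a range.
type rangeID int64

// queueFilter reports whether a queue should process the given range.
type queueFilter func(id rangeID) bool

// onlyRange builds a filter that accepts exactly one range, for example
// the scratch range a test cares about.
func onlyRange(target rangeID) queueFilter {
	return func(id rangeID) bool { return id == target }
}

// process returns the subset of candidate ranges that pass the filter,
// standing in for the work a queue would actually perform on them.
func process(candidates []rangeID, keep queueFilter) []rangeID {
	var out []rangeID
	for _, id := range candidates {
		if keep(id) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	scratch := rangeID(42) // hypothetical scratch range ID
	processed := process([]rangeID{1, 7, 42, 99}, onlyRange(scratch))
	fmt.Println("processed ranges:", processed)
}
```

With a filter like this, both queues skip every range except the one under test, which shrinks the window in which they can interfere with each other.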

110057: kvserver: unskip and deflake replicate queue dead non voters r=andrewbaptist a=kvoli

TestReplicateQueueDeadNonVoters would flake periodically due to racing between the snapshot and replicate queue. Add a filter to queue processing in order to only process the scratch range. This speeds up the test and prevents timing conditions between the two queues under stress. The test remains skipped under stress-race due to its size (5 nodes) causing timeouts.

Resolves: #76996
Epic: none
Release note: None

110818: kvserver: export secondary cache counter metrics r=joshimhoff a=joshimhoff

Fixes #110899.

**kvserver: export secondary cache counter metrics**

This commit exports counter metrics for the secondary cache added in cockroachdb/pebble#2760. It does not export the histogram metrics added in that PR. While working on this one, I realized a follow-up PR is needed to export the bucketing scheme to CRDB.

Release note: None.

111039: sql: fix flaky TestHotRangesStats r=j82w a=j82w

The test failed with timeouts under stress. The test is switched to use SucceedsSoon, which allows for more time when under stress.

Fixes: #111036
Release note: None

Co-authored-by: Austen McClernon
Co-authored-by: Josh Imhoff
Co-authored-by: j82w

amygao9 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 24, 2022

amygao9 assigned aayushshah15 Feb 24, 2022

aayushshah15 removed their assignment Jul 4, 2022

AlexTalks mentioned this issue Aug 25, 2022

kvserver: fix replicate queue metrics tracking bug #86844

Merged

rafiss added skipped-test T-kv KV Team labels Aug 18, 2023

kvoli added the branch-master Failures and bugs on the master branch. label Aug 25, 2023

kvoli self-assigned this Aug 29, 2023

kvoli mentioned this issue Sep 5, 2023

kvserver: unskip and deflake replicate queue dead non voters #110057

Merged

craig bot closed this as completed in 399eaf7 Sep 21, 2023

github-project-automation bot added this to KV Aug 28, 2024

github-project-automation bot moved this to Closed in KV Aug 28, 2024
