kvserver: TestReplicateQueueDeadNonVoters timing out #76996

Closed
amygao9 opened this issue Feb 24, 2022 · 2 comments · Fixed by #110057
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test T-kv KV Team

Comments

@amygao9
Contributor

amygao9 commented Feb 24, 2022

Describe the problem

TestReplicateQueueDeadNonVoters times out in scenarios where the replicateQueue's snapshots race with the
raftSnapshotQueue's. When the raftSnapshotQueue wins this race, the replicateQueue's
snapshot fails and we mark the store as throttled for 5s, which makes the store an invalid rebalance target. Thus, when this race occurs 9 times, rebalancing stalls and the test fails, since the call to forceScanOnAllReplicationQueues inside checkReplicaCount takes longer than 45 seconds.
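
The arithmetic behind the timeout can be sketched as follows. This is only an illustration: the 5s throttle and the 45s budget come from the description above, while the function names are hypothetical and the sketch assumes the throttle windows do not overlap.

```go
package main

import (
	"fmt"
	"time"
)

// Values quoted in the issue description.
const (
	throttleAfterFailedSnapshot = 5 * time.Second  // store is throttled after a failed snapshot
	scanDeadline                = 45 * time.Second // time budget quoted for the forced scan
)

// throttledTime returns the total time a store is excluded as a rebalance
// target after it loses the race lostRaces times, assuming the throttle
// windows do not overlap (an assumption made for this sketch).
func throttledTime(lostRaces int) time.Duration {
	return time.Duration(lostRaces) * throttleAfterFailedSnapshot
}

// racesToExhaust returns the smallest number of lost races whose combined
// throttle time reaches the given deadline.
func racesToExhaust(deadline time.Duration) int {
	n := int(deadline / throttleAfterFailedSnapshot)
	if time.Duration(n)*throttleAfterFailedSnapshot < deadline {
		n++
	}
	return n
}

func main() {
	fmt.Println(throttledTime(9))             // 45s
	fmt.Println(racesToExhaust(scanDeadline)) // 9
}
```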

See comments in: #75251

Jira issue: CRDB-13363

@amygao9 amygao9 added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Feb 24, 2022
amygao9 added a commit to amygao9/cockroach that referenced this issue Feb 24, 2022
Previously the wrong object was being cast, which caused the type
assertion to fail for NodeLivenessKnobs. Thus the testing knob was never
attached in some test cases. This patch ensures the correct object is
cast. This uncovered a bug where a stale range descriptor was returned
in setup fns, so this commit also patches that by having
`checkReplicaCounts` take a pointer to a range descriptor. Lastly, due to
these fixes we found that a race between the `replicateQueue`'s snapshots and
the `raftSnapshotQueue`'s causes the `TestReplicateQueueDeadNonVoters` test
to time out. A new issue cockroachdb#76996 was created to harden this test, but the
test will be skipped for the time being.

Release note: None
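
The type assertion failure described in the commit message above is easy to miss because a comma-ok assertion fails silently. A minimal sketch of that failure mode follows; the struct layout and the `attachKnobs` helper are hypothetical stand-ins, not the actual CockroachDB types.

```go
package main

import "fmt"

// NodeLivenessKnobs is a hypothetical stand-in for the real testing knobs.
type NodeLivenessKnobs struct {
	Enabled bool
}

// ServerKnobs is a hypothetical container that stores knobs behind an
// empty interface, so callers must type-assert to get them back.
type ServerKnobs struct {
	NodeLiveness interface{}
}

// attachKnobs returns the liveness knobs if v holds them and reports
// whether the type assertion succeeded.
func attachKnobs(v interface{}) (NodeLivenessKnobs, bool) {
	knobs, ok := v.(NodeLivenessKnobs)
	return knobs, ok
}

func main() {
	server := ServerKnobs{NodeLiveness: NodeLivenessKnobs{Enabled: true}}

	// Wrong object: asserting on the container fails without any error,
	// so the knob would never be attached.
	_, ok := attachKnobs(server)
	fmt.Println(ok) // false

	// Correct object: asserting on the field that holds the knobs succeeds.
	knobs, ok := attachKnobs(server.NodeLiveness)
	fmt.Println(ok, knobs.Enabled) // true true
}
```

Checking the `ok` result, or failing the test when it is false, would surface this kind of mistake immediately.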
@rafiss
Collaborator

rafiss commented Aug 18, 2023

this test is skipped here:

skip.WithIssue(t, 76996)

@kvoli kvoli added the branch-master Failures and bugs on the master branch. label Aug 25, 2023
@kvoli kvoli self-assigned this Aug 29, 2023
kvoli added a commit to kvoli/cockroach that referenced this issue Sep 14, 2023
`TestReplicateQueueDeadNonVoters` would flake periodically due to racing
between the snapshot and replicate queue.

Add a filter to queue processing in order to only process the scratch
range. This speeds up the test and prevents timing conditions between
the two queues under stress. The test remains skipped under stress-race
due to its size (5 nodes) causing timeouts.

Resolves: cockroachdb#76996
Epic: none
Release note: None
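
The idea of filtering queue processing down to a single range can be sketched as below. This is not the change made in the PR; the `queue` type, the `filter` field, and the range IDs are hypothetical and only illustrate the technique.

```go
package main

import "fmt"

// RangeID is a hypothetical range identifier.
type RangeID int

// queue processes only the ranges accepted by filter.
// A nil filter accepts every range.
type queue struct {
	filter    func(RangeID) bool
	processed []RangeID
}

// maybeProcess processes the range if the filter allows it and reports
// whether the range was processed.
func (q *queue) maybeProcess(id RangeID) bool {
	if q.filter != nil && !q.filter(id) {
		return false
	}
	q.processed = append(q.processed, id)
	return true
}

func main() {
	const scratch RangeID = 42 // hypothetical scratch range ID

	q := &queue{filter: func(id RangeID) bool { return id == scratch }}
	for _, id := range []RangeID{1, 42, 7} {
		q.maybeProcess(id)
	}
	fmt.Println(q.processed) // [42]
}
```

With unrelated ranges filtered out, the queues do less work per scan and have fewer opportunities to interfere with each other.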
craig bot pushed a commit that referenced this issue Sep 21, 2023
110057: kvserver: unskip and deflake replicate queue dead non voters r=andrewbaptist a=kvoli

TestReplicateQueueDeadNonVoters would flake periodically due to racing
between the snapshot and replicate queue.

Add a filter to queue processing in order to only process the scratch
range. This speeds up the test and prevents timing conditions between
the two queues under stress. The test remains skipped under stress-race
due to its size (5 nodes) causing timeouts.

Resolves: #76996
Epic: none
Release note: None

110818: kvserver: export secondary cache counter metrics r=joshimhoff a=joshimhoff

Fixes #110899.

**kvserver: export secondary cache counter metrics**

This commit exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. It doesn't export the histogram metrics added in that PR. While working on this one, I realized a follow-up PR is needed to export the bucketing scheme to CRDB.

Release note: None.

111039: sql: fix flaky TestHotRangesStats r=j82w a=j82w

The test failed with timeouts under stress. The test is switched to use SucceedsSoon, which allows for more time when under stress.

Fixes: #111036

Release note: None
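
The retry-until-success pattern behind a helper like SucceedsSoon can be sketched as follows. This is a generic illustration, not CockroachDB's implementation; `retryUntil`, its parameters, and `errNotReady` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a hypothetical sentinel used by the example condition.
var errNotReady = errors.New("not ready")

// retryUntil calls fn repeatedly, sleeping interval between attempts,
// until fn returns nil or timeout has elapsed. On timeout it returns
// the last error from fn, wrapped with context.
func retryUntil(timeout, interval time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	attempts := 0
	err := retryUntil(time.Second, time.Millisecond, func() error {
		attempts++
		if attempts < 3 {
			return errNotReady
		}
		return nil
	})
	fmt.Println(err, attempts) // <nil> 3
}
```

A test built on this pattern tolerates slow runs because it only fails once the whole time budget is spent, rather than after a single fixed wait.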

Co-authored-by: Austen McClernon
Co-authored-by: Josh Imhoff
Co-authored-by: j82w
@craig craig bot closed this as completed in 399eaf7 Sep 21, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024

5 participants