kvserver: TestReplicateQueueDeadNonVoters timing out #76996
Labels
branch-master
Failures and bugs on the master branch.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
skipped-test
T-kv
KV Team
Comments
amygao9 added the C-bug label Feb 24, 2022
amygao9 added a commit to amygao9/cockroach that referenced this issue Feb 24, 2022
Previously the wrong object was being cast, which caused the type assertion to fail for NodeLivenessKnobs. Thus the testing knob was never attached in some test cases. This patch ensures the correct object is cast. This uncovered a bug where a stale range descriptor was returned in setup fns, so this commit also patches that by having `checkReplicaCounts` take a pointer to a range descriptor. Lastly, due to these fixes we found that a race between the `replicateQueue`'s snapshots and the `raftSnapshotQueue`'s causes the `TestReplicateQueueDeadNonVoters` test to time out. A new issue cockroachdb#76996 was created to harden this test, but the test will be skipped for the time being. Release note: None
This test is skipped here: cockroach/pkg/kv/kvserver/replicate_queue_test.go, line 1034 in 60fa6c0
kvoli added a commit to kvoli/cockroach that referenced this issue Sep 14, 2023
`TestReplicateQueueDeadNonVoters` would flake periodically due to racing between the snapshot and replicate queue. Add a filter to queue processing in order to only process the scratch range. This speeds up the test and prevents timing conditions between the two queues under stress. The test remains skipped under stress-race due to its size (5 nodes) causing timeouts. Resolves: cockroachdb#76996 Epic: none Release note: None
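The filtering idea in the commit message above can be illustrated with a minimal sketch. This is not the CockroachDB implementation: `rangeID`, `queue`, `maybeProcess`, and `onlyRange` are hypothetical names invented for illustration, and the real testing knob and queue types may look quite different. The sketch only shows the general technique of giving a queue a predicate so that it ignores every range except the one the test cares about.

```go
package main

import "fmt"

// rangeID is a hypothetical stand-in for a range identifier.
type rangeID int

// queue is a hypothetical, heavily simplified processing queue.
// A nil filter means every range is processed.
type queue struct {
	filter    func(rangeID) bool
	processed []rangeID
}

// maybeProcess processes the range only if the filter accepts it,
// and reports whether the range was processed.
func (q *queue) maybeProcess(id rangeID) bool {
	if q.filter != nil && !q.filter(id) {
		return false
	}
	q.processed = append(q.processed, id)
	return true
}

// onlyRange returns a filter that accepts a single range, for example
// the scratch range a test operates on.
func onlyRange(target rangeID) func(rangeID) bool {
	return func(id rangeID) bool { return id == target }
}

func main() {
	const scratch rangeID = 42 // hypothetical scratch range ID
	q := &queue{filter: onlyRange(scratch)}
	for _, id := range []rangeID{1, 2, 42, 57} {
		q.maybeProcess(id)
	}
	fmt.Println(q.processed) // prints [42]
}
```

With such a filter in place, work on unrelated ranges cannot compete with the range under test, which is the effect the commit message describes.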
craig bot pushed a commit that referenced this issue Sep 21, 2023
110057: kvserver: unskip and deflake replicate queue dead non voters r=andrewbaptist a=kvoli

`TestReplicateQueueDeadNonVoters` would flake periodically due to racing between the snapshot and replicate queue. Add a filter to queue processing in order to only process the scratch range. This speeds up the test and prevents timing conditions between the two queues under stress. The test remains skipped under stress-race due to its size (5 nodes) causing timeouts.

Resolves: #76996
Epic: none
Release note: None

110818: kvserver: export secondary cache counter metrics r=joshimhoff a=joshimhoff

Fixes #110899.

**kvserver: export secondary cache counter metrics**

This commit exports counter metrics regarding the secondary cache added at cockroachdb/pebble#2760. This commit doesn't export the histogram metrics added in that PR. While working on this one, I have realized a follow-up PR is needed to export the bucketing scheme to CRDB.

Release note: None.

111039: sql: fix flaky TestHotRangesStats r=j82w a=j82w

The test failed with timeouts under stress. The test is switched to use SucceedsSoon, which allows for more time when under stress.

Fixes: #111036
Release note: None

Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Josh Imhoff <[email protected]>
Co-authored-by: j82w <[email protected]>
Describe the problem
`TestReplicateQueueDeadNonVoters` fails in certain scenarios where a race between the `replicateQueue`'s snapshots and the `raftSnapshotQueue`'s causes the test to time out. When the `raftSnapshotQueue` wins this race, the `replicateQueue`'s snapshot fails and we mark the store as throttled for 5s, which makes the store not a valid rebalance target. Thus, when this race occurs 9 times, it stalls rebalancing and the test fails, since the call to `forceScanOnAllReplicationQueues` inside `checkReplicaCount` will take longer than 45 seconds.

See comments in: #75251
Jira issue: CRDB-13363