roachtest: gossip/chaos/nodes=9 failed [needs #138904] #130737
Comments
Well - this is interesting. A story best told in a code-output interleaving:

/pkg/cmd/roachtest/tests/gossip.go#L122-L135

	waitForGossip := func(deadNode int) {
		// 2024/09/14 08:06:29 gossip.go:123: test status: waiting for gossip to exclude dead node 7
		t.Status(fmt.Sprintf("waiting for gossip to exclude dead node %d", deadNode))
		start := timeutil.Now()
		for {
			// 2024/09/14 08:06:29 gossip.go:126: checking if gossip excludes dead node 7 (0s)
			t.L().Printf("checking if gossip excludes dead node %d (%.0fs)\n",
				deadNode, timeutil.Since(start).Seconds())
			// 2024/09/14 08:06:29 gossip.go:88: 1: checking gossip
			// 2024/09/14 08:06:29 gossip.go:92: 1: gossip not ok (dead node 7 present) (0s)
			if gossipOK(start, deadNode) {
				return
			}
			const sleepDur = 1 * time.Second
			// 2024/09/14 08:06:29 gossip.go:132: sleeping for 1s (0s)
			t.L().Printf("sleeping for %s (%.0fs)\n", sleepDur, timeutil.Since(start).Seconds())
			time.Sleep(sleepDur)
		}
	}

and then the next log line we see is:
meaning somehow it took us 27 seconds to
I'm having trouble believing that anything could have delayed this by 26s; maybe a time jump is to blame instead. Either way, I'll file this away as an infra flake.
I raised this question with test-eng in https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1729266529436189.
cc @cockroachdb/test-eng
This test has had a string of weird failures where either a `t.L().Printf` call or `time.Sleep(1s)` takes dozens of seconds. This PR adds a goroutine that gets spawned right before and, unless signaled within 2s by both the Printf and the Sleep having completed, dumps stacks to stderr.

See the main issue cockroachdb#130737.

Closes the duplicates across various branches:
Closes cockroachdb#132651.
Closes cockroachdb#134495.

Epic: none
Release note: None
134430: roachtest: disk bandwidth limiter test should only assert on writes r=sumeerbhola a=aadityasondhi

Since we do not pace reads yet, the test will remain flaky in this assertion, as the system can see unbounded read bandwidth usage and fail the assertion even if writes are paced.

Fixes #131484
Release note: None

134527: roachtest: add debugging to gossip/chaos r=tbg a=tbg

This test has had a string of weird failures where either a `t.L().Printf` call or `time.Sleep(1s)` takes dozens of seconds. This PR adds a goroutine that gets spawned right before and, unless signaled within 2s by both the Printf and the Sleep having completed, dumps stacks to stderr.

See the main issue #130737.

Closes the duplicates across various branches:
Closes #132651.
Closes #134495.

Epic: none
Release note: None

134751: lease: dump stacks if TestDescriptorRefreshOnRetry fails r=rafiss a=rafiss

We added additional logging to help debug a source of flakiness in which the acquisition counts exceed the number of release counts. For that logging to be useful, we need to know the goroutine IDs and stacks.

Marking this as fixing the linked issue so that the next time it fails, we are reminded to look at the logs.

fixes: #134695
Release note: None

134953: kvserver/rangefeed: rename Disconnect to SendError for stream interface r=tbg,stevendanna a=wenyihu6

This patch renames `Disconnect` to `SendError` in the `rangefeed.Stream` interface to clarify its role for sending errors, distinguishing it from other similarly named functions like `registration.disconnect`.

Part of: #110432
Release note: none

Co-authored-by: Steven Danna [email protected]
Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
We've seen unexpected clock jumps in this test's failures. Live migrations are one explanation for this, so let's disable them for this test. NB: Not closing out the linked issue as it happened on Azure, and this termination policy only applies to GCE. Informs cockroachdb#130737 Release note: None
138549: roachtest: terminate gossip roachtest on VM migration r=arulajmani a=arulajmani

We've seen unexpected clock jumps in this test's failures. Live migrations are one explanation for this, so let's disable them for this test.

NB: Not closing out the linked issue as it happened on Azure, and this termination policy only applies to GCE.

Informs #130737
Release note: None

Co-authored-by: Arul Ajmani <[email protected]>
I'll reopen this because until the problem is addressed, the test may fail again, and someone would then have to spend effort re-determining that all we need to do here is wait. Will assign back to T-kv.
roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 5cc8013ada42f9ea03eda661a4f178c141f4f24d:
Parameters:
ROACHTEST_arch=amd64
ROACHTEST_cloud=azure
ROACHTEST_coverageBuild=false
ROACHTEST_cpu=4
ROACHTEST_encrypted=false
ROACHTEST_runtimeAssertionsBuild=false
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
This test on roachdash | Improve this report!
Jira issue: CRDB-42194