roachtest: gossip/chaos/nodes=9 failed [needs #138904] #130737

Open
cockroach-teamcity opened this issue Sep 14, 2024 · 5 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. P-3 Issues/test failures with no fix SLA T-kv KV Team X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Sep 14, 2024

roachtest.gossip/chaos/nodes=9 failed with artifacts on master @ 5cc8013ada42f9ea03eda661a4f178c141f4f24d:

(gossip.go:81).2: gossip did not stabilize (dead node 7) in 26.8s
test artifacts and logs in: /artifacts/gossip/chaos/nodes=9/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-42194

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Sep 14, 2024
@tbg
Member

tbg commented Sep 16, 2024

Well - this is interesting. A story best told in a code-output interleaving:

/pkg/cmd/roachtest/tests/gossip.go#L122-L135

		waitForGossip := func(deadNode int) {
// 2024/09/14 08:06:29 gossip.go:123: test status: waiting for gossip to exclude dead node 7
			t.Status(fmt.Sprintf("waiting for gossip to exclude dead node %d", deadNode))
			start := timeutil.Now()
			for {
// 2024/09/14 08:06:29 gossip.go:126: checking if gossip excludes dead node 7 (0s)
				t.L().Printf("checking if gossip excludes dead node %d (%.0fs)\n",
					deadNode, timeutil.Since(start).Seconds())
// 2024/09/14 08:06:29 gossip.go:88: 1: checking gossip
// 2024/09/14 08:06:29 gossip.go:92: 1: gossip not ok (dead node 7 present) (0s)
				if gossipOK(start, deadNode) {
					return
				}
				const sleepDur = 1 * time.Second
// 2024/09/14 08:06:29 gossip.go:132: sleeping for 1s (0s)
				t.L().Printf("sleeping for %s (%.0fs)\n", sleepDur, timeutil.Since(start).Seconds())
				time.Sleep(sleepDur)
			}
		}

and then the next log line we see is:

2024/09/14 08:06:56 gossip.go:126: checking if gossip excludes dead node 7 (27s)

meaning somehow it took us 27 seconds to time.Sleep(time.Second) and jump to the beginning of the for loop. The subsequent call to gossipOK then fails, since it calls t.Fatal if it's been >= 20s:

2024/09/14 08:06:56 test_impl.go:423: test failure #1: full stack retained in failure_1.log: (gossip.go:81).2: gossip did not stabilize (dead node 7) in 26.8s

I have trouble believing that anything could have stalled the test process for 26s, so a clock jump may be to blame instead. Either way, I'll file this away as an infra flake.

@tbg tbg added X-infra-flake the automatically generated issue was closed due to an infrastructure problem not a product issue and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Sep 16, 2024
@arulajmani arulajmani added the P-3 Issues/test failures with no fix SLA label Sep 30, 2024
@nvanbenschoten
Member

I raised this question with test-eng in https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1729266529436189.

@nvanbenschoten nvanbenschoten added T-testeng TestEng Team and removed T-kv KV Team labels Oct 18, 2024

blathers-crl bot commented Oct 18, 2024

cc @cockroachdb/test-eng

tbg added a commit to tbg/cockroach that referenced this issue Nov 7, 2024
This test has had a string of weird failures where
either a `t.L().Printf` call or `time.Sleep(1s)`
takes dozens of seconds.

This PR adds a goroutine that gets spawned right
before these calls and, unless signaled within 2s that
both the Printf and the Sleep have completed, dumps
stacks to stderr.

See the main issue cockroachdb#130737.
Closes the duplicates across various branches:

Closes cockroachdb#132651.
Closes cockroachdb#134495.

Epic: none
Release note: None
craig bot pushed a commit that referenced this issue Nov 12, 2024
134430: roachtest: disk bandwidth limiter test should only assert on writes r=sumeerbhola a=aadityasondhi

Since we do not pace reads yet, the test will remain flaky in this assertion, as the system can see unbounded read bandwidth usage and fail the assertion even if writes are paced.

Fixes #131484

Release note: None

134527: roachtest: add debugging to gossip/chaos r=tbg a=tbg

This test has had a string of weird failures where
either a `t.L().Printf` call or `time.Sleep(1s)`
takes dozens of seconds.

This PR adds a goroutine that gets spawned right
before these calls and, unless signaled within 2s that
both the Printf and the Sleep have completed, dumps
stacks to stderr.

See the main issue #130737.
Closes the duplicates across various branches:

Closes #132651.
Closes #134495.

Epic: none
Release note: None


134751: lease: dump stacks if TestDescriptorRefreshOnRetry fails r=rafiss a=rafiss

We added additional logging to help debug a source of flakiness in which
the acquisition counts exceed the number of release counts. For that
logging to be useful, we need to know the goroutine IDs and stacks.

Marking this as fixing the linked issue so that the next time it fails,
we are reminded to look at the logs.

fixes: #134695
Release note: None

134953: kvserver/rangefeed: rename Disconnect to SendError for stream interface r=tbg,stevendanna a=wenyihu6

This patch renames `Disconnect` to `SendError` in the
`rangefeed.Stream` interface to clarify its role for sending
errors, distinguishing it from other similarly named
functions like `registration.disconnect`.

Part of: #110432
Release note: none

Co-authored-by: Steven Danna [email protected]

Co-authored-by: Aaditya Sondhi <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
arulajmani added a commit to arulajmani/cockroach that referenced this issue Jan 7, 2025
We've seen unexpected clock jumps in this test's failures. Live
migrations are one explanation for this, so let's disable them for this
test.

NB: Not closing out the linked issue as it happened on Azure, and this
termination policy only applies to GCE.

Informs cockroachdb#130737

Release note: None
craig bot pushed a commit that referenced this issue Jan 8, 2025
138549: roachtest: terminate gossip roachtest on VM migration r=arulajmani a=arulajmani

We've seen unexpected clock jumps in this test's failures. Live migrations are one explanation for this, so let's disable them for this test.

NB: Not closing out the linked issue as it happened on Azure, and this termination policy only applies to GCE.

Informs #130737

Release note: None

Co-authored-by: Arul Ajmani <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Jan 8, 2025
We've seen unexpected clock jumps in this test's failures. Live
migrations are one explanation for this, so let's disable them for this
test.

NB: Not closing out the linked issue as it happened on Azure, and this
termination policy only applies to GCE.

Informs #130737

Release note: None
@DarrylWong
Contributor

We have #136783 to track reporting of live migrations and #138904 to track the TC runner being overloaded, so I'm going to close this issue out.

@tbg
Member

tbg commented Jan 30, 2025

I'll reopen this because until the problem is addressed, the test may fail again, potentially causing more effort to re-determine that all we need to do here is wait.

Will assign back to T-kv.

@tbg tbg reopened this Jan 30, 2025
@tbg tbg added T-kv KV Team and removed T-testeng TestEng Team labels Jan 30, 2025
@tbg tbg changed the title roachtest: gossip/chaos/nodes=9 failed roachtest: gossip/chaos/nodes=9 failed [needs #138904] Jan 30, 2025