roachtest: replicagc-changed-peers/restart=false failed #66944

Closed
cockroach-teamcity opened this issue Jun 28, 2021 · 8 comments

@cockroach-teamcity
Member

roachtest.replicagc-changed-peers/restart=false failed with artifacts on release-21.1 @ ede1628bd625a07b3f966a36ea841009803fc8a9:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicagc-changed-peers/restart=false/run_1
	replicagc.go:175,replicagc.go:81,replicagc.go:30,test_runner.go:728: replica count on n1 didn't drop to zero: 97

	cluster.go:1667,context.go:89,cluster.go:1656,test_runner.go:809: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3125324-1624860626-37-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1
		3: dead
		5: 11319
		6: 11891
		4: 11448
		2: 11434
		1: 12247
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jun 28, 2021
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone Jun 28, 2021
@tbg
Member

tbg commented Jun 29, 2021

This also failed on master, and the failure there is linked to #63999, which this build already had the backport of (#63999).

Looking at the logs here, we see the same problem - n4 to n6 are throttled throughout the test. But this makes sense because everyone seems to lose liveness and never get it back:

  • 8:17:52: n3 goes down
  • 8:17:58: heartbeats for all remaining nodes fail:

    teamcity-3125324-1624860626-37-n6cpu4-0002> W210628 08:18:12.093032 190 kv/kvserver/liveness/liveness.go:723 ⋮ [n2,liveness-hb] 647 failed node liveness heartbeat: ‹operation "node liveness heartbeat" timed out after 4.5s›: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n3: breaker open›

    though often it's just an opaque "context deadline exceeded".

  • 8:22:52 test terminates (liveness problems persist)

Something pretty bad is going on here. We need to understand what.

@tbg
Member

tbg commented Jun 29, 2021

I looked at the stack traces hoping this was the deadlock we fixed the other day, but the stacks are... very boring. Nothing going on really.

@tbg
Member

tbg commented Jun 29, 2021

The liveness range is healthy (two out of three replicas remaining, the down one being n3 which is by design in this test).

@tbg
Member

tbg commented Jun 30, 2021

Looking at a repro now. We lost liveness for 5 minutes; the liveness heartbeats fail (on all nodes) with:

failed node liveness heartbeat: ‹operation "node liveness heartbeat" timed out after 4.5s›: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n3: breaker open›

This eventually resolves.

The error comes from here:

for {
	if transport.IsExhausted() {
		return noMoreReplicasErr(ambiguousError, lastErr)
	}
	if _, ok := routing.Desc().GetReplicaDescriptorByID(transport.NextReplica().ReplicaID); ok {
		return nil
	}
	transport.SkipReplica()
}

This basically has to be some kind of cache invalidation problem? When we see a SendError, we eject the range descriptor, so on a retry we shouldn't be seeing n3 in it. I checked the range desc and it's on n4,n5,n6 (with both lease and raft leadership on n5). Why are we continuing to try to touch n3?

@tbg
Copy link
Member

tbg commented Jun 30, 2021

Going to take this as an opportunity to test out #65844. Started another repro with that cherry-picked & called in the test at hand. That should tell us what goes on inside of DistSender for these liveness attempts.

@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jul 8, 2021
@tbg
Member

tbg commented Jul 8, 2021

@lunevalex I took the release-blocker label off of this since we don't think that this was introduced recently. Please comment if you disagree.

@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=false failed with artifacts on release-21.1 @ c4d0e7baee3925541eed599ae771abb95c97732b:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/replicagc-changed-peers/restart=false/run_1
	replicagc.go:175,replicagc.go:81,replicagc.go:30,test_runner.go:733: replica count on n1 didn't drop to zero: 91

	cluster.go:1667,context.go:89,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3290967-1628748601-34-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1
		5: 12748
		3: dead (exit status 137)
		6: 12766
		1: 13520
		2: 12770
		4: 13396
		Error: UNCLASSIFIED_PROBLEM: 3: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (3) 3: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false

Same failure on other branches

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

@tbg
Member

tbg commented Aug 25, 2021

Seems to be fixed. I think it must have been the broken roachtest env var passing that we had for a while.

@tbg tbg closed this as completed Aug 25, 2021