
roachtest: replicagc-changed-peers/restart=false failed #67910

Closed
cockroach-teamcity opened this issue Jul 22, 2021 · 8 comments · Fixed by #67916
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Member

roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f0e2aa6abbbbf3318ea20e7dbcbe40819a809b83:

	cluster.go:1230,context.go:89,cluster.go:1218,test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3209830-1626934742-34-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: dead (exit status 137)
		2: dead (exit status 137)
		6: 11613
		4: 12794
		5: 11907
		3: 12657
		Error: UNCLASSIFIED_PROBLEM: 1: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead (exit status 137)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 2: dead (exit status 137)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (4) 1: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false

Same failure on other branches

/cc @cockroachdb/kv-triage


@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 22, 2021
@tbg
Member

tbg commented Jul 22, 2021

@lunevalex this failure occurred with #67319 merged. We're not seeing liveness fail here. The problem is really here:

09:28:12 test_impl.go:310: test failure: replicagc.go:185,replicagc.go:139,replicagc.go:34,test_runner.go:765: replica count on n3 didn't drop to zero: 23

$ ls debug/nodes/3/ranges/
10.json  12.json  19.json  20.json  22.json  27.json  29.json  30.json  36.json  38.json  5.json  9.json
11.json  17.json  1.json   21.json  23.json  28.json  2.json   33.json  37.json  39.json  7.json
$ ls debug/nodes/3/ranges/ | wc -l
23

Note that these are all "small" rangeIDs. They all belong to system tables; r39 is the last such range. The system tables are all supposed to be 5x-replicated. The logs show these messages over a span of minutes:

teamcity-3209830-1626934742-34-n6cpu4-0005> E210722 09:27:57.425353 546408 kv/kvserver/queue.go:1098 ⋮ [n5,replicate,s5,r9/5:‹/Table/1{3-4}›] 8559 avoid up-replicating to fragile quorum: 2 matching stores are currently throttled: [‹throttled because the node is considered suspect› ‹throttled because the node is considered suspect›]

These ranges are generally on n3, n4, n5 (recall that n1 and n2 are down at this point in the test). We check the replica count for five minutes, which happens to match the time until the stores are considered dead. I'm not sure how this ever worked, as it can only pass if the adaptive replication factor drops to three (instead of the five mandated by the zone config), but investigating the code here

// Node count including dead nodes but excluding
// decommissioning/decommissioned nodes.
clusterNodes := a.storePool.ClusterNodeCount()
neededVoters := GetNeededVoters(zone.GetNumVoters(), clusterNodes)

shows that in this situation (two dead but not decommission{ed,ing} nodes, four live nodes) we'll ask for the full five replicas and "always" have.
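
To make the arithmetic concrete, here is a minimal sketch of that adaptive logic. This is an illustrative stand-in for GetNeededVoters, not the actual implementation, which may differ in detail: the zone's voter count is clamped to the cluster node count and kept odd, with a floor of three.

// neededVoters is a simplified sketch of the adaptive replication
// factor: clamp the zone's voter count to the cluster node count,
// then round down to an odd number, with a floor of three.
func neededVoters(zoneVoters, clusterNodes int) int {
	need := zoneVoters
	if clusterNodes < need {
		need = clusterNodes // adapt the replication factor to small clusters
	}
	if need%2 == 0 {
		need-- // avoid even quorum sizes
	}
	if need < 3 {
		need = 3
	}
	if need > zoneVoters {
		need = zoneVoters
	}
	return need
}

Under this sketch, neededVoters(5, 6) == 5: n1 and n2 are dead but neither decommissioning nor decommissioned, so they still count toward clusterNodes, yet only four nodes are live, so the fifth replica has nowhere to go. Earlier in the test, with n1-n3 decommissioning and hence excluded, neededVoters(5, 3) == 3.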

Earlier in the test, we have the system ranges on n1-n3 (the adaptive replication factor is three, since we haven't started n4-n6 yet), and then we stop n3, initiate decommissioning on n1-n3, and start n4-n6. This keeps the adaptive replication factor at three, since decommissioning nodes are excluded, so in principle all replicas would migrate to n4-n6. However, the test never waits for n3 to lose all replicas. It only does so for n1 and n2, and stops both when done. (The reason it doesn't check n3 is that it uses node metrics to find out, but n3 is intentionally down; we could check the meta ranges instead, as in the sketch below.)

So when the test proceeds, it may end up with some system ranges on n3 plus two of n4-n6. We then recommission n1-n3 (but don't restart n1-n2) and restart n3. n1 and n2 are now down but not decommissioning, so the adaptive replication factor goes back to five, and the replicas that are on n3 and n4-n6 can't move, as there aren't five live nodes in the system.
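
As an aside on checking the meta ranges: here is a hypothetical sketch of such a check, polling cluster-wide range metadata through SQL on any live node. The crdb_internal.ranges_no_leases table and the exact query are assumptions for illustration; an actual fix may go about this differently.

import "database/sql"

// rangesOnStore counts the ranges that still list storeID among their
// replicas according to cluster-wide range metadata. Any live node can
// answer this, which matters here because n3 is down and its own
// metrics endpoint is unreachable.
func rangesOnStore(db *sql.DB, storeID int) (int, error) {
	var n int
	err := db.QueryRow(
		`SELECT count(*) FROM crdb_internal.ranges_no_leases
		 WHERE $1::INT = ANY(replicas)`, storeID,
	).Scan(&n)
	return n, err
}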

Long story short, this failure makes sense, but why are we seeing it now, and likely most of the time (since the restart=true flavor, for which the same analysis holds, just failed as well - #67914)?

tbg added a commit to tbg/cockroach that referenced this issue Jul 22, 2021
The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move,
    and there may be a few ranges still on n3
n4-n6: alive

where the ranges are predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on
n3, in contrast to replicas assigned via the meta ranges) on n3 drops to
zero.

However, system ranges cannot move in this configuration. The number of
cluster nodes is six (decommission{ing,ed} nodes would be excluded, but
no nodes are decommission{ing,ed} here) and so the system ranges operate
at a replication factor of five. There are only four live nodes here, so
if n3 is still a member of any system ranges, they will stay there and
the test fails.

This commit attempts to rectify that by making sure that while n3 is
down earlier in the test, all replicas are moved from it. That was
always the intent of the test, which is concerned with n3 realizing
that replicas have moved elsewhere and initiating replicaGC; however
prior to this commit it was always left to chance whether n3 would
or would not have replicas assigned to it by the time the test moved
to the stage above. The reason the test wasn't previously waiting
for all replicas to be moved off n3 while it was down was that it
required checking the meta ranges, which wasn't necessary for the
other two nodes.

This commit passed all five runs of
replicagc-changed-peers/restart=false, so I think it reliably addresses
the problem.

There is still the lingering question of why this is failing only now
(note that both flavors of the test failed on master last night, so
I doubt it is rare). We just merged
cockroachdb#67319 which is likely
somehow related.

Fixes cockroachdb#67910.
Fixes cockroachdb#67914.

Release note: None
@tbg
Member

tbg commented Jul 22, 2021

Feels like this test's behavior must've changed as a result of #67319. I'm not exactly sure how.

@tbg
Member

tbg commented Jul 22, 2021

Fails 5/5 on c3049f4. Passes 5/5 when I revert ab15a0a (#67319). So it's pretty conclusive that that PR changed something. @lunevalex any idea what exactly? I assume that in the earlier phase of the test (n1-n3 decommissioning, n3 down, n4-n6 just started) we are somehow ensuring that no system range ends up on n3 by the time n1 and n2 have shed all their replicas. Maybe we were previously giving priority to moving replicas off n3 (which is down) rather than n1 and n2, and so checking that n1/n2 were empty implied that n3 was empty?

@lunevalex
Collaborator

lunevalex commented Jul 22, 2021

@tbg very interesting. I wonder if #67714 will fix this, because @aayushshah15 saw something very similar at a customer.

@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 5e46fd88b11007ddaf0b5350ed28d11b0c3bfdaf:

	cluster.go:1230,context.go:89,cluster.go:1218,test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3215044-1627021228-37-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 1: dead (exit status 137)
		2: dead (exit status 137)
		6: 11656
		3: 11962
		4: 11438
		5: 12089
		Error: UNCLASSIFIED_PROBLEM: 1: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead (exit status 137)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 2: dead (exit status 137)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (4) 1: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError
Reproduce

To reproduce, try:

## Simple repro (linux-only):
  $ make cockroachshort bin/workload bin/roachprod bin/roachtest
  $ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.

Same failure on other branches

/cc @cockroachdb/kv-triage


@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ b02d22f9b3d30a0288ad1d8464dd6f2d82c08f0d:

	cluster.go:1230,context.go:89,cluster.go:1218,test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3219225-1627107462-35-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 137)
		4: 11914
		3: 12309
		6: 11535
		1: dead (exit status 137)
		5: 11312
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 1: dead (exit status 137)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 1: dead (exit status 137)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (4) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError
Reproduce

To reproduce, try:

## Simple repro (linux-only):
  $ make cockroachshort bin/workload bin/roachprod bin/roachtest
  $ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.

Same failure on other branches

/cc @cockroachdb/kv-triage


@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 9baaa282b3a09977b96bd3e5ae6e2346adfa2c16:

	cluster.go:1230,context.go:89,cluster.go:1218,test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3221075-1627196139-34-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 137)
		1: dead (exit status 137)
		3: 12281
		4: 11480
		6: 11215
		5: 11737
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 1: dead (exit status 137)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 1: dead (exit status 137)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (4) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError
Reproduce

To reproduce, try:

## Simple repro (linux-only):
  $ make cockroachshort bin/workload bin/roachprod bin/roachtest
  $ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.

Same failure on other branches

/cc @cockroachdb/kv-triage


@cockroach-teamcity
Member Author

roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f7528c59e296ed9acd2a20d590f2a42bbad0dcd0:

	cluster.go:1230,context.go:89,cluster.go:1218,test_runner.go:854: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-3222803-1627279934-37-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 137)
		1: dead (exit status 137)
		5: 11794
		6: 11490
		3: 11749
		4: 12402
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 137)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 1: dead (exit status 137)
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:225
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		  | Wraps: (2) 1: dead (exit status 137)
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1168
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:276
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:856
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:960
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:897
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:2087
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:225
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1371
		Wraps: (4) 2: dead (exit status 137)
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError
Reproduce

To reproduce, try:

## Simple repro (linux-only):
  $ make cockroachshort bin/workload bin/roachprod bin/roachtest
  $ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.

Same failure on other branches

/cc @cockroachdb/kv-triage


craig bot pushed a commit that referenced this issue Jul 26, 2021
67526: roachtest: make timeout obvious in posted issues r=stevendanna a=tbg

When a test times out, roachtest will rip the cluster out from under it
to try to force it to terminate. This is essentially guaranteed to
produce a posted issue that sweeps the original reason of the failure
(the timeout) under the rug. Instead, such issues now plainly state
that there was a timeout and refer the readers to the artifacts.

See here for an example issue without this fix: #67464

cc @dt, who pointed this out [internally]

[internally]: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1626098863019500

Release note: None


67824: dev: teach `dev` how to do cross builds r=rail a=rickystewart

Closes #67709.

Release note: None

67825: changefeedccl: immediately stop sending webhook sink rows upon error r=spiffyyeng a=spiffyyeng

Previously, the sink waited until flushing to acknowledge HTTP errors, leaving
any messages between the initial error and flush to potentially be out of
order. Now, errors are checked before each message is sent and the sink is
restarted if one is detected to maintain ordering.

Resolves #67772

Release note: None

67894: sql: add support for unique expression indexes r=mgartner a=mgartner

Release note: None

67916: roachtest: fix replicagc-changed-peers r=aliher1911 a=tbg

The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move,
    and there may be a few ranges still on n3
n4-n6: alive

where the ranges are predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on
n3, in contrast to replicas assigned via the meta ranges) on n3 drops to
zero.

However, system ranges cannot move in this configuration. The number of
cluster nodes is six (decommission{ing,ed} nodes would be excluded, but
no nodes are decommission{ing,ed} here) and so the system ranges operate
at a replication factor of five. There are only four live nodes here, so
if n3 is still a member of any system ranges, they will stay there and
the test fails.

This commit attempts to rectify that by making sure that while n3 is
down earlier in the test, all replicas are moved from it. That was
always the intent of the test, which is concerned with n3 realizing
that replicas have moved elsewhere and initiating replicaGC; however
prior to this commit it was always left to chance whether n3 would
or would not have replicas assigned to it by the time the test moved
to the stage above. The reason the test wasn't previously waiting
for all replicas to be moved off n3 while it was down was that it
required checking the meta ranges, which wasn't necessary for the
other two nodes.

This commit passed all five runs of
replicagc-changed-peers/restart=false, so I think it reliably addresses
the problem.

There is still the lingering question of why this is failing only now
(note that both flavors of the test failed on master last night, so
I doubt it is rare). We just merged
#67319 which is likely
somehow related.

Fixes #67910.
Fixes #67914.

Release note: None


67961: bazel: use `action_config`s over `tool_path`s in cross toolchains r=rail a=rickystewart

This doesn't change much in practice, but does allow us to use the
actual `g++` compiler for C++ compilation, which wasn't the case
before.

The `tool_path` constructor is actually [deprecated](https://github.com/bazelbuild/bazel/blob/203aa773d7109a0bcd9777ba6270bd4fd0edb69f/tools/cpp/cc_toolchain_config_lib.bzl#L419)
in favor of `action_config`s, so this is future-proofing.

Release note: None

67962: bazel: start building geos in ci r=rail a=rickystewart

Only the most recent commit applies for this review --
the other is from #67961.

Closes #66388.

Release note: None

68065: cli: skip TestRemoveDeadReplicas r=irfansharif a=tbg

Refs: #50977

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Ryan Min <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@craig craig bot closed this as completed in 01a3fd5 Jul 26, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 20, 2021
The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with constraint that wants all replicas to move,
    and there may be a few ranges still on n3
n4-n6: alive

where the ranges are predominantly 3x-replicated.

The test is then verifying that the replica count (as in, replicas on
n3, in contrast to replicas assigned via the meta ranges) on n3 drops to
zero.

However, system ranges cannot move in this configuration. The number of
cluster nodes is six (decommission{ing,ed} nodes would be excluded, but
no nodes are decommission{ing,ed} here) and so the system ranges operate
at a replication factor of five. There are only four live nodes here, so
if n3 is still a member of any system ranges, they will stay there and
the test fails.

This commit attempts to rectify that by making sure that while n3 is
down earlier in the test, all replicas are moved from it. That was
always the intent of the test, which is concerned with n3 realizing
that replicas have moved elsewhere and initiating replicaGC; however
prior to this commit it was always left to chance whether n3 would
or would not have replicas assigned to it by the time the test moved
to the stage above. The reason the test wasn't previously waiting
for all replicas to be moved off n3 while it was down was that it
required checking the meta ranges, which wasn't necessary for the
other two nodes.

This commit passed all five runs of
replicagc-changed-peers/restart=false, so I think it reliably addresses
the problem.

There is still the lingering question of why this is failing only now
(note that both flavors of the test failed on master last night, so
I doubt it is rare). We just merged
cockroachdb#67319 which is likely
somehow related.

Fixes cockroachdb#67910.
Fixes cockroachdb#67914.

Release note: None