roachtest: replicagc-changed-peers/restart=false failed #67910
Comments
@lunevalex this failure had #67319. We're not seeing liveness fail here. The problem is really here:
Note that these are all "small" rangeIDs; they all belong to system tables, and r39 is also the last such range. The system tables are all supposed to be 5x-replicated. The logs show these messages over the course of minutes:
These ranges are generally on n3, n4, and n5 (recall that n1 and n2 are down at this point in the test). We check the replica count for five minutes, which happens to match the time until the store is declared dead. I'm not sure how this ever worked, as it can only work if the adaptive replication factor becomes three (instead of the five mandated by the zone), but investigating the code here

cockroach/pkg/kv/kvserver/allocator.go Lines 526 to 529 in 80bc7f6

shows that in this situation (two dead but not decommission{ed,ing} nodes, four live nodes) we'll ask for the full five replicas, and "always" have.

Earlier in the test, we have the system ranges on n1-n3 (the adaptive replication factor is three, since we haven't started n4-n6 yet); we then stop n3, initiate decommissioning on n1-n3, and start n4-n6. This keeps the adaptive replication factor at three, since decommissioning nodes are excluded, so in principle all replicas would migrate to n4-n6. However, the test never waits for n3 to lose all replicas. It only does so for n1 and n2, and stops both when done (it doesn't check n3 because it uses node metrics to find out, and n3 is intentionally down; we could check the meta ranges instead). So when the test proceeds, it may end up with some system ranges on n3 plus two of n4-n6.

We then recommission n1-n3 (but don't restart n1 and n2) and restart n3. n1 and n2 are now down but not decommissioning, so the adaptive replication factor goes back to five, and the replicas that are on n3 and n4-n6 can't move, as there aren't five live nodes in the system.

Long story short, this failure makes sense, but why are we seeing it now, and likely most of the time? (The restart=true flavor, for which the same analysis holds, just failed as well: #67914.)
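To make the clamping concrete, here is a minimal sketch of the adaptive replication factor computation described above. The function name, the exact odd-number rounding, and the example inputs are illustrative assumptions, not the actual allocator.go code referenced above:

```go
package main

import "fmt"

// adaptiveReplicas sketches how the allocator clamps the zone-configured
// replication factor to the number of "cluster nodes", a count that excludes
// decommissioning/decommissioned nodes but still includes nodes that are
// merely down. This is an illustrative approximation, not the real code.
func adaptiveReplicas(zoneReplicas, clusterNodes int) int {
	need := zoneReplicas
	if clusterNodes < need {
		need = clusterNodes
	}
	if need == zoneReplicas {
		return need
	}
	// Avoid even replication factors: they tolerate no more failures than
	// the next-lower odd factor.
	if need%2 == 0 {
		need--
	}
	if need < 3 {
		need = 3
	}
	if need > zoneReplicas {
		need = zoneReplicas
	}
	return need
}

func main() {
	// Earlier phase: n1-n3 decommissioning (excluded), n4-n6 counted.
	fmt.Println(adaptiveReplicas(5, 3)) // 3: system ranges fit on n4-n6
	// Final phase: n1-n3 recommissioned, so all six nodes are counted even
	// though n1 and n2 are down.
	fmt.Println(adaptiveReplicas(5, 6)) // 5: but only four nodes are live
}
```

With only four live nodes and a target of five, any system range that still has a replica on n3 has nowhere to move it, which is exactly the stuck state described above.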
The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with a constraint that wants all replicas to move off it, and there may be a few ranges still on n3
n4-n6: alive, where the ranges are predominantly 3x-replicated

The test then verifies that the replica count on n3 (as in, replicas present on n3, in contrast to replicas assigned via the meta ranges) drops to zero. However, system ranges cannot move in this configuration. The number of cluster nodes is six (decommission{ing,ed} nodes would be excluded, but no nodes are decommission{ing,ed} here), so the system ranges operate at a replication factor of five. There are only four live nodes, so if n3 is still a member of any system ranges, they will stay there and the test fails.

This commit attempts to rectify that by making sure that, while n3 is down earlier in the test, all replicas are moved off it. That was always the intent of the test, which is concerned with n3 realizing that replicas have moved elsewhere and initiating replicaGC; however, prior to this commit it was left to chance whether n3 would or would not have replicas assigned to it by the time the test moved to the stage above. The reason the test wasn't previously waiting for all replicas to be moved off n3 while it was down is that this requires checking the meta ranges, which wasn't necessary for the other two nodes.

This commit passed all five runs of replicagc-changed-peers/restart=false, so I think it reliably addresses the problem. There is still the lingering question of why this is failing only now (note that both flavors of the test failed on master last night, so I doubt it is rare). We just merged cockroachdb#67319, which is likely somehow related.

Fixes cockroachdb#67910.
Fixes cockroachdb#67914.

Release note: None
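As a rough illustration of the kind of check the fix requires (verifying from a live node, while n3 is down, that no range descriptor still lists a replica on n3), here is a minimal sketch. It assumes the crdb_internal.ranges_no_leases virtual table and its replicas column of store IDs (each node in this test has a single store, so store ID 3 corresponds to n3); the helper name, connection string, and query are illustrative, not the actual roachtest code:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// countRangesWithReplicaOnStore returns how many ranges still list the given
// store ID among their replicas, according to the range metadata surfaced by
// crdb_internal.ranges_no_leases. Run this against any live node (e.g. n4),
// since n3 itself is down while we wait for it to be emptied.
func countRangesWithReplicaOnStore(db *sql.DB, storeID int) (int, error) {
	var n int
	err := db.QueryRow(
		`SELECT count(*) FROM crdb_internal.ranges_no_leases WHERE $1::INT = ANY(replicas)`,
		storeID,
	).Scan(&n)
	return n, err
}

func main() {
	// Placeholder connection string; point it at one of the live nodes.
	db, err := sql.Open("postgres", "postgresql://root@n4:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The test would poll this in a retry loop until it reaches zero.
	n, err := countRangesWithReplicaOnStore(db, 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("ranges still referencing s3: %d\n", n)
}
```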
Feels like this test must've changed behavior as a result of #67319. I'm not exactly sure how.
Fails 5/5 on c3049f4. Passes 5/5 when I revert ab15a0a (#67319), so it's pretty conclusive that that PR changed something. @lunevalex any idea what exactly? I assume that in the earlier phase of the test (n1-n3 decommissioning, n3 down, n4-n6 just started) we were somehow ensuring that no system range ends up on n3 by the time n1 and n2 have shed all their replicas. Maybe we were previously giving priority to moving replicas off n3 (which is down) rather than off n1 and n2, so checking that n1/n2 were empty implied that n3 was empty?
@tbg very interesting. I wonder if #67714 will fix this, because @aayushshah15 saw something very similar at a customer.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 5e46fd88b11007ddaf0b5350ed28d11b0c3bfdaf:
Reproduce
To reproduce, try:

## Simple repro (linux-only):
$ make cockroachshort bin/workload bin/roachprod bin/roachtest
$ PATH=$PWD/bin:$PATH roachtest run replicagc-changed-peers/restart=false --local

## Proper repro probably needs more roachtest flags, or running
## the programs remotely on GCE. For more details, refer to
## pkg/cmd/roachtest/README.md.
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ b02d22f9b3d30a0288ad1d8464dd6f2d82c08f0d:
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ 9baaa282b3a09977b96bd3e5ae6e2346adfa2c16:
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f7528c59e296ed9acd2a20d590f2a42bbad0dcd0:
67526: roachtest: make timeout obvious in posted issues r=stevendanna a=tbg

When a test times out, roachtest will rip the cluster out from under it to try to force it to terminate. This is essentially guaranteed to produce a posted issue that sweeps the original reason for the failure (the timeout) under the rug. Instead, such issues now plainly state that there was a timeout and refer readers to the artifacts. See here for an example issue without this fix: #67464

cc @dt, who pointed this out [internally]

[internally]: https://cockroachlabs.slack.com/archives/C023S0V4YEB/p1626098863019500

Release note: None

67824: dev: teach `dev` how to do cross builds r=rail a=rickystewart

Closes #67709.

Release note: None

67825: changefeedccl: immediately stop sending webhook sink rows upon error r=spiffyyeng a=spiffyyeng

Previously, the sink waited until flushing to acknowledge HTTP errors, leaving any messages between the initial error and the flush potentially out of order. Now, errors are checked before each message is sent, and the sink is restarted if one is detected, to maintain ordering. Resolves #67772

Release note: None

67894: sql: add support for unique expression indexes r=mgartner a=mgartner

Release note: None

67916: roachtest: fix replicagc-changed-peers r=aliher1911 a=tbg

The test ends up in the following situation:

n1: down, no replicas
n2: down, no replicas
n3: alive, with a constraint that wants all replicas to move off it, and there may be a few ranges still on n3
n4-n6: alive, where the ranges are predominantly 3x-replicated

The test then verifies that the replica count on n3 (as in, replicas present on n3, in contrast to replicas assigned via the meta ranges) drops to zero. However, system ranges cannot move in this configuration. The number of cluster nodes is six (decommission{ing,ed} nodes would be excluded, but no nodes are decommission{ing,ed} here), so the system ranges operate at a replication factor of five. There are only four live nodes, so if n3 is still a member of any system ranges, they will stay there and the test fails.

This commit attempts to rectify that by making sure that, while n3 is down earlier in the test, all replicas are moved off it. That was always the intent of the test, which is concerned with n3 realizing that replicas have moved elsewhere and initiating replicaGC; however, prior to this commit it was left to chance whether n3 would or would not have replicas assigned to it by the time the test moved to the stage above. The reason the test wasn't previously waiting for all replicas to be moved off n3 while it was down is that this requires checking the meta ranges, which wasn't necessary for the other two nodes.

This commit passed all five runs of replicagc-changed-peers/restart=false, so I think it reliably addresses the problem. There is still the lingering question of why this is failing only now (note that both flavors of the test failed on master last night, so I doubt it is rare). We just merged #67319, which is likely somehow related.

Fixes #67910.
Fixes #67914.

Release note: None

67961: bazel: use `action_config`s over `tool_path`s in cross toolchains r=rail a=rickystewart

This doesn't change much in practice, but does allow us to use the actual `g++` compiler for C++ compilation, which wasn't the case before.

The `tool_path` constructor is actually [deprecated](https://github.com/bazelbuild/bazel/blob/203aa773d7109a0bcd9777ba6270bd4fd0edb69f/tools/cpp/cc_toolchain_config_lib.bzl#L419) in favor of `action_config`s, so this is future-proofing.

Release note: None

67962: bazel: start building geos in ci r=rail a=rickystewart

Only the most recent commit applies for this review -- the other is from #67961. Closes #66388.

Release note: None

68065: cli: skip TestRemoveDeadReplicas r=irfansharif a=tbg

Refs: #50977
Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Ryan Min <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
roachtest.replicagc-changed-peers/restart=false failed with artifacts on master @ f0e2aa6abbbbf3318ea20e7dbcbe40819a809b83:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh replicagc-changed-peers/restart=false