Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: acceptance/bank/node-restart failed #35326

Closed
cockroach-teamcity opened this issue Mar 2, 2019 · 29 comments · Fixed by #40997
Closed

roachtest: acceptance/bank/node-restart failed #35326

cockroach-teamcity opened this issue Mar 2, 2019 · 29 comments · Fixed by #40997
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/032c4980720abc1bdd71e4428e4111e6e6383297

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1158877&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1211: stall detected at round 5, no forward progress for 30s
	cluster.go:1152,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1158877-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1158877-acceptance: starting.............................: signal: killed

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 2, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/de1793532332fb64fca27cafe92d2481d900a5a0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1160394&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1211: stall detected at round 3, no forward progress for 30s
	cluster.go:1192,bank.go:247: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1160394-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1160394-acceptance: stopping and waiting..............................: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 5, no forward progress for 30s
	cluster.go:1170,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1170795-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1170795-acceptance: starting.............................: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/5b36cc6276340282cb333ff4a9cb4f1fbd6c3348

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189990&tab=buildLog

The test failed on release-2.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 2, no forward progress for 30s
	cluster.go:1193,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1189990-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1189990-acceptance: starting.............................: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/7f8a0969e8e9eb7e9fc0d2fe96e03849d30dd561

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1199677&tab=buildLog

The test failed on release-19.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1214: stall detected at round 1, no forward progress for 30s
	cluster.go:1193,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1199677-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1199677-acceptance: starting.............................: signal: killed

@tbg
Copy link
Member

tbg commented Apr 1, 2019

It's curious that every time this test fails, there's a pending roachprod start invocation. For the last logs, the node being started is n4 (after having been killed moments earlier). I didn't find anything in any of the logs, except for this:

E190326 06:17:19.126750 346 sql/distsqlrun/flow_registry.go:224  [intExec=read orphaned table leases] flow id:9b544064-4968-49a6-9e94-6995eabba1ac : 1 inbound streams timed out after 10s; propagated error throughout flow
I190326 06:17:19.187538 375 sql/lease.go:1822  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:495315000 ext:63689178174 loc:<nil>}}}

cc @vivekmenezes / @jordanlewis could this error be responsible here? I'm thinking something about the case in which the node getting killed holds the table lease and for some reason doesn't manage to release it after the restart, blocking writes to the table for north of 30s and thus failing the test.

@tbg
Copy link
Member

tbg commented Apr 1, 2019

Oh hey I found the same error in the logs two up:

E190321 06:17:02.295548 387 sql/distsqlrun/flow_registry.go:230  [intExec=read orphaned table leases] flow id:2ea8fd60-fc32-4691-b82d-75f9fd793767 : 1 inbound streams timed out after 10s; propagated error throughout flow
I190321 06:17:02.365213 352 sql/lease.go:1815  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:696560000 ext:63688746141 loc:<nil>}}}
I190321 06:17:02.365218 351 sql/lease.go:1815  released orphaned table lease: {id:53 version:1 expiration:{Time:{wall:693236000 ext:63688746081 loc:<nil>}}}

@tbg tbg assigned vivekmenezes and unassigned tbg Apr 1, 2019
@vivekmenezes
Copy link
Contributor

vivekmenezes commented Apr 1, 2019

@tbg this looks like a situation where the "read orphaned table leases" query (ROTL) was attempted a few times and eventually succeeded. The results of that query were leases that were orphaned and were released. But I don't see how this could have blocked writes to the table. There is no schema change happening in this test. Besides all this logic is running asynchronously and not holding up the node from restarting.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c6df752eefe4609b8a5bbada0955f79a2cfb790e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217763&tab=buildLog

The test failed on master:
	bank.go:189,bank.go:501,acceptance.go:78,test.go:1228: dial tcp 35.231.201.221:26257: connect: connection refused
	cluster.go:1295,bank.go:247: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1217763-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1217763-acceptance: stopping and waiting............................................................: signal: killed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:776,test.go:762,bank.go:189,bank.go:501,acceptance.go:78,test.go:1228: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1217763-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 5268
		2: 6342
		3: 5244
		4: dead
		Error:  4: dead

@tbg
Copy link
Member

tbg commented Apr 3, 2019

roachprod stop c:4 got stuck in the test, which then triggered the verifyAccounts goroutine to fail:

err := retry.ForDuration(30*time.Second, func() error {
// Hold the read lock on the client to prevent it being restarted by
// chaos monkey.
client.RLock()
defer client.RUnlock()
err := client.db.QueryRow("SELECT count(*), sum(balance) FROM bank.accounts").Scan(&accounts, &sum)
if err != nil && !testutils.IsSQLRetryableError(err) {
t.Fatal(err)
}
return err
})
if err != nil {
t.Fatal(err)
}

The node actually died when the stop was invoked. Unclear to me why it didn't return. @petermattis any idea? The roachprod.log seems to indicate (if I read it correctly) that the process died quickly: https://gist.github.com/tbg/fc3e0b20eacceb7fc65aeff1438b2ccb

@petermattis
Copy link
Collaborator

I don't see anything that would account for the stuck roachprod stop c:4 command. The roachprod.log seems to indicate that pid 5244 is dead. There is no evidence of a subsequent attempt to start the node, either in roachprod.log or test.log or in the node's logs.

The problem ranges report in the debug.zip shows a number of under-replicated ranges, but no unavailable ones:

{
  "node_id": 1,
  "problems_by_node_id": {
    "1": {
      "underreplicated_range_ids": [
        2,
        3,
        5,
        8,
        11,
        12,
        14,
        15,
        18
      ]
    },
    "2": {
      "underreplicated_range_ids": [
        1,
        6,
        16
      ]
    },
    "3": {
      "underreplicated_range_ids": [
        19,
        20,
        21
      ]
    },
    "4": {
      "error_message": "initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.142.15.239:26257: connect: connection refused\""
    }
  }
}

I believe r21 corresponds to the bank table. It is under-replicated because n4 is down. n3 is the leaseholder for that range.

Hmm, this is curious in n3's logs:

W190403 06:15:06.281045 92 gossip/gossip.go:1501  [n3] first range unavailable; trying remaining resolvers

Might be nothing. The debug.zip indicates that n3 was the leaseholder for r1 and gossiped out the first range descriptor.

Seems like the error from verifyAccounts wasn't printed anywhere. Did I just miss it?

@petermattis
Copy link
Collaborator

I'm not seeing anything wedged in the goroutines on n1, but I believe the client connections have been closed by the time the debug.zip is generated. Nothing interesting on n3 either. Hrmm.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/6da68d7fe2c9a29b85e2ec0c7e545a0d6bdc4c5c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1226521&tab=buildLog

The test failed on release-19.1:
	bank.go:394,bank.go:498,acceptance.go:78,test.go:1237: stall detected at round 7, no forward progress for 30s
	cluster.go:1255,bank.go:248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1226521-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1226521-acceptance: starting............................: signal: killed
	cluster.go:953,context.go:90,cluster.go:942,asm_amd64.s:522,panic.go:397,test.go:785,test.go:775,bank.go:394,bank.go:498,acceptance.go:78,test.go:1237: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1226521-acceptance --oneshot --ignore-empty-nodes: exit status 1 3: 6345
		1: 5653
		4: dead
		2: 5448
		Error:  4: dead

@tbg
Copy link
Member

tbg commented Apr 23, 2019

The last one is a stall during roachprod start. Higher up we're seeing roachprod stop stall (which it never should assuming the test infra works). I suspect there's maybe something infra-i going on, but not only that.

I'm going to keep this open for now -- it reproduces every 5-10 days on average it seems, so we're due for another one soon.

@tbg tbg assigned andreimatei and unassigned vivekmenezes Apr 23, 2019
@tbg
Copy link
Member

tbg commented Apr 23, 2019

@andreimatei could you have an eye on this? The difficulty here, assuming it's not test infra, is observing what goes wrong during the start sequence.

Also potentially related to infra fix #37001

@ajwerner
Copy link
Contributor

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/73765b6d168fb999466756b112fd590747a3a8c4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266059&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:395,bank.go:499,acceptance.go:92,test.go:1253: stall detected at round 2, no forward progress for 30s
	cluster.go:1402,bank.go:249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1266059-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1266059-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:790,test.go:780,bank.go:395,bank.go:499,acceptance.go:92,test.go:1253: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1266059-acceptance --oneshot --ignore-empty-nodes: exit status 1 4: dead
		3: 5567
		2: 5408
		1: 5418
		Error:  4: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:92,test.go:1251: stall detected at round 5, no forward progress for 30s
	cluster.go:1402,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1268176-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1268176-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:92,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1268176-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 6216
		3: 5323
		4: dead
		2: 5803
		Error:  4: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/84dc682eca4b11e6abaf390fc8883f32afe81fb4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1283539&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 4, no forward progress for 30s
	cluster.go:1400,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1283539-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1283539-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1283539-acceptance --oneshot --ignore-empty-nodes: exit status 1 4: dead
		1: 5291
		3: 5386
		2: 5385
		Error:  4: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 4, no forward progress for 30s
	cluster.go:1400,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1290143-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1290143-acceptance: starting.............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1290143-acceptance --oneshot --ignore-empty-nodes: exit status 1 2: 5397
		3: 5676
		4: dead
		1: 6231
		Error:  4: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9671342fead0509bec0913bae4ae1f244660788e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1298500&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 5, no forward progress for 30s
	cluster.go:1440,bank.go:256: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1298500-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1298500-acceptance: stopping and waiting..............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1298500-acceptance --oneshot --ignore-empty-nodes: exit status 1 1: 6276
		2: 5711
		3: 6282
		4: dead
		Error:  4: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c280de40c2bcab93c41fe82bef8353a5ecd95ac4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1311970&tab=buildLog

The test failed on branch=master, cloud=gce:
	bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: stall detected at round 3, no forward progress for 30s
	cluster.go:1442,bank.go:257: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1311970-acceptance:4 returned:
		stderr:
		
		stdout:
		teamcity-1311970-acceptance: starting............................: signal: killed
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:778,bank.go:403,bank.go:507,acceptance.go:94,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1311970-acceptance --oneshot --ignore-empty-nodes: exit status 1 2: 5308
		1: 6504
		3: 5453
		4: dead
		Error:  4: dead
	test.go:1237: test timed out (10m0s)

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f1c9693da739fa5fc2c94d4d978fadd6710d17da

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1371441&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190703-1371441/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:68,test_runner.go:680: stall detected at round 2, no forward progress for 30s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/f74db5e81f8eaa190a41d708a9ccafb3eba9370a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1380786&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190710-1380786/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:68,test_runner.go:678: stall detected at round 6, no forward progress for 30s
	cluster.go:1691,bank.go:251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1562739066-02-n4cpu4:4 returned:
		stderr:
		
		stdout:
		teamcity-1562739066-02-n4cpu4: stopping and waiting..............................: signal: killed

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd27eb358f558bb7598945318240335ebcfcdf13

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=acceptance/bank/node-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1446993&tab=buildLog

The test failed on branch=provisional_201908202216_v19.2.0-beta.20190826, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190821-1446993/acceptance/bank/node-restart/run_1
	bank.go:398,bank.go:502,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s

@irfansharif
Copy link
Contributor

Running on bd27eb3 I ran into the following:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x5e70446]

goroutine 68 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc0008367e0, 0x74e0620, 0xc000956270)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:181 +0x121
panic(0x6983380, 0x92382e0)
        /usr/local/Cellar/go/1.12.7/libexec/src/runtime/panic.go:522 +0x1b5
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*IntAlloc).Freeze(0x0)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/quotapool/intpool.go:72 +0x26
main.(*cluster).Save(0xc000518780, 0x74e0620, 0xc0009562a0, 0xc0002a45a0, 0x4f, 0xc0000d0960)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:1067 +0xb7
main.(*testRunner).runWorker(0xc00022af00, 0x74e0620, 0xc000956270, 0xc0002062fc, 0x2, 0xc0009561b0, 0xc00027aa40, 0xc0004cc420, 0xc000956201, 0x6d1829e, ...)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:478 +0xb26
main.(*testRunner).Run.func2(0x74e0620, 0xc000956270)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:274 +0x226
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc00027aa70, 0xc0008367e0, 0xc00022e0e0)
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xfb
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
        /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xa8

This then prompts the dead node detection failure as seen by the roachtest.

@irfansharif
Copy link
Contributor

irfansharif commented Sep 20, 2019

Ok I have a handle on what's happening here and #38785 (TL;DR flaky networking
causes roachprod to hang). This turned out to be pretty difficult to reproduce.
I tried all variants of running locally on my laptop, locally on GCE worker,
100+ runs across multiple roachprod clusters, etc. (though through this I learned
that we have some neat tooling in house now). This coupled with #40830 for the
previous GCE failures made for a lot of lost time here.

Below I've included roachtest logs from all the failures I've been able to
generate:

16:42:15 test.go:182: test status: starting cluster
16:42:15 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod start irfansharif-flake:1
irfansharif-flake: starting
16:42:15 bank.go:453: round 2: client counts: (101)
16:42:16 bank.go:453: round 2: client counts: (127)
16:42:17 bank.go:453: round 2: client counts: (150)
irfansharif-flake: COCKROACH_DEV_LICENSE unset: enterprise features will be unavailable
16:42:57 bank.go:309: round 2: monkey sleeping while cluster recovers...
16:42:58 bank.go:316: round 2: not waiting for recovery due to signal that we're done
16:42:58 test.go:284: test failure: 	bank.go:431,bank.go:541,acceptance.go:69,test_runner.go:673: stall detected at round 2, no forward progress for 30s
16:42:58 bank.go:185: client 1 shutting down

Note the long stall between 16:42:17 and 16:42:57. This corresponds to our
roachprod start invocation, that just seems to have hanged. When looking at
cockroach.log and roachprod.log for irfansharif-flake:1, we see that we
did in fact start the node. journalctl.txt also tells us that an SSH
connection was made right around 16:42:15.

Here's another:

15:59:43 bank.go:270: round 34: restarting nodes [2]
15:59:43 bank.go:275: round 34: restarting 2
15:59:43 test.go:182: test status: stopping cluster
15:59:43 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod stop irfansharif-flake:2
irfansharif-flake: stopping and waiting
15:59:44 bank.go:453: round 34: client counts: (2749)
16:08:06 bank.go:185: client 1 shutting down 
16:05:56 test.go:284: test failure: 	test_runner.go:688: test timed out (10m0s)
16:08:06 test.go:284: [not the first failure] test failure: 	cluster.go:1664,bank.go:280: cluster.StartE: context canceled
16:08:06 bank.go:548: 2951 transfers (4.2/sec) in 700.0s

The out of order timestamps are funky but I think not a problem. There's
another roachprod stop stall seen here. Same idea, roachprod.log and
the corresponding cockroach.log for irfansharif-flake:2 tell us that the
node was in fact shut down. journalctl.txt also tells us that an SSH
connection was made right around 15:59:44.

Another one:

17:06:45 bank.go:321: round 31: cluster recovered
17:06:45 bank.go:270: round 32: restarting nodes [2]
17:16:21 bank.go:309: round 32: monkey sleeping while cluster recovers...
17:06:46 bank.go:453: round 32: client counts: (2641)
17:08:41 bank.go:185: client 1 shutting down
17:13:17 test.go:284: test failure: 	test_runner.go:688: test timed out (10m0s)
17:16:21 bank.go:316: round 32: not waiting for recovery due to signal that we're done
17:16:21 bank.go:548: 5484 transfers (7.2/sec) in 759.9s

Except this time roachprod stop didn't even show up in roachprod.log.
Looking at the corresponding cockroach.log the node wasn't stopped. We just
stalled completely. journalctl.txt showed no new SSH connections made
anywhere around 17:06:45.

I also saw this failure, which was unlike the ones above, but still related:

18:54:46 test.go:182: test status: starting cluster
18:54:46 cluster.go:315: > /Users/irfansharif/Software/src/github.com/cockroachdb/cockroach/bin/roachprod start irfansharif-flake:4
irfansharif-flake: starting..
18:54:48 bank.go:310: round 28: monkey sleeping while cluster recovers...
18:54:48 bank.go:305: round 28: no progress made by clients
18:54:49 bank.go:305: round 28: no progress made by clients
18:54:50 bank.go:305: round 28: no progress made by clients
18:54:51 bank.go:305: round 28: no progress made by clients
18:54:52 bank.go:305: round 28: no progress made by clients
18:54:53 bank.go:305: round 28: no progress made by clients
18:54:54 bank.go:305: round 28: no progress made by clients
18:54:55 bank.go:305: round 28: no progress made by clients
18:54:56 bank.go:305: round 28: no progress made by clients
18:54:57 bank.go:305: round 28: no progress made by clients
18:54:58 bank.go:305: round 28: no progress made by clients
18:54:59 bank.go:305: round 28: no progress made by clients
18:55:00 bank.go:305: round 28: no progress made by clients
18:55:01 bank.go:180: client 1 shutting down
18:55:01 test.go:284: test failure: 	bank.go:421,bank.go:542,acceptance.go:69,test_runner.go:673: after 15.7s: dial tcp 35.237.3.40:26257: connect: operation timed out
18:55:01 bank.go:317: round 28: not waiting for recovery due to signal that we're done
18:55:02 cluster.go:1303: running (fast) consistency checks on node 1

Here roachprod start seemed to have worked just fine, but a client timed out?


I was pretty sure at this point this was some random infra flake. I then tried
introducing network delays from my local mac (where I'm running
{roachtest,roachprod} from against a roachprod cluster). I used Network Link
Conditioner
and introduced
50-100ms of uplink/downlink delay right around when the roachtest invokes
roachtest {start,stop}. Now I'm able to immediately reproduce the failures
above. That is the client operation timed out errors, the test timed out
context cancellation errors, and most importantly, all the roachprod {start,stop}
errors. My current understanding of this is that because everything roachprod
does, it does through SSH, we're particularly susceptible to network delays, packet
drops, etc. We've seen this before, or at least pointed to this being a problem
before, over at the issues referring to #37001. For now I think setting a time
out around our calls to roachprod (basically (*cluster).Stop(ctx, ...)
would better surface these kind of errors.

@andreimatei
Copy link
Contributor

Very interesting about Network Link Conditioner helping to reproduce this. 100ms of delay really causes our ssh invocations to get in trouble?!?
Were you able to extract any kind of signature for these problems from the ssh logs? (I believe we enable both client-side and server-side ssh logging these days).

@irfansharif
Copy link
Contributor

There's a fair bit of tweaking and waiting around for it to happen, and I found increasing the test duration to a longer 5 minutes or so more conducive to shuffle these out. Dropped packets help, higher latencies help, but yes even 100ms if timed just right can make things go wonky. I've not seen this ssh logging, I was working off an older branch. I'll check it out.

@irfansharif
Copy link
Contributor

(cross-posting from #38785 (comment)) For some more confirmation bias that it's flakey infrastructure at play:

Also if you notice, the failures for cluster-recovery and node-restart as posted above and over at #35326 happened in quick succession. (10th Jul, 02:14 and 10th Jul '19, 02:28; 21st Aug, 01:14 and 21st Aug, 01:14)

craig bot pushed a commit that referenced this issue Sep 24, 2019
40997: roachtest: deflake bank/{node-restart,cluster-recovery} r=irfansharif a=irfansharif

Fixes #38785.
Fixes #35326.

Because everything roachprod does, it does through SSH, we're
particularly susceptible to network delays, packet drops, etc. We've
seen this before, or at least pointed to this being a problem before,
over at #37001. Setting timeouts around our calls to roachprod helps to
better surface these kind of errors.

The underlying issue in #38785 and in #35326 is the fact that we're
running roachprod commands that may (reasonably) fail due to connection
issues, and we're unable to retry them safely (the underlying commands
are non-idempotent). Presently we simply fail the entire test, when
really we should be able to retry the commands. This is left unaddressed.

Release justification: Category 1: Non-production code changes
Release note: None

41029: cli: fix the demo licensing code r=rohany a=knz

Fixes #40734.
Fixes #41024.

Release justification: fixes a flaky test, fixes UX of main new feature

Before this patch, there were multiple problems with the code:

- if the license acquisition was disabled by the env var config,
  the error message would not be clear.
- the licensing code would deadlock silently on OSS-only
  builds (because the license failure channel was not written in that
  control branch).
- the error/warning messages would be interleaved on the same line as
  the input line (missing newline at start of message).
- the test code would fail when the license server is not available.
- the set up of the example database and workload would be performed
  asynchronously, with unclear signalling of when the user
  can expect to use them interactively.

After this patch:
- it's possible to override the license acquisition URL with
  COCKROACH_DEMO_LICENSE_URL, this is used in tests.
- setting up the example database, partitioning and workload is done
  before presenting the interactive prompt.
- partitioning the example database, if requested by
  --geo-partitioned-replicas, waits for license acquisition to
  complete (license acquisition remains asynchronous otherwise).
- impossible configurations are reported early(earlier).

For example:

- OSS-only builds:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
*
* ERROR: enterprise features are required for this demo, cannot run from OSS-only binary
*
Failed running "demo"
```

For license acquisition failures:

```
kena@kenax ~/cockroach % ./cockroach demo --geo-partitioned-replicas
error while contacting licensing server:
Get https://192.168.2.170/api/license?clusterid=5548b310-14b7-46de-8c92-30605bfe95c4&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
*
* ERROR: license acquisition was unsuccessful.
* Note: enterprise features are needed for --geo-partitioned-replicas.
*
Failed running "demo"
```

Additionally, this change fixes test flakiness that arises from an
unavailable license server.

Release note (cli change): To enable uses of `cockroach demo` with
enterprise features in firewalled network environments, it is now
possible to redirect the license acquisition with the environment
variable COCKROACH_DEMO_LICENSE_URL to a replacement server (for
example a suitably configured HTTP proxy).

Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@irfansharif irfansharif self-assigned this Sep 24, 2019
@craig craig bot closed this as completed in 5296a27 Sep 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants