roachtest: tpcc/mixed-headroom/n5cpu16 failed #36983

Closed
cockroach-teamcity opened this issue Apr 21, 2019 · 60 comments · Fixed by #38465 or #38632
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/df200cbf3f407dbf349aa601ff9036b4dff88e83

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1252822&tab=buildLog

The test failed on release-2.1:
	cluster.go:1329,tpcc.go:168,cluster.go:1667,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1252822-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		 60.8     79.7     96.5    117.4 newOrder
		     33s        0           19.0           17.0     11.5     17.8     19.9     19.9 orderStatus
		     33s        0          201.8          170.0     37.7     71.3     83.9    104.9 payment
		     33s        0           16.0           17.6     83.9    570.4    570.4    570.4 stockLevel
		     34s        0           22.0           16.1     75.5    121.6    142.6    142.6 delivery
		     34s        0          201.8           92.8     67.1    109.1    113.2    117.4 newOrder
		     34s        0           24.0           17.2     12.1     26.2     31.5     31.5 orderStatus
		     34s        0          205.8          171.0     39.8     67.1     79.7     92.3 payment
		     34s        0           19.0           17.6     92.3    520.1    570.4    570.4 stockLevel
		Error: error in payment: select by customer idfail: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1688,tpcc.go:178,tpcc.go:267,test.go:1237: Goexit() was called

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Apr 21, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 21, 2019
@tbg
Member

tbg commented Apr 23, 2019

@jordanlewis mind taking a look?

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	tpcc.go:123,tpcc.go:267,test.go:1245: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: A9E91C0C130F8392</li>
		<li>HostId: R3/pgD6niu0QQTb+E80EJYIm7Un2C/lSNiCEiI0SslrfGkytC1WPdXuzjJjnt5MZyfbkvx33sYY=</li>
		</ul>
		<hr/>
		</body>
		</html>

@jordanlewis
Member

Hmm, the original failure is an odd one. I think it means there was an error in a statement in a transaction but the client didn't do the right thing and roll back - instead it tried to do another statement as normal. @andreimatei does that match your guess?
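For context on what `Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART"` means, here is a minimal sketch of the client-side retry protocol, assuming a generic database/sql client (this is not the workload's actual code; the driver choice and the isRetryable helper are placeholders):

```go
// A minimal sketch of the SAVEPOINT cockroach_restart retry protocol.
// NOT the workload's actual code; driver and isRetryable are assumptions.
package example

import (
	"database/sql"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

// runWithRestart runs fn inside a transaction using the
// SAVEPOINT cockroach_restart protocol.
func runWithRestart(db *sql.DB, fn func(*sql.Tx) error) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if _, err := tx.Exec("SAVEPOINT cockroach_restart"); err != nil {
		tx.Rollback()
		return err
	}
	for {
		err := fn(tx)
		if err == nil {
			// RELEASE performs the commit check; Commit then finalizes.
			if _, err = tx.Exec("RELEASE SAVEPOINT cockroach_restart"); err == nil {
				return tx.Commit()
			}
		}
		if !isRetryable(err) {
			// The transaction is aborted: the only valid next statement is a
			// rollback. Issuing another ordinary statement instead is what
			// produces SQLSTATE 25P02, as seen in the failure above.
			tx.Rollback()
			return err
		}
		// Retryable error (SQLSTATE 40001): rewind to the savepoint and retry.
		if _, err := tx.Exec("ROLLBACK TO SAVEPOINT cockroach_restart"); err != nil {
			tx.Rollback()
			return err
		}
	}
}

// isRetryable is a stand-in for inspecting the error's SQLSTATE
// (40001 means the transaction can be retried at the savepoint).
func isRetryable(err error) bool { return false }
```

The suspected bug is the non-retryable branch: after a statement error the client has to roll back before doing anything else, rather than sending the next statement as normal.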

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
	tpcc.go:123,tpcc.go:267,test.go:1253: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: F223A4DE9F314A95</li>
		<li>HostId: HHf2tR76HnOY317Mh7nW3N6quUIfLkWrkptnO+McZYQKp0Pvwd5o68awtyeC19r18z3D3yNm4qU=</li>
		</ul>
		<hr/>
		</body>
		</html>

@andreimatei
Contributor

I think it means there was an error in a statement in a transaction but the client didn't do the right thing and roll back - instead it tried to do another statement as normal. @andreimatei does that match your guess?

Yup

The other failures are about us not having updated some infrastructure for 19.2 I think. I'll deal with it.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/efb45869b242137e5c178b10c646c3ed025fff36

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1266041&tab=buildLog

The test failed on branch=master, cloud=gce:
	tpcc.go:123,tpcc.go:267,test.go:1253: while fetching vno-pred: unexpected HTTP response from https://binaries.cockroachdb.com/cockroach-vno-pred.linux-amd64.tgz?ci=true: 404
		<html>
		<head><title>404 Not Found</title></head>
		<body>
		<h1>404 Not Found</h1>
		<ul>
		<li>Code: NoSuchKey</li>
		<li>Message: The specified key does not exist.</li>
		<li>Key: cockroach-vno-pred.linux-amd64.tgz</li>
		<li>RequestId: AABE61EA080F07BD</li>
		<li>HostId: o9EaVdetXcU5Ln+9PVItfIIZwLhJrH1vqeknSKysN0g5mw5t+kb2dtAcb2bP+A3xmNgaOXLR6Do=</li>
		</ul>
		<hr/>
		</body>
		</html>

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dab475d78289721714b9c3b5b649dbb4794fc01f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268212&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1476,tpcc.go:168,cluster.go:1814,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1268212-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		 0           17.0            9.5     13.1     26.2     26.2     26.2 orderStatus
		   4m28s        0          172.0           92.6     39.8     60.8     71.3     75.5 payment
		   4m28s        0           10.0            9.4    176.2    771.8    771.8    771.8 stockLevel
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		   4m29s        0           17.0            9.4     79.7    100.7    109.1    109.1 delivery
		   4m29s        0          174.0           87.0     62.9     83.9     96.5    113.2 newOrder
		   4m29s        0           18.0            9.5     10.5     17.8     28.3     28.3 orderStatus
		   4m29s        0          190.0           93.0     37.7     58.7     67.1     79.7 payment
		   4m29s        0           20.0            9.5    192.9    771.8    771.8    771.8 stockLevel
		Error: error in payment: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1835,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d554884a4e474cc06213230d5ba7d757a88e9e46

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286052&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1286052-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		   123.9           56.5     62.9     92.3    109.1    113.2 newOrder
		   2m59s        0            9.0            6.4      9.4     15.7     15.7     15.7 orderStatus
		   2m59s        0          124.9           65.3     35.7     50.3     71.3     75.5 payment
		   2m59s        0            7.0            6.4    142.6    704.6    704.6    704.6 stockLevel
		    3m0s        0           19.0            6.3     71.3     92.3     92.3     92.3 delivery
		    3m0s        0          121.1           56.8     52.4     75.5     88.1     96.5 newOrder
		    3m0s        0           17.0            6.4     14.7     21.0     30.4     30.4 orderStatus
		    3m0s        0          120.1           65.6     28.3     46.1     62.9     79.7 payment
		    3m0s        0           13.0            6.5    201.3    771.8    771.8    771.8 stockLevel
		Error: error in payment: ERROR: Expected "ROLLBACK TO SAVEPOINT COCKROACH_RESTART": current transaction is aborted, commands ignored until end of transaction block (SQLSTATE 25P02)
		Error:  exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/46b82eb637a3f6a1ae563fcad7448fa5458f8298

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290161&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1290161-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		tatus reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3404, received 256784 bytes, in 443.1 seconds
		Bytes per second: sent 7.7, received 579.5
		debug1: Exit status 1
		: exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@jordanlewis
Member

I think this is a bug in the TPCC load generator, not Cockroach, but I'm not sure.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1290143-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		atus reply 0
		debug3: receive packet: type 96
		debug2: channel 0: rcvd eof
		debug2: channel 0: output open -> drain
		debug2: channel 0: obuf empty
		debug2: channel 0: close_write
		debug2: channel 0: output drain -> closed
		debug3: receive packet: type 97
		debug2: channel 0: rcvd close
		debug3: channel 0: will not send data after close
		debug2: channel 0: almost dead
		debug2: channel 0: gc: notify user
		debug2: channel 0: gc: user detached
		debug2: channel 0: send close
		debug3: send packet: type 97
		debug2: channel 0: is dead
		debug2: channel 0: garbage collecting
		debug1: channel 0: free: client-session, nchannels 1
		debug3: channel 0: status: The following connections are open:
		  #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cc -1)
		
		debug3: send packet: type 1
		debug1: fd 0 clearing O_NONBLOCK
		debug1: fd 1 clearing O_NONBLOCK
		debug3: fd 2 is not O_NONBLOCK
		Transferred: sent 3392, received 127788 bytes, in 220.2 seconds
		Bytes per second: sent 15.4, received 580.4
		debug1: Exit status 1
		: exit status 1
		: exit status 1
	cluster.go:1833,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@tbg
Member

tbg commented May 17, 2019

Error: error in stockLevel: ERROR: communication error: rpc error: code = Canceled desc = context canceled (SQLSTATE 08006)

I need to move the ssh debug output above the error or it'll be annoying to debug these going forward.

@jordanlewis
Member

@tbg do we expect this test to pass? I'm still not really sure what it does - that error means the cluster was overloaded enough for a time that a query failed.

@tbg
Member

tbg commented May 17, 2019

We expect this to pass; the headroom test intentionally runs with a much lower warehouse count than what we think we support (1300+).

14:14:16 tpcc.go:271: computed headroom warehouses of 909; running mixed with v19.1.0-rc.4

@jordanlewis
Member

Oh, mixed is about versions, not headroom. OK.

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity

This comment has been minimized.

@nvanbenschoten
Member

Previous two issues addressed by #37701.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1810a4eaa07b412b2d0899d25bb16a28a2746d48

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1300948&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1510,tpcc.go:168,cluster.go:1848,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1300948-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		1.6    121.6 delivery
		1h35m33s        0          209.9          194.8     50.3     88.1    104.9    109.1 newOrder
		1h35m33s        0           18.0           19.5      8.4     15.2     15.7     15.7 orderStatus
		1h35m33s        0          196.9          195.2     23.1     52.4     58.7     92.3 payment
		1h35m33s        0           29.0           19.5     46.1     88.1     92.3     92.3 stockLevel
		1h35m34s        0           21.0           19.5     65.0    109.1    109.1    109.1 delivery
		1h35m34s        0          204.0          194.8     48.2     75.5     83.9    104.9 newOrder
		1h35m34s        0           19.0           19.5      9.4     16.3     41.9     41.9 orderStatus
		1h35m34s        0          207.0          195.2     22.0     37.7     48.2     67.1 payment
		1h35m34s        0           19.0           19.5     48.2     92.3    100.7    100.7 stockLevel
		Error: error in newOrder: missing stock row
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_34.74.202.225_2019-05-22T16:23:53Z: exit status 1
		: exit status 1
	cluster.go:1869,tpcc.go:178,tpcc.go:272,test.go:1251: Goexit() was called

@nvanbenschoten
Member

error in newOrder: missing stock row

@ajwerner this looks like the issue you just fixed. Has that change been backported to 19.1 yet?

@ajwerner
Contributor

s/@ajwerner /@ajkr /

@cockroach-teamcity

This comment has been minimized.

@jeffrey-xiao
Contributor

The most recent failures are due to the non-nullable fields in RangeDescriptor: the old nodes fail the CPut because they don't preserve unknown fields.
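To make the mechanism concrete, here is a self-contained toy sketch; the types and encoders are hypothetical stand-ins for illustration, not CockroachDB's actual roachpb/proto code:

```go
// Toy illustration: descriptor updates use a conditional put whose expected
// value is the re-marshaled in-memory copy, so a node that drops fields it
// doesn't know can never match bytes written by a newer node.
// Hypothetical types, not the real RangeDescriptor code.
package example

import "bytes"

// oldDesc is what an old-version node understands; newDesc adds a field that
// the old node's decoder discards instead of preserving as unknown bytes.
type oldDesc struct{ RangeID byte }
type newDesc struct {
	RangeID   byte
	StickyBit byte // new field, written by new-version nodes
}

// Stand-ins for proto.Marshal of the two schemas.
func encodeOld(d oldDesc) []byte { return []byte{d.RangeID} }
func encodeNew(d newDesc) []byte { return []byte{d.RangeID, d.StickyBit} }

// cput mimics a conditional put: it succeeds only if the stored bytes equal
// the caller's expected bytes.
func cput(stored, expected []byte) bool { return bytes.Equal(stored, expected) }

func mixedVersionSplit() bool {
	// A new-version leaseholder splits the range and persists the descriptor
	// including the new field.
	stored := encodeNew(newDesc{RangeID: 1, StickyBit: 1})

	// The lease moves to an old-version node, which decodes the descriptor,
	// silently drops the unknown field, and later re-encodes its in-memory
	// copy as the expected value for the next descriptor CPut.
	expected := encodeOld(oldDesc{RangeID: 1})

	// The bytes differ, so every subsequent descriptor update on the old
	// node fails its CPut.
	return cput(stored, expected) // false
}
```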

@jeffrey-xiao jeffrey-xiao self-assigned this Jun 26, 2019
craig bot pushed a commit that referenced this issue Jun 26, 2019
38465: roachpb: Change RangeDescriptor.StickyBit and ReplicaDescriptor.Type to nullable r=jeffrey-xiao a=jeffrey-xiao

On a mixed-version cluster, having #38302 alone does not guarantee backwards compatibility for non-nullable fields in RangeDescriptor. Suppose the replicas of a range are of mixed versions and the current leaseholder of the range is on the newest version. If the leaseholder splits that range and writes to the non-nullable fields in RangeDescriptor, and the lease then transfers to a node on an older version without #38302, then all CPuts would fail on that node.

We must guarantee that #38302 is on all nodes before making the fields in RangeDescriptor non-nullable.

Fixes #36983.
Fixes #38471.

Release note: None

Co-authored-by: Jeffrey Xiao <[email protected]>
@cockroach-teamcity

This comment has been minimized.

craig bot pushed a commit that referenced this issue Jun 27, 2019
@craig craig bot closed this as completed in #38465 Jun 27, 2019
@jeffrey-xiao jeffrey-xiao reopened this Jun 27, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364821-tpcc-mixed-headroom-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190628 20:39:25.890763 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190628 20:39:58.457979 100 ccl/workloadccl/fixture.go:476  loaded 53 KiB table warehouse in 32.566814041s (1000 rows, 0 index entries, 1.6 KiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: context canceled

@cockroach-teamcity

This comment has been minimized.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-tpcc-mixed-headroom-n5cpu16:5 -- ./workload fixtures load tpcc --warehouses=1000  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190630 21:25:37.588769 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190630 21:25:38.663084 122 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 1.073816944s (100000 rows, 0 index entries, 7.2 MiB)
		I190630 21:25:48.892337 116 ccl/workloadccl/fixture.go:476  loaded 53 KiB table warehouse in 11.303133494s (1000 rows, 0 index entries, 4.7 KiB)
		I190630 21:27:52.817663 117 ccl/workloadccl/fixture.go:476  loaded 1006 KiB table district in 2m15.228537539s (10000 rows, 0 index entries, 7.4 KiB)
		I190630 21:36:49.056856 119 ccl/workloadccl/fixture.go:476  loaded 4.3 GiB table history in 11m11.467897932s (30000000 rows, 60000000 index entries, 6.5 MiB)
		I190630 21:39:49.067750 120 ccl/workloadccl/fixture.go:476  loaded 1.3 GiB table order in 14m11.478710462s (30000000 rows, 30000000 index entries, 1.6 MiB)
		I190630 21:40:13.594779 121 ccl/workloadccl/fixture.go:476  loaded 126 MiB table new_order in 14m36.0054959s (9000000 rows, 0 index entries, 147 KiB)
		: signal: killed
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,tpcc.go:271,test.go:1249: context canceled

@cockroach-teamcity

This comment has been minimized.

craig bot pushed a commit that referenced this issue Jul 2, 2019
38593: storage: sort InternalReplicas before comparing r=jeffrey-xiao a=jeffrey-xiao

Fixes an issue where the in-memory copy of RangeDescriptor is out-of-sync with the copy in kv.

Should resolve the failure in this #36983 (comment)

The import step of `tpcc/mixed-headroom/n5cpu16` passes on `2811e0f` + #38465 + this PR.

cc @asubiotto 

Co-authored-by: Jeffrey Xiao <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequests is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
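As a concrete picture of the error path being fixed, here is a minimal sketch; the quotaPool and propose names are placeholders for illustration, not the storage package's real API:

```go
// Minimal sketch of the invariant the fix restores: quota acquired for a
// proposal must be returned on every exit path, including when the propose
// step itself fails. Hypothetical types, not the storage package's real API.
package example

import "errors"

type quotaPool struct{ available int64 }

func (q *quotaPool) acquire(n int64) { q.available -= n } // blocks in the real pool
func (q *quotaPool) release(n int64) { q.available += n }

type proposal struct{ size int64 }

// propose stands in for Replica.propose; it can fail, e.g. on an invalid
// ChangeReplicasTrigger as in the error quoted above.
func propose(p proposal) error { return errors.New("proposal rejected") }

// submit shows the corrected flow. Before the fix, the early return on a
// propose error skipped the release, leaking p.size bytes of quota; once
// quota leaks, a later large request (such as an AddSSTRequest that needs
// the whole pool) can never acquire it and the range stalls.
func submit(q *quotaPool, p proposal) error {
	q.acquire(p.size)
	if err := propose(p); err != nil {
		q.release(p.size) // the step that was lost in the rewrite
		return err
	}
	// On success, quota is released later, once the command is applied
	// (elided here).
	return nil
}
```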
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/2c865eeb3e3b244468ffc509a62778bd1f46740f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=tpcc/mixed-headroom/n5cpu16 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1370685&tab=buildLog

The test failed on branch=provisional_201907021644_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1370685-tpcc-mixed-headroom-n5cpu16:5 -- ./workload run tpcc --warehouses=909 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		      0.0      0.0      0.0 delivery
		   4m25s        0            0.0           17.4      0.0      0.0      0.0      0.0 newOrder
		   4m25s        0            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   4m25s        0            0.0           18.6      0.0      0.0      0.0      0.0 payment
		   4m25s        0            0.0            0.2      0.0      0.0      0.0      0.0 stockLevel
		   4m26s        0            0.0            1.9      0.0      0.0      0.0      0.0 delivery
		   4m26s        0            0.0           17.4      0.0      0.0      0.0      0.0 newOrder
		   4m26s        0            0.0            2.1      0.0      0.0      0.0      0.0 orderStatus
		   4m26s        0            0.0           18.6      0.0      0.0      0.0      0.0 payment
		   4m26s        0            0.0            0.2      0.0      0.0      0.0      0.0 stockLevel
		Error: error in payment: EOF
		Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.202.76.107_2019-07-03T05:49:56Z: exit status 1
		: exit status 1
	cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: unexpected node event: 3: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1370685-tpcc-mixed-headroom-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3748
		4: 3825
		3: dead
		1: 3953
		Error:  3: dead

@nvanbenschoten
Member

nvanbenschoten commented Jul 3, 2019

n3 died but we failed to grab any logs from the nodes. We just see context canceled all the way down:

13:52:24 cluster.go:992: fetching debug zip
13:52:24 cluster.go:344: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod ssh teamcity-1370685-tpcc-mixed-headroom-n5cpu16:1 -- ./cockroach debug zip --url {pgurl:1} debug.zip
13:52:24 cluster.go:1006: ./cockroach debug zip failed: 
13:52:24 test.go:1193: failed to download debug zip: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod ssh teamcity-1370685-tpcc-mixed-headroom-n5cpu16:1 -- ./cockroach debug zip --url {pgurl:1} debug.zip: context canceled

It looks like that's because there was an 8-hour gap between detecting the crash and trying to grab any artifacts:

05:54:36 test.go:808: test failure: 	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,tpcc.go:271,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1370685-tpcc-mixed-headroom-n5cpu16 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3748
		4: 3825
		3: dead
		1: 3953
		Error:  3: dead
05:54:48 cluster.go:1102: running (fast) consistency checks on node 2
13:52:08 cluster.go:270: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod destroy teamcity-1370685-tpcc-mixed-headroom-n5cpu16
