roachtest: scrub/all-checks/tpcc/w=100 failed #37108

Closed

cockroach-teamcity opened this issue Apr 25, 2019 · 23 comments · Fixed by #38632
@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/7109d291e3b9edfa38264361f832cec14fff66ee

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1259219&tab=buildLog

The test failed on release-19.1:
	scrub.go:83,cluster.go:1761,errgroup.go:57: dial tcp 35.188.39.106:26257: connect: connection refused
	cluster.go:1423,tpcc.go:168,cluster.go:1761,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1259219-scrub-all-checks-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  10m49s    26544            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  10m49s    26544            0.0            0.5      0.0      0.0      0.0      0.0 newOrder
		  10m49s    26544            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  10m49s    26544            0.0            0.3      0.0      0.0      0.0      0.0 payment
		  10m49s    26544            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  10m50s    26544            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  10m50s    26544            0.0            0.5      0.0      0.0      0.0      0.0 newOrder
		  10m50s    26544            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  10m50s    26544            0.0            0.3      0.0      0.0      0.0      0.0 payment
		  10m50s    26544            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1782,tpcc.go:178,scrub.go:58,test.go:1245: Goexit() was called
	cluster.go:1016,context.go:89,cluster.go:1005,asm_amd64.s:522,panic.go:397,test.go:790,test.go:776,cluster.go:1782,tpcc.go:178,scrub.go:58,test.go:1245: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1259219-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		1: dead
		3: 3213
		2: 3189
		4: 3156
		Error:  1: dead

@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Apr 25, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 25, 2019
@andreimatei
Contributor

Node 1 OOMed.

Check out the heap profile: we have like 2GB in ScanResponses.
memprof.fraction_system_memory.000000013453201408_2019-04-25T01_48_08.15.zip
Screenshot 2019-04-25 10 41 56

@andreimatei
Contributor

cc @lucy-zhang

@andreimatei
Contributor

Note that this test seems to be running with only 100 warehouses now, so the data is much smaller than it used to be. So the OOM is really surprising, I guess.

@thoszhang
Contributor

It seems like there was something very wrong with this cluster even before the SCRUB runs started. Here are the workload stats at the end of the tpcc ramp period:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   4m57s        0            1.0            0.4  40802.2  40802.2  40802.2  40802.2 delivery
   4m57s        0            0.0            4.1      0.0      0.0      0.0      0.0 newOrder
   4m57s        0            0.0            0.5      0.0      0.0      0.0      0.0 orderStatus
   4m57s        0            0.0            3.2      0.0      0.0      0.0      0.0 payment
   4m57s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   4m58s        0            0.0            0.4      0.0      0.0      0.0      0.0 delivery
   4m58s        0            1.0            4.1 103079.2 103079.2 103079.2 103079.2 newOrder
   4m58s        0            2.0            0.5   4563.4   6710.9   6710.9   6710.9 orderStatus
   4m58s        0            0.0            3.2      0.0      0.0      0.0      0.0 payment
   4m58s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
   4m59s        0            2.0            0.4  57982.1  81604.4  81604.4  81604.4 delivery
   4m59s        0            3.0            4.1  26843.5  66572.0  66572.0  66572.0 newOrder
   4m59s        0            0.0            0.5      0.0      0.0      0.0      0.0 orderStatus
   4m59s        0            1.0            3.2  16643.0  16643.0  16643.0  16643.0 payment
   4m59s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
    5m0s        0            0.0            0.4      0.0      0.0      0.0      0.0 delivery
    5m0s        0            1.0            4.1  38654.7  38654.7  38654.7  38654.7 newOrder
    5m0s        0            0.0            0.5      0.0      0.0      0.0      0.0 orderStatus
    5m0s        0            3.0            3.2  38654.7 103079.2 103079.2 103079.2 payment
    5m0s        0            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dff4132a80e62c6c5ad603ff6c608b09419d4e3e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1264632&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1835,tpcc.go:178,scrub.go:58,test.go:1253: unexpected node event: 3: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:790,test.go:776,cluster.go:1835,tpcc.go:178,scrub.go:58,test.go:1253: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1264632-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		3: dead
		4: 4043
		2: 4129
		1: 4274
		Error:  3: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a65ec243adaa11ea951f02319dc2e02463461ab2

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1290143&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1474,tpcc.go:168,cluster.go:1812,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1290143-scrub-all-checks-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		l
		_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
		  16m13s    25997            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  16m13s    25997            0.0            0.6      0.0      0.0      0.0      0.0 newOrder
		  16m13s    25997            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  16m13s    25997            0.0            0.7      0.0      0.0      0.0      0.0 payment
		  16m13s    25997            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  16m14s    25997            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  16m14s    25997            0.0            0.6      0.0      0.0      0.0      0.0 newOrder
		  16m14s    25997            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  16m14s    25997            1.0            0.7 103079.2 103079.2 103079.2 103079.2 payment
		  16m14s    25997            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1833,tpcc.go:178,scrub.go:58,test.go:1251: unexpected node event: 2: dead
	cluster.go:1038,context.go:89,cluster.go:1027,asm_amd64.s:522,panic.go:397,test.go:788,test.go:774,cluster.go:1833,tpcc.go:178,scrub.go:58,test.go:1251: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1290143-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		3: 4134
		4: 4106
		1: 4166
		Error:  2: dead

@nvanbenschoten
Member

Previous two issues addressed by #37701.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8892e379d84a36b29003420189edd1e10db41d71

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1328407&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1328407-scrub-all-checks-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		   27359            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  17m54s    27359            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  17m54s    27359            1.0            0.6  38654.7  38654.7  38654.7  38654.7 newOrder
		  17m54s    27359            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  17m54s    27359            2.0            0.5 103079.2 103079.2 103079.2 103079.2 payment
		  17m54s    27359            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		  17m55s    27359            0.0            0.1      0.0      0.0      0.0      0.0 delivery
		  17m55s    27359            0.0            0.6      0.0      0.0      0.0      0.0 newOrder
		  17m55s    27359            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		  17m55s    27359            1.0            0.5 103079.2 103079.2 103079.2 103079.2 payment
		  17m55s    27359            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: unexpected node event: 4: dead
	cluster.go:1035,context.go:87,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1328407-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		4: dead
		1: 4154
		2: 4289
		3: 4126
		Error:  4: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cbd571c7bf2ffad3514334a410fa7a728b1f5bf0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1330352&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	scrub.go:80,cluster.go:1851,errgroup.go:57: dial tcp 35.196.209.187:26257: connect: connection refused
	cluster.go:1513,tpcc.go:169,cluster.go:1851,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1330352-scrub-all-checks-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		     0.0 newOrder
		   5m42s    59989            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		   5m42s    59989            0.0            0.5      0.0      0.0      0.0      0.0 payment
		   5m42s    59989            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190608 14:42:53.003475 1 workload/cli/run.go:425  error in newOrder: dial tcp 10.142.0.144:26257: connect: connection refused
		   5m43s    98343            1.0            0.1  38654.7  38654.7  38654.7  38654.7 delivery
		   5m43s    98343            3.0            0.4  36507.2  36507.2  36507.2  36507.2 newOrder
		   5m43s    98343            0.0            0.1      0.0      0.0      0.0      0.0 orderStatus
		   5m43s    98343            2.0            0.5   6174.0 103079.2 103079.2 103079.2 payment
		   5m43s    98343            0.0            0.0      0.0      0.0      0.0      0.0 stockLevel
		E190608 14:42:54.003542 1 workload/cli/run.go:425  error in newOrder: dial tcp 10.142.0.144:26257: connect: connection refused
		: signal: killed
	cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: Goexit() was called
	cluster.go:1035,context.go:87,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1330352-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		3: 4290
		2: 4095
		1: dead
		4: 4196
		Error:  1: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/cee0b362d9f982446444b4a704db86e0cbac7c6d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1339390&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	scrub.go:80,cluster.go:1851,errgroup.go:57: dial tcp 35.237.184.40:26257: connect: connection refused
	cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: Goexit() was called
	cluster.go:1035,context.go:87,cluster.go:1024,asm_amd64.s:522,panic.go:397,test.go:785,test.go:771,cluster.go:1872,tpcc.go:179,scrub.go:55,test.go:1248: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1339390-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: 3976
		3: 3282
		4: 3370
		1: dead
		Error:  1: dead

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c2d46a6f4f37ae96398230b4c8893e6683b478d5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1362380&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1511,tpcc.go:167,cluster.go:1849,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1362380-scrub-all-checks-tpcc-w-100:5 -- ./workload run tpcc --warehouses=100 --histograms=logs/stats.json --wait=false --tolerate-errors --ramp=5m0s --duration=30m0s {pgurl:1-4} returned:
		stderr:
		
		stdout:
		   25691            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		  28m51s    25691            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		  28m51s    25691            0.0           16.2      0.0      0.0      0.0      0.0 newOrder
		  28m51s    25691            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  28m51s    25691            0.0           16.1      0.0      0.0      0.0      0.0 payment
		  28m51s    25691            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		  28m52s    25691            0.0            1.6      0.0      0.0      0.0      0.0 delivery
		  28m52s    25691            0.0           16.2      0.0      0.0      0.0      0.0 newOrder
		  28m52s    25691            0.0            1.6      0.0      0.0      0.0      0.0 orderStatus
		  28m52s    25691            0.0           16.1      0.0      0.0      0.0      0.0 payment
		  28m52s    25691            0.0            1.5      0.0      0.0      0.0      0.0 stockLevel
		: signal: killed
	cluster.go:1870,tpcc.go:177,scrub.go:53,test.go:1249: unexpected node event: 2: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,scrub.go:53,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1362380-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		4: 3418
		1: 3862
		3: 3487
		Error:  2: dead

@nvanbenschoten
Member

fatal error: runtime: out of memory, just like above.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9693537af8c03f48344fb7773916af2113b0406c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364839&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1870,tpcc.go:177,scrub.go:53,test.go:1249: unexpected node event: 2: dead
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,tpcc.go:177,scrub.go:53,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1364839-scrub-all-checks-tpcc-w-100 --oneshot --ignore-empty-nodes: exit status 1 5: skipped
		2: dead
		4: 4128
		1: 3640
		3: 3452
		Error:  2: dead

@nvanbenschoten
Member

^ fatal error: runtime: out of memory, same thing.

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1364443-scrub-all-checks-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190628 10:32:37.003698 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190628 10:32:38.321665 61 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 1.317764465s (100 rows, 0 index entries, 3.9 KiB)
		I190628 10:32:40.064446 98 ccl/workloadccl/fixture.go:476  loaded 11 MiB table new_order in 3.060415496s (900000 rows, 0 index entries, 3.6 MiB)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1367379-scrub-all-checks-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190630 19:32:13.420684 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scrub/all-checks/tpcc/w=100 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1368144-scrub-all-checks-tpcc-w-100:5 -- ./workload fixtures load tpcc --warehouses=100  {pgurl:1} returned:
		stderr:
		
		stdout:
		I190701 20:53:40.096824 1 ccl/workloadccl/cliccl/fixtures.go:293  starting load of 9 tables
		I190701 20:53:42.724934 88 ccl/workloadccl/fixture.go:476  loaded 11 MiB table new_order in 2.627582584s (900000 rows, 0 index entries, 4.2 MiB)
		I190701 20:54:26.592491 89 ccl/workloadccl/fixture.go:476  loaded 7.8 MiB table item in 46.495077135s (100000 rows, 0 index entries, 171 KiB)
		I190701 20:55:19.705835 83 ccl/workloadccl/fixture.go:476  loaded 5.1 KiB table warehouse in 1m39.608444433s (100 rows, 0 index entries, 52 B)
		: signal: interrupt
	cluster.go:1587,cluster.go:1606,cluster.go:1710,cluster.go:1093,context.go:122,cluster.go:1090,panic.go:406,test.go:783,test.go:769,cluster.go:1511,tpcc.go:156,tpcc.go:158,scrub.go:53,test.go:1249: context canceled

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.

Release note: None
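
For context, here is a minimal, self-contained Go sketch of the failure mode described in the commit message above; the quotaPool, propose, and submit names are hypothetical stand-ins, not the actual CockroachDB code paths. It shows how dropping the release on propose's error path permanently leaks quota, so a later large command (for example, an AddSSTRequest that needs the entire pool) blocks forever.

```go
package main

import (
	"errors"
	"fmt"
)

// quotaPool is a toy stand-in for a range's proposal quota pool.
type quotaPool struct{ available int64 }

// acquire reserves n bytes of quota and returns a func that puts it back.
func (qp *quotaPool) acquire(n int64) (release func(), err error) {
	if n > qp.available {
		return nil, errors.New("insufficient quota")
	}
	qp.available -= n
	return func() { qp.available += n }, nil
}

// propose stands in for Replica.propose; here it simply fails for oversized
// commands, mimicking a proposal that errors out after quota was acquired.
func propose(cmdSize int64) error {
	if cmdSize > 512 {
		return errors.New("proposal rejected")
	}
	return nil
}

// submit acquires quota, proposes, and (crucially) releases the quota again
// on the error path. Dropping that release() call is the leak fixed by #38632.
func submit(qp *quotaPool, cmdSize int64) error {
	release, err := qp.acquire(cmdSize)
	if err != nil {
		return err
	}
	if err := propose(cmdSize); err != nil {
		release() // without this, cmdSize bytes of quota leak forever
		return err
	}
	// On success the real system releases quota later, once the command applies.
	return nil
}

func main() {
	qp := &quotaPool{available: 1024}
	if err := submit(qp, 600); err != nil {
		fmt.Println("proposal failed:", err)
	}
	// Prints 1024 only because release() ran on the error path; a leak here
	// would leave 424 and starve any later proposal needing the full quota.
	fmt.Println("available quota:", qp.available)
}
```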
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #38632 Jul 3, 2019