roachtest: import/tpch/nodes=32 failed #35493

Closed
cockroach-teamcity opened this issue Mar 7, 2019 · 13 comments · Fixed by #38632
Labels: C-test-failure (broken test, automatically or manually discovered), O-roachtest, O-robot (originated from a bot)

Comments

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/f55596ea8c2bca016d036ec9399b80c17e7cfe93

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1165148&tab=buildLog

The test failed on release-2.1:
	test.go:1202: test timed out (3h0m0s)
	cluster.go:1603,import.go:150,test.go:1214: context canceled
	test.go:978,asm_amd64.s:523,panic.go:513,log.go:219,test.go:1160,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1603,import.go:150,test.go:1214: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190307-1165148/import/tpch/nodes=32/test.log: file already closed

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Mar 7, 2019
@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/4f23ef547ad7af684f7b8cc349be8c1dc4d30aa3

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1204603&tab=buildLog

The test failed on release-2.1:
	test.go:1204: test timed out (3h0m0s)
	cluster.go:1626,import.go:150,test.go:1216: context canceled
	test.go:980,asm_amd64.s:523,panic.go:513,log.go:219,cluster.go:926,context.go:90,cluster.go:916,test.go:1161,asm_amd64.s:522,panic.go:397,test.go:774,test.go:760,cluster.go:1626,import.go:150,test.go:1216: write /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/20190328-1204603/import/tpch/nodes=32/test.log: file already closed

@tbg

tbg commented Apr 1, 2019

@ajwerner are you backporting your replication hardening PRs to release-2.1? Asking because these test failures have unavailable ranges and that typically means we made a bad decision somewhere.

@tbg

tbg commented Apr 1, 2019

(if so, mind taking over this issue and closing when the backports have happened? Thanks!)

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/1a5eabad4511a3371a6b2809d2bfc29e8aff66a6

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1224702&tab=buildLog

The test failed on master:
	cluster.go:1255,import.go:88,test.go:1228: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1224702-import-tpch-nodes-32 --encrypt returned:
		stderr:
		
		stdout:
		t --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --join=35.227.98.255:26257 --enterprise-encryption=path=/mnt/data1/cockroach,key=/mnt/data1/cockroach/aes-128.key,old-key=plain >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		Connection to 35.190.146.42 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1417
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190406 07:43:05.483034 1 cluster_synced.go:1499  command failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/e1b3d0fdf11783203e76e7a2b3add59e8562a58d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249112&tab=buildLog

The test failed on master:
	cluster.go:1107,import.go:124,test.go:1237: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1249112-import-tpch-nodes-32 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach returned:
		stderr:
		
		stdout:
		teamcity-1249112-import-tpch-nodes-32: putting (dist) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach
		...............
		   1: done
		   2: done
		   3: done
		   4: done
		   5: done
		   6: done
		   7: done
		   8: done
		   9: done
		  10: done
		  11: done
		  12: done
		  13: done
		  14: done
		  15: done
		  16: done
		  17: done
		  18: done
		  19: done
		  20: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine [email protected]:./cockroach [email protected]:./cockroach
		ssh_exchange_identification: read: Connection reset by peer
		: exit status 1
		  21: done
		  22: done
		  23: done
		  24: done
		  25: done
		  26: done
		  27: done
		  28: done
		  29: done
		  30: done
		  31: done
		  32: done
		I190419 08:02:05.707056 1 cluster_synced.go:965  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/c8bda1de440cfe90cf23a433119d77795cfa0047

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1292152&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	cluster.go:1833,import.go:192,test.go:1251: pq: communication error: rpc error: code = Canceled desc = context canceled

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/923a3b2a6f4a6492883141092280d1041de1381a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1295056&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1400,import.go:130,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1295056-import-tpch-nodes-32 returned:
		stderr:
		
		stdout:
		ithub.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1461
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333
		~ ./cockroach version
		Connection to 35.227.17.104 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.getCockroachVersion
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:95
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:289
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1461
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190518 07:59:52.570912 1 cluster_synced.go:1543  command failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/bf87ee9d6d5d75cb0ce3bc814fc28f9d16b8ce9d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1303735&tab=buildLog

The test failed on branch=release-2.1, cloud=gce:
	cluster.go:1875,import.go:194,test.go:1251: pq: communication error: rpc error: code = Canceled desc = context canceled

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/3b63648f905715e6c0b055fe9acac9c5b8206196

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1333520&tab=buildLog

The test failed on branch=master, cloud=gce:
	cluster.go:1293,import.go:126,test.go:1248: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod put teamcity-1333520-import-tpch-nodes-32 /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 ./cockroach returned:
		stderr:
		
		stdout:
		key: /root/.ssh/google_compute_engine
		debug1: Authentication succeeded (publickey).
		Authenticated to 104.196.218.227 ([104.196.218.227]:22).
		debug1: channel 0: new [client-session]
		debug1: Requesting [email protected]
		debug1: Entering interactive session.
		debug1: pledge: network
		debug1: channel 0: free: client-session, nchannels 1
		debug1: fd 1 clearing O_NONBLOCK
		Connection to 104.196.218.227 closed by remote host.
		Transferred: sent 2308, received 1408 bytes, in 0.0 seconds
		Bytes per second: sent 26890149.0, received 16404389.0
		debug1: Exit status -1
		: exit status 1
		   8: done
		   9: done
		  10: done
		  11: done
		  12: done
		  13: done
		  14: done
		  15: done
		  16: done
		  17: done
		  18: done
		  19: done
		  20: done
		  21: done
		  22: done
		  23: done
		  24: done
		  25: done
		  26: done
		  27: done
		  28: done
		  29: done
		  30: done
		  31: done
		  32: done
		I190610 08:43:34.293900 1 cluster_synced.go:1019  put /home/agent/work/.go/src/github.com/cockroachdb/cockroach/cockroach.linux-2.6.32-gnu-amd64 failed
		: exit status 1

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/90841a6559df9d9a4724e1d30490951bbdb811b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364443&tab=buildLog

The test failed on branch=provisional_201906271846_v19.2.0-alpha.20190701, cloud=gce:
	test.go:1235: test timed out (3h0m0s)
	cluster.go:1870,import.go:188,test.go:1249: context canceled
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,import.go:188,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1364443-import-tpch-nodes-32 --oneshot --ignore-empty-nodes: exit status 1 3: error: exit status 255
		19: error: exit status 255
		2: error: exit status 255
		29: error: exit status 255
		5: error: exit status 255
		13: error: exit status 255
		1: error: exit status 255
		27: error: exit status 255
		6: error: exit status 255
		7: error: exit status 255
		8: error: exit status 255
		22: error: exit status 255
		28: error: exit status 255
		26: error: exit status 255
		23: error: exit status 255
		9: error: exit status 255
		20: error: exit status 255
		15: error: exit status 255
		16: error: exit status 255
		24: error: exit status 255
		21: error: exit status 255
		25: error: exit status 255
		4: error: exit status 255
		32: error: exit status 255
		10: error: exit status 255
		31: error: exit status 255
		11: error: exit status 255
		30: error: exit status 255
		17: error: exit status 255
		14: error: exit status 255
		18: error: exit status 255
		12: error: exit status 255
		Error:  3: error: exit status 255, 19: error: exit status 255, 2: error: exit status 255, 29: error: exit status 255, 5: error: exit status 255, 13: error: exit status 255, 1: error: exit status 255, 27: error: exit status 255, 6: error: exit status 255, 7: error: exit status 255, 8: error: exit status 255, 22: error: exit status 255, 28: error: exit status 255, 26: error: exit status 255, 23: error: exit status 255, 9: error: exit status 255, 20: error: exit status 255, 15: error: exit status 255, 16: error: exit status 255, 24: error: exit status 255, 21: error: exit status 255, 25: error: exit status 255, 4: error: exit status 255, 32: error: exit status 255, 10: error: exit status 255, 31: error: exit status 255, 11: error: exit status 255, 30: error: exit status 255, 17: error: exit status 255, 14: error: exit status 255, 18: error: exit status 255, 12: error: exit status 255

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/537767ac9daa52b0026bb957d7010e3b88b61071

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1364821&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (3h0m0s)
	cluster.go:1870,import.go:188,test.go:1249: context canceled
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,import.go:188,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1364821-import-tpch-nodes-32 --oneshot --ignore-empty-nodes: exit status 1 22: error: exit status 255
		11: error: exit status 255
		10: error: exit status 255
		27: error: exit status 255
		25: error: exit status 255
		1: error: exit status 255
		15: error: exit status 255
		19: error: exit status 255
		14: error: exit status 255
		12: error: exit status 255
		20: error: exit status 255
		30: error: exit status 255
		28: error: exit status 255
		21: error: exit status 255
		13: error: exit status 255
		31: error: exit status 255
		17: error: exit status 255
		2: error: exit status 255
		29: error: exit status 255
		6: error: exit status 255
		18: error: exit status 255
		8: error: exit status 255
		5: error: exit status 255
		23: error: exit status 255
		32: error: exit status 255
		24: error: exit status 255
		3: error: exit status 255
		16: error: exit status 255
		7: error: exit status 255
		9: error: exit status 255
		26: error: exit status 255
		4: error: exit status 255
		Error:  22: error: exit status 255, 11: error: exit status 255, 10: error: exit status 255, 27: error: exit status 255, 25: error: exit status 255, 1: error: exit status 255, 15: error: exit status 255, 19: error: exit status 255, 14: error: exit status 255, 12: error: exit status 255, 20: error: exit status 255, 30: error: exit status 255, 28: error: exit status 255, 21: error: exit status 255, 13: error: exit status 255, 31: error: exit status 255, 17: error: exit status 255, 2: error: exit status 255, 29: error: exit status 255, 6: error: exit status 255, 18: error: exit status 255, 8: error: exit status 255, 5: error: exit status 255, 23: error: exit status 255, 32: error: exit status 255, 24: error: exit status 255, 3: error: exit status 255, 16: error: exit status 255, 7: error: exit status 255, 9: error: exit status 255, 26: error: exit status 255, 4: error: exit status 255

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/86154ae6ae36e286883d8a6c9a4111966198201d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1367379&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (3h0m0s)
	cluster.go:1870,import.go:188,test.go:1249: context canceled
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,import.go:188,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1367379-import-tpch-nodes-32 --oneshot --ignore-empty-nodes: exit status 1 19: error: exit status 255
		2: error: exit status 255
		5: error: exit status 255
		6: error: exit status 255
		3: error: exit status 255
		12: error: exit status 255
		13: error: exit status 255
		1: error: exit status 255
		4: error: exit status 255
		16: error: exit status 255
		29: error: exit status 255
		11: error: exit status 255
		10: error: exit status 255
		28: error: exit status 255
		17: error: exit status 255
		27: error: exit status 255
		23: error: exit status 255
		20: error: exit status 255
		21: error: exit status 255
		7: error: exit status 255
		8: error: exit status 255
		18: error: exit status 255
		22: error: exit status 255
		30: error: exit status 255
		15: error: exit status 255
		31: error: exit status 255
		24: error: exit status 255
		25: error: exit status 255
		26: error: exit status 255
		14: error: exit status 255
		9: error: exit status 255
		32: error: exit status 255
		Error:  19: error: exit status 255, 2: error: exit status 255, 5: error: exit status 255, 6: error: exit status 255, 3: error: exit status 255, 12: error: exit status 255, 13: error: exit status 255, 1: error: exit status 255, 4: error: exit status 255, 16: error: exit status 255, 29: error: exit status 255, 11: error: exit status 255, 10: error: exit status 255, 28: error: exit status 255, 17: error: exit status 255, 27: error: exit status 255, 23: error: exit status 255, 20: error: exit status 255, 21: error: exit status 255, 7: error: exit status 255, 8: error: exit status 255, 18: error: exit status 255, 22: error: exit status 255, 30: error: exit status 255, 15: error: exit status 255, 31: error: exit status 255, 24: error: exit status 255, 25: error: exit status 255, 26: error: exit status 255, 14: error: exit status 255, 9: error: exit status 255, 32: error: exit status 255

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/ca1ef4d4f8296b213c0b2b140f16e4a97931e6e7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=import/tpch/nodes=32 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1368144&tab=buildLog

The test failed on branch=master, cloud=gce:
	test.go:1235: test timed out (3h0m0s)
	cluster.go:1870,import.go:188,test.go:1249: context canceled
	cluster.go:1033,context.go:122,cluster.go:1022,panic.go:406,test.go:783,test.go:769,cluster.go:1870,import.go:188,test.go:1249: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-1368144-import-tpch-nodes-32 --oneshot --ignore-empty-nodes: exit status 1 11: error: exit status 255
		19: error: exit status 255
		7: error: exit status 255
		1: error: exit status 255
		22: error: exit status 255
		2: error: exit status 255
		30: error: exit status 255
		13: error: exit status 255
		23: error: exit status 255
		31: error: exit status 255
		17: error: exit status 255
		26: error: exit status 255
		25: error: exit status 255
		27: error: exit status 255
		8: error: exit status 255
		24: error: exit status 255
		15: error: exit status 255
		10: error: exit status 255
		18: error: exit status 255
		29: error: exit status 255
		3: error: exit status 255
		21: error: exit status 255
		28: error: exit status 255
		4: error: exit status 255
		5: error: exit status 255
		20: error: exit status 255
		12: error: exit status 255
		32: error: exit status 255
		16: error: exit status 255
		14: error: exit status 255
		9: error: exit status 255
		6: error: exit status 255
		Error:  11: error: exit status 255, 19: error: exit status 255, 7: error: exit status 255, 1: error: exit status 255, 22: error: exit status 255, 2: error: exit status 255, 30: error: exit status 255, 13: error: exit status 255, 23: error: exit status 255, 31: error: exit status 255, 17: error: exit status 255, 26: error: exit status 255, 25: error: exit status 255, 27: error: exit status 255, 8: error: exit status 255, 24: error: exit status 255, 15: error: exit status 255, 10: error: exit status 255, 18: error: exit status 255, 29: error: exit status 255, 3: error: exit status 255, 21: error: exit status 255, 28: error: exit status 255, 4: error: exit status 255, 5: error: exit status 255, 20: error: exit status 255, 12: error: exit status 255, 32: error: exit status 255, 16: error: exit status 255, 14: error: exit status 255, 9: error: exit status 255, 6: error: exit status 255

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 3, 2019
Fixes cockroachdb#34180.
Fixes cockroachdb#35493.
Fixes cockroachdb#36983.
Fixes cockroachdb#37108.
Fixes cockroachdb#37371.
Fixes cockroachdb#37384.
Fixes cockroachdb#37551.
Fixes cockroachdb#37879.
Fixes cockroachdb#38095.
Fixes cockroachdb#38131.
Fixes cockroachdb#38136.
Fixes cockroachdb#38549.
Fixes cockroachdb#38552.
Fixes cockroachdb#38555.
Fixes cockroachdb#38560.
Fixes cockroachdb#38562.
Fixes cockroachdb#38563.
Fixes cockroachdb#38569.
Fixes cockroachdb#38578.
Fixes cockroachdb#38600.

_A lot of the early issues fixed by this had previous failures, but nothing
very recent or actionable. I think it's worth closing them now that they
should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is
no longer released when Replica.propose fails. This used to happen
[here](cockroachdb@1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316),
but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and
`scrub/all-checks/tpcc/w=100` roachtests. About half the time, the
import would stall after a few hours and the roachtest health reports
would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`.
I tracked the stalled latch acquisition to a stalled proposal quota acquisition
by a conflicting command. The range debug page showed the following:

<image>

We see that the leaseholder of the Range has no pending commands
but also no available proposal quota. This indicates a proposal
quota leak, which led to me finding the lost release in this
error case.
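
The reasoning in the paragraph above (no pending commands, yet no available quota) can be written as a small invariant check. This is only a sketch: `rangeQuotaState`, its fields, and `leakedQuota` are hypothetical names chosen for illustration and are not taken from the CockroachDB code or from the range debug page.

```go
package main

// rangeQuotaState is a hypothetical snapshot of the values discussed above.
// The field names are assumptions made for this sketch.
type rangeQuotaState struct {
	pendingCommands int   // commands proposed but not yet finished
	availableQuota  int64 // bytes of proposal quota currently free
	quotaCapacity   int64 // total bytes of proposal quota for the range
}

// leakedQuota returns how many bytes are missing from the pool even though
// nothing is in flight to hold them. While commands are pending, missing
// quota is expected, so the function reports zero in that case.
func leakedQuota(s rangeQuotaState) int64 {
	if s.pendingCommands > 0 {
		return 0
	}
	return s.quotaCapacity - s.availableQuota
}
```

With no pending commands and zero available quota, `leakedQuota` reports the full capacity as leaked, which is the situation the screenshot showed.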

The (now confirmed) theory for what went wrong in these roachtests is that
they are performing imports, which generate a large number of AddSSTRequests.
These requests are typically larger than the available proposal quota
for a range, meaning that they request all of its available quota. The
effect of this is that if even a single byte of quota is leaked, the entire
range will seize up and stall when an AddSSTRequest is issued.
Instrumentation revealed that a ChangeReplicas request with a quota size
equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back
into the pool, causing future requests to get stuck indefinitely waiting for
leaked quota, stalling the entire import.
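
The failure mode described above can be reproduced with a toy model. This is a minimal sketch under stated assumptions, not the actual fix: `quotaPool`, `tryAcquire`, `propose`, and the `releaseOnError` switch are hypothetical names, and the pool is non-blocking so that a stall surfaces as an error instead of a hang.

```go
package main

import "errors"

// quotaPool is a toy, non-blocking stand-in for a per-range proposal quota
// pool. It is hypothetical and does not mirror CockroachDB's types.
type quotaPool struct {
	capacity  int64
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	return &quotaPool{capacity: capacity, available: capacity}
}

// tryAcquire takes min(size, capacity) bytes, so an oversized request asks
// for the whole pool. It reports false when that much quota is not free;
// a blocking pool would wait here instead.
func (p *quotaPool) tryAcquire(size int64) (int64, bool) {
	if size > p.capacity {
		size = p.capacity
	}
	if size > p.available {
		return 0, false
	}
	p.available -= size
	return size, true
}

// release returns n bytes of quota to the pool.
func (p *quotaPool) release(n int64) { p.available += n }

var (
	errProposalRejected = errors.New("proposal rejected")
	errWouldBlock       = errors.New("would block waiting for quota")
)

// propose acquires quota and then simulates a proposal. When fail is true
// the proposal is rejected; releaseOnError selects between the leaky
// behaviour (false) and the corrected behaviour (true).
func propose(p *quotaPool, size int64, fail, releaseOnError bool) error {
	got, ok := p.tryAcquire(size)
	if !ok {
		return errWouldBlock
	}
	if fail {
		if releaseOnError {
			p.release(got) // hand the quota back on the error path
		}
		return errProposalRejected
	}
	// Success path: in this toy model the quota is returned immediately.
	p.release(got)
	return nil
}
```

With a pool of 1 MiB, one failed 1-byte proposal with `releaseOnError` set to false leaves the pool one byte short, and a later 32 MiB request (capped to the full 1 MiB) returns `errWouldBlock`. With `releaseOnError` set to true, the same sequence succeeds.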

Release note: None
craig bot pushed a commit that referenced this issue Jul 3, 2019
38632: storage: release quota on failed Raft proposals r=tbg a=nvanbenschoten

Fixes #34180.
Fixes #35493.
Fixes #36983.
Fixes #37108.
Fixes #37371.
Fixes #37384.
Fixes #37551.
Fixes #37879.
Fixes #38095.
Fixes #38131.
Fixes #38136.
Fixes #38549.
Fixes #38552.
Fixes #38555.
Fixes #38560.
Fixes #38562.
Fixes #38563.
Fixes #38569.
Fixes #38578.
Fixes #38600.

_A lot of the early issues fixed by this had previous failures, but nothing very recent or actionable. I think it's worth closing them now that they should be fixed in the short term._

This fixes a bug introduced in 1ff3556 where Raft proposal quota is no longer released when `Replica.propose` fails. This used to happen [here](1ff3556#diff-4315c7ebf8b8bf7bda469e1e7be82690L316), but that code was accidentally lost in the rewrite.

I tracked this down by running a series of `import/tpch/nodes=4` and `scrub/all-checks/tpcc/w=100` roachtests. About half the time, the import would stall after a few hours and the roachtest health reports would start logging lines like: `n1/s1  2.00  metrics  requests.slow.latch`. I tracked the stalled latch acquisition to a stalled proposal quota acquisition by a conflicting command. The range debug page showed the following:

![Screenshot_2019-07-01 r56 Range Debug Cockroach Console](https://user-images.githubusercontent.com/5438456/60554197-8519c780-9d04-11e9-8cf5-6c46ffbcf820.png)

We see that the Leaseholder of the Range has no pending commands but also no available proposal quota. This indicates a proposal quota leak, which led to me finding the lost release in this error case.

The (now confirmed) theory for what went wrong in these roachtests is that they are performing imports, which generate a large number of AddSSTRequests. These requests are typically larger than the available proposal quota for a range, meaning that they request all of its available quota. The effect of this is that if even a single byte of quota is leaked, the entire range will seize up and stall when an AddSSTRequest is issued. Instrumentation revealed that a ChangeReplicas request with a quota size equal to the leaked amount was failing due to the error:
```
received invalid ChangeReplicasTrigger REMOVE_REPLICA((n3,s3):3): updated=[(n1,s1):1 (n4,s4):2 (n2,s2):4] next=5 to remove self (leaseholder)
```
Because of the missing error handling, this quota was not being released back into the pool, causing future requests to get stuck indefinitely waiting for leaked quota, stalling the entire import.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #38632 Jul 3, 2019