
roachprod: connection closed by remote host, or exit status -1 #36929

Closed
cockroach-teamcity opened this issue Apr 18, 2019 · 5 comments
Labels
A-roachprod branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity

SHA: https://github.com/cockroachdb/cockroach/commits/837e946efc272bd8a9e0e08484733f8755ff5ab1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/jobcoordinator/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1247401&tab=buildLog

The test failed on release-19.1:
	cluster.go:1255,scaledata.go:81,scaledata.go:53,test.go:1237: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1247401-scaledata-jobcoordinator-nodes-6:1-6 returned:
		stderr:
		
		stdout:
		COCKROACH_SKIP_ENABLING_DIAGNOSTIC_REPORTING=1 COCKROACH_ENABLE_RPC_COMPRESSION=false ./cockroach start --insecure --store=path=/mnt/data1/cockroach --log-dir=${HOME}/logs --background --cache=25% --max-sql-memory=25% --port=26257 --http-port=26258 --locality=cloud=gce,region=us-east1,zone=us-east1-b --join=34.73.67.22:26257 >> ${HOME}/logs/cockroach.stdout.log 2>> ${HOME}/logs/cockroach.stderr.log || (x=$?; cat ${HOME}/logs/cockroach.stderr.log; exit $x)
		Connection to 35.243.239.241 closed by remote host.
		
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.Cockroach.Start.func7
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cockroach.go:397
		github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install.(*SyncedCluster).Parallel.func1.1
			/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/install/cluster_synced.go:1420
		runtime.goexit
			/usr/local/go/src/runtime/asm_amd64.s:1333: 
		I190418 13:30:07.072436 1 cluster_synced.go:1502  command failed
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 19.1 milestone Apr 18, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Apr 18, 2019
@nvanbenschoten

Connection to 35.243.239.241 closed by remote host.

It's possible that this is addressed by #37001.

@nvanbenschoten nvanbenschoten changed the title roachtest: scaledata/jobcoordinator/nodes=6 failed roachtest: connection closed by remote host Apr 29, 2019
@nvanbenschoten nvanbenschoten removed their assignment Apr 29, 2019
@tbg

tbg commented May 9, 2019

This just happened in #37421 (comment), and it looks like we were failing to connect to the node that was the base for the treedist algorithm, i.e. nine other nodes were SSHing into it at the same time. That might be a clue.
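One hedged hypothesis for the simultaneous-SSH failure mode: sshd throttles unauthenticated connections via its MaxStartups setting, and connections dropped by that throttle show up client-side as "closed by remote host". A minimal sketch of what the start:rate:full triple means, parsed in shell (the values shown are OpenSSH's documented defaults, not anything confirmed from these clusters):

```shell
# sshd's MaxStartups triple (default "10:30:100"): once `start`
# unauthenticated connections are pending, sshd drops new ones with
# probability `rate`%, and refuses all of them once `full` are pending.
# Under a 9-way SSH fan-in this is at least plausible as a drop source.
spec="10:30:100"
start="${spec%%:*}"                    # pending connections before throttling begins
rate="$(echo "$spec" | cut -d: -f2)"   # drop probability (percent) while throttling
full="${spec##*:}"                     # hard cap: all new connections refused
echo "start=$start rate=$rate full=$full"
```

On a live node, `sudo sshd -T | grep -i maxstartups` shows the effective value; raising `start`/`full` in sshd_config would be one experiment if the fan-in theory holds.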

tbg added a commit to tbg/cockroach that referenced this issue May 9, 2019
See cockroachdb#36929. Whenever these flakes happen, it'll be good to have verbose
logs. Anecdotally we're seeing fewer of them now, perhaps due to cockroachdb#37001.

Release note: None
@tbg

tbg commented May 9, 2019

Another interesting thing, both above and in #37289, is that it took ~2m for `roachprod put` to distribute the binary (before it failed). That's far longer than distributing the binary should take, and I suspect most of that time was spent waiting for the connection to fail.
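For a rough sense of why ~2m is anomalous, a back-of-envelope sketch (both numbers below are assumptions for illustration, not measurements from this run):

```shell
# Assumed figures: a ~200 MiB binary and a conservative 20 MiB/s effective
# intra-GCE transfer rate. Even with these pessimistic numbers the copy
# itself should take on the order of seconds, so a ~2m `roachprod put`
# points at time spent waiting on the connection, not moving bytes.
size_mib=200
rate_mib_s=20
echo "expected transfer: ~$((size_mib / rate_mib_s))s"
```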

@tbg

tbg commented May 9, 2019

Hmm, in #37289 something else might actually be going on. The -v scp output has this at the end:

Connection to 34.73.40.76 closed by remote host.
Transferred: sent 2308, received 1408 bytes, in 0.0 seconds
Bytes per second: sent 28896876.5, received 17628597.1
debug1: Exit status -1

I can't quite tell whether this means scp tried to transfer something and failed, or whether it simply never found the source file. Either way, the two-minute duration is suspicious, but this may be a different problem from the one above.
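For what it's worth, OpenSSH initializes the session exit value to -1 and only overwrites it when the server sends an exit-status message, so "Exit status -1" is consistent with the connection dying before the remote command finished rather than with scp reporting a failure of its own (my reading of the client behavior, not confirmed from these logs). A hypothetical triage helper for scanning artifacts for both signatures discussed in this issue (the artifacts directory is a placeholder):

```shell
# Scan a directory of roachtest/roachprod artifacts (path is an
# assumption) for the two flake signatures seen in this issue.
logdir="${1:-.}"
grep -rnE 'closed by remote host|Exit status -1' "$logdir" \
  || echo "no flake signatures under $logdir"
```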

craig bot pushed a commit that referenced this issue May 9, 2019
37424: roachprod: verbose sshd logging r=ajwerner a=tbg

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg tbg changed the title roachtest: connection closed by remote host roachprod: connection closed by remote host, or exit status -1 May 9, 2019
@irfansharif

Going to close this due to inactivity. We now have better logging around infra flakes, so any recurrence should be caught elsewhere.
