
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46774

Closed
cockroach-teamcity opened this issue Mar 31, 2020 · 8 comments
@cockroach-teamcity
Member

(roachtest).scaledata/distributed_semaphore/nodes=3 failed on release-20.1@6a7ca722a135e21ad04daec3895535969ba5b02c:

		    main.(*monitor).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2344
		    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_073723.919_n4_distributedsemaphore_:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.runSqlapp.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		    main.(*monitor).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2344
		    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1838926-1585639990-12-n4cpu4:4 -- ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.217:26257,10.128.0.226:26257,10.128.0.234:26257'  returned:
		    stderr:
		    ckroach.go:54pq: communication error: rpc error: code = Canceled desc = context canceled
		    2020/03/31 07:38:52 ExecuteTx retry attempt 1 failed, started at 2020-03-31 07:38:50.271151788 +0000 UTC m=+85.628556355, now = 2020-03-31 07:38:52.328816989 +0000 UTC m=+87.686221567, took 2.057665212s
		    2020/03/31 07:38:52 pq error - Error code : 58C01, Error class : 58
		    2020/03/31 07:38:52 pq error - Error code : 58C01, Error class : 58
		    2020/03/31 07:38:52 Aborting Retries because this error of type *pq.Error is not retryable : pq: communication error: rpc error: code = Canceled desc = context canceled
		    2020/03/31 07:38:52 postgres error code is 58C01 and class is 58
		    2020/03/31 07:38:52 pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error: SSH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.217:26257,10.128.0.226:26257,10.128.0.234:26257'
		        ```
		      - exit status 255
		    
		    stdout::
		  - exit status 10


Artifacts: /scaledata/distributed_semaphore/nodes=3

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 31, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 31, 2020
@nvanbenschoten
Member

@irfansharif do you mind taking a look at this? It's possible that you've already fixed this issue.

@irfansharif
Contributor

irfansharif commented Apr 2, 2020

This looks like an infra flake, which would also explain the recent retry errors we've seen in the scaledata tests. @jlinder, mind teaching me why an exit code of 255 is considered an SSH connection failure? I couldn't find anything googling for it.

```go
if exitErr.ExitCode() == 255 {
	return SSH{err}
}
```

@jlinder
Collaborator

jlinder commented Apr 2, 2020

@irfansharif Sure. From the ssh man page:

> ssh exits with the exit status of the remote command or with 255 if an error occurred.

Thus, exit code 255 covers any error from ssh itself (timeout, connection error, invalid credentials, ...).
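
A minimal sketch of how a caller can tell the two cases apart with os/exec (`user@host` and the remote command are placeholders, not anything from this test):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// ssh forwards the remote command's exit status; 255 is
	// reserved for ssh's own failures.
	err := exec.Command("ssh", "user@host", "true").Run()
	if exitErr, ok := err.(*exec.ExitError); ok {
		if exitErr.ExitCode() == 255 {
			fmt.Println("ssh itself failed (timeout, connection, credentials, ...)")
		} else {
			fmt.Printf("remote command exited with %d\n", exitErr.ExitCode())
		}
	}
}
```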

Taking a look at the logs.

@irfansharif
Contributor

I guess what I was curious about is exactly what kind of error occurred. Is there a way for the ssh process itself to emit logs for consumption? Naively I'd expect something like that to exist and to be part of the build artifacts for all these roachtests, given how heavily we rely on ssh.

@jlinder
Collaborator

jlinder commented Apr 2, 2020

What the logs tell us is that one of the host VMs appears to have disappeared. This is determinable from two things: 1) the SSH_PROBLEM report, and 2) the roachprod monitor failure indicating that a cockroach node had died.

The information about what happened gets printed to stdout / stderr by roachprod. This test appears to have captured all the output for the failed call in the file run_1/run_073723.919_n4_distributedsemaphore_.failed.log, and it looks like the ssh connection died while the distributed_semaphore command was running. I figured out the name of the file to look at by seeing this code line mentioned in the stack trace for SSH_PROBLEM, then going to the cluster.RunE() method and seeing how it captures the log info and names the log file.
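
For the curious, the capture pattern is roughly the sketch below (illustrative only, not the actual cluster.RunE() code; the log file name and ssh invocation are placeholders lifted from this failure):

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

// runAndCapture streams the command's output both to the console and
// to a per-run log file, so a failed run leaves a *.failed.log behind.
// A sketch of the pattern only; names are hypothetical.
func runAndCapture(logPath string, name string, args ...string) error {
	f, err := os.Create(logPath)
	if err != nil {
		return err
	}
	defer f.Close()

	cmd := exec.Command(name, args...)
	cmd.Stdout = io.MultiWriter(os.Stdout, f)
	cmd.Stderr = io.MultiWriter(os.Stderr, f)
	return cmd.Run()
}

func main() {
	_ = runAndCapture("run_073723.919_n4_distributedsemaphore_.failed.log",
		"ssh", "user@host", "./distributed_semaphore", "--duration_secs=600")
}
```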

There were two very similar issues raised yesterday. See this GitHub issue and this slack thread (where most of the conversation happened). One difference in the captured logs is that they didn't contain a log file like the one I noted above.

@irfansharif
Contributor

@jlinder thanks for the details! For my own learning again, how do “VMs disappear”?

I was actually asking about a different kind of log that doesn't exist today. From that slack thread, re: “255 [...] covers connection timeouts, connection issues after a connection is established, errors with credentials, and everything else”, is there any way to have ssh itself log the errors it encounters to some file? Ideally things would be a little more visible than the opaque 255 error. My motivation here is #35326 (comment), which I was looking at a few months ago. The problem there was due to ssh, and it was pretty difficult to diagnose at the time. SSH_PROBLEM being a thing back then would've certainly helped, but I'm curious if there's more that could be surfaced (to perhaps better understand why our infra is as flaky as it is, and to perhaps motivate hardening our roachprod code to be more tolerant of ssh failures).

@jlinder
Collaborator

jlinder commented Apr 2, 2020

@irfansharif Gladly!

"VMs disappearing" can happen in a couple ways. 1) the physical hardware of the computer it's on in the cloud provider's datacenter goes offline (some hardware failure in the machine that takes the machine offline, a power failure to the machine / rack / data center, a router failure in the data center, ...). 2) something in the software of the VM gets out of control and causes the VM to stop responding (out of disk space / ram / swap such that it locks up, all the file descriptors are in use, somehow rm -rf / gets run by a user with root perms, ...).

Telling ssh to be verbose in its output (ssh -v) is how to get more information from ssh about why it died. I see some commented-out code in roachprod that did this; there's something to look at more there, and I have it on my radar.

The motivation for adding better exit codes and related easy-to-search-for log comments to roachprod was specifically to 1) get a better handle on how many infra-specific vs. non-infra problems are happening with roachtest/roachprod in TeamCity, and be able to easily find those problems in the logs, and 2) make it easier for engineers to understand the errors happening in the logs. Reducing infra flakiness here is part of 1). I'm not sure what can be done to make roachprod code more ssh failure tolerant; perhaps that will become clearer over time.

Note: some places in the roachprod code that open ssh connections will not report SSH_PROBLEM yet. There are two or three (at least) mechanisms that invoke ssh, and I only made the change for one of them. Errors from the other ones will show up under UNCLASSIFIED_PROBLEM until I (or someone) gets to changing them.
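
A rough sketch of that classification, reusing the 255 check quoted earlier (the wrapper types here are illustrative; only the SSH one appears in the snippet above, and Unclassified is a made-up stand-in for the fallback path):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// SSH mirrors the wrapper from the snippet quoted earlier;
// Unclassified is a hypothetical stand-in for the fallback.
type SSH struct{ err error }

func (e SSH) Error() string { return "SSH_PROBLEM: " + e.err.Error() }

type Unclassified struct{ err error }

func (e Unclassified) Error() string { return "UNCLASSIFIED_PROBLEM: " + e.err.Error() }

// classify tags an error with an easy-to-grep marker: ssh's own exit
// code 255 becomes SSH_PROBLEM, everything else stays unclassified.
func classify(err error) error {
	if err == nil {
		return nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 255 {
		return SSH{err}
	}
	return Unclassified{err}
}

func main() {
	err := exec.Command("ssh", "user@host", "true").Run()
	fmt.Println(classify(err))
}
```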

@irfansharif
Contributor

> I'm not sure what can be done to make roachprod code more ssh failure tolerant; perhaps that will become clearer over time.

Very broadly, I imagine it looking something like ensuring all our roachprod run commands are idempotent, so roachprod could simply re-issue commands on flaky connection errors and the like. Sadly, that's not a thing today.
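
Purely hypothetically, since nothing like this exists today, that re-issue loop might look like the sketch below (retrySSH is a made-up name, and it assumes runOnce is idempotent):

```go
package main

import (
	"errors"
	"os/exec"
)

// retrySSH re-issues an idempotent command when the failure looks like
// an SSH-level flake (ssh's own exit code 255), up to maxAttempts.
// Purely a sketch of what failure-tolerant roachprod plumbing might
// look like; nothing like this exists in roachprod today.
func retrySSH(maxAttempts int, runOnce func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = runOnce(); err == nil {
			return nil
		}
		var exitErr *exec.ExitError
		if !errors.As(err, &exitErr) || exitErr.ExitCode() != 255 {
			return err // not an SSH flake; don't retry
		}
	}
	return err
}

func main() {
	_ = retrySSH(3, func() error {
		return exec.Command("ssh", "user@host", "./distributed_semaphore").Run()
	})
}
```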

@nvanbenschoten: I'll close this issue out; there's nothing here for me to do.
