
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46774

Closed
cockroach-teamcity opened this issue Mar 31, 2020 · 8 comments
@cockroach-teamcity
Member

(roachtest).scaledata/distributed_semaphore/nodes=3 failed on release-20.1@6a7ca722a135e21ad04daec3895535969ba5b02c:

		    main.(*monitor).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2344
		    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: output in %s
		    -- arg 1: <string>
		  - output in run_073723.919_n4_distributedsemaphore_:
		  - error with attached stack trace:
		    main.execCmd
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:406
		    main.(*cluster).RunL
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2019
		    main.(*cluster).RunE
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2000
		    main.runSqlapp.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/scaledata.go:108
		    main.(*monitor).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2344
		    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
		    	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
		    runtime.goexit
		    	/usr/local/go/src/runtime/asm_amd64.s:1357
		  - error with embedded safe details: %s returned:
		    stderr:
		    %s
		    stdout:
		    %s
		    -- arg 1: <string>
		    -- arg 2: <string>
		    -- arg 3: <string>
		  - /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1838926-1585639990-12-n4cpu4:4 -- ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.217:26257,10.128.0.226:26257,10.128.0.234:26257'  returned:
		    stderr:
		    ckroach.go:54pq: communication error: rpc error: code = Canceled desc = context canceled
		    2020/03/31 07:38:52 ExecuteTx retry attempt 1 failed, started at 2020-03-31 07:38:50.271151788 +0000 UTC m=+85.628556355, now = 2020-03-31 07:38:52.328816989 +0000 UTC m=+87.686221567, took 2.057665212s
		    2020/03/31 07:38:52 pq error - Error code : 58C01, Error class : 58
		    2020/03/31 07:38:52 pq error - Error code : 58C01, Error class : 58
		    2020/03/31 07:38:52 Aborting Retries because this error of type *pq.Error is not retryable : pq: communication error: rpc error: code = Canceled desc = context canceled
		    2020/03/31 07:38:52 postgres error code is 58C01 and class is 58
		    2020/03/31 07:38:52 pq: communication error: rpc error: code = Canceled desc = context canceled
		    Error: SSH_PROBLEM:
		      - error with user detail: Node 4. Command with error:
		        ```
		        ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.217:26257,10.128.0.226:26257,10.128.0.234:26257'
		        ```
		      - exit status 255
		    
		    stdout::
		  - exit status 10


Artifacts: /scaledata/distributed_semaphore/nodes=3

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 31, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone Mar 31, 2020
@nvanbenschoten
Member

@irfansharif do you mind taking a look at this? It's possible that you've already fixed this issue.

@irfansharif
Contributor

irfansharif commented Apr 2, 2020

This looks like an infra flake, which would also explain the recent retry errors we've seen in the scaledata tests. @jlinder, mind teaching me why an exit code of 255 is considered an SSH connection failure? I couldn't find anything googling for it.

```go
if exitErr.ExitCode() == 255 {
	return SSH{err}
}
```

@jlinder
Collaborator

jlinder commented Apr 2, 2020

@irfansharif Sure. From the ssh man page:

> ssh exits with the exit status of the remote command or with 255 if an error occurred.

Thus, exit code 255 covers any error from ssh itself (timeout, connection error, invalid credentials, ...).
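
A minimal sketch of how a caller can tell the two cases apart with os/exec (`user@host` and the remote command are placeholders, not anything from this test):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// ssh forwards the remote command's exit status; 255 is
	// reserved for ssh's own failures.
	err := exec.Command("ssh", "user@host", "true").Run()
	if exitErr, ok := err.(*exec.ExitError); ok {
		if exitErr.ExitCode() == 255 {
			fmt.Println("ssh itself failed (timeout, connection, credentials, ...)")
		} else {
			fmt.Printf("remote command exited with %d\n", exitErr.ExitCode())
		}
	}
}
```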

Taking a look at the logs.

@irfansharif
Contributor

I guess what I was curious about is exactly what kind of error occurred. Is there a way for the ssh process itself to emit logs for consumption? Naively I'd expect something like that to exist and to be part of the build artifacts for all these roachtests, given how heavily we rely on ssh.

@jlinder
Collaborator

jlinder commented Apr 2, 2020

What the logs tell us is that one of the host VMs appears to have disappeared. This is determinable from two things: 1) the SSH_PROBLEM report, and 2) the roachprod monitor failure indicating that a cockroach node had died.

The information about what happened gets printed to stdout / stderr by roachprod. This test appears to have captured all the output for the failed call in the file run_1/run_073723.919_n4_distributedsemaphore_.failed.log, and it looks like the ssh connection died while the distributed_semaphore command was running. I figured out the name of the file to look at by seeing this code line mentioned in the stack trace for SSH_PROBLEM, then going to the cluster.RunE() method and seeing how it captures the log info and names the log file.
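
For the curious, the capture pattern is roughly the sketch below (illustrative only, not the actual cluster.RunE() code; the log file name and ssh invocation are placeholders lifted from this failure):

```go
package main

import (
	"io"
	"os"
	"os/exec"
)

// runAndCapture streams the command's output both to the console and
// to a per-run log file, so a failed run leaves a *.failed.log behind.
// A sketch of the pattern only; names are hypothetical.
func runAndCapture(logPath string, name string, args ...string) error {
	f, err := os.Create(logPath)
	if err != nil {
		return err
	}
	defer f.Close()

	cmd := exec.Command(name, args...)
	cmd.Stdout = io.MultiWriter(os.Stdout, f)
	cmd.Stderr = io.MultiWriter(os.Stderr, f)
	return cmd.Run()
}

func main() {
	_ = runAndCapture("run_073723.919_n4_distributedsemaphore_.failed.log",
		"ssh", "user@host", "./distributed_semaphore", "--duration_secs=600")
}
```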

There were two very similar issues raised yesterday. See this GitHub issue and this slack thread (where most of the conversation happened). One difference in the captured logs is that they didn't contain a log file like the one I noted above.

@irfansharif
Contributor

@jlinder thanks for the details! For my own learning again, how do “VMs disappear”?

I was actually asking about a different kind of log that doesn't exist today. From that slack thread, re: “255 [...] covers connection timeouts, connection issues after a connection is established, errors with credentials, and everything else”, is there any way to have ssh itself log the errors it encounters to some file? Ideally things would be a little more visible than the opaque 255 error. My motivation here is #35326 (comment), which I was looking at a few months ago. The problem there was due to ssh, and it was pretty difficult to diagnose at the time. SSH_PROBLEM being a thing back then would've certainly helped, but I'm curious if there's more that could be surfaced (to perhaps better understand why our infra is as flaky as it is, and to perhaps motivate hardening our roachprod code to be more tolerant of ssh failures).

@jlinder
Collaborator

jlinder commented Apr 2, 2020

@irfansharif Gladly!

"VMs disappearing" can happen in a couple ways. 1) the physical hardware of the computer it's on in the cloud provider's datacenter goes offline (some hardware failure in the machine that takes the machine offline, a power failure to the machine / rack / data center, a router failure in the data center, ...). 2) something in the software of the VM gets out of control and causes the VM to stop responding (out of disk space / ram / swap such that it locks up, all the file descriptors are in use, somehow rm -rf / gets run by a user with root perms, ...).

Telling ssh to be verbose in its output (ssh -v) is how to get more information from ssh about why it died. I see some commented-out code in roachprod that did this; there's something to look at more there, and I have it on my radar.

The motivation for adding better exit codes and related easy-to-search-for log comments to roachprod was specifically to 1) get a better handle on how many infra-specific vs. non-infra problems are happening with roachtest/roachprod in TeamCity, and be able to easily find those problems in the logs, and 2) make it easier for engineers to understand the errors happening in the logs. Reducing infra flakiness here is part of 1). I'm not sure what can be done to make roachprod code more ssh failure tolerant; perhaps that will become clearer over time.

Note: some places in the roachprod code that open ssh connections will not report SSH_PROBLEM yet. There are two or three (at least) mechanisms that invoke ssh, and I only made the change for one of them. Errors from the other ones will show up under UNCLASSIFIED_PROBLEM until I (or someone) gets to changing them.
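
A rough sketch of that classification, reusing the 255 check quoted earlier (the wrapper types here are illustrative; only the SSH one appears in the snippet above, and Unclassified is a made-up stand-in for the fallback path):

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// SSH mirrors the wrapper from the snippet quoted earlier;
// Unclassified is a hypothetical stand-in for the fallback.
type SSH struct{ err error }

func (e SSH) Error() string { return "SSH_PROBLEM: " + e.err.Error() }

type Unclassified struct{ err error }

func (e Unclassified) Error() string { return "UNCLASSIFIED_PROBLEM: " + e.err.Error() }

// classify tags an error with an easy-to-grep marker: ssh's own exit
// code 255 becomes SSH_PROBLEM, everything else stays unclassified.
func classify(err error) error {
	if err == nil {
		return nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == 255 {
		return SSH{err}
	}
	return Unclassified{err}
}

func main() {
	err := exec.Command("ssh", "user@host", "true").Run()
	fmt.Println(classify(err))
}
```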

@irfansharif
Contributor

> I'm not sure what can be done to make roachprod code more ssh failure tolerant; perhaps that will become clearer over time.

Very broadly, I imagine it looking something like ensuring all our roachprod run commands are idempotent, so roachprod could simply re-issue commands on flaky connection errors and the like. Sadly, that's not a thing today.
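
Purely hypothetically, since nothing like this exists today, that re-issue loop might look like the sketch below (retrySSH is a made-up name, and it assumes runOnce is idempotent):

```go
package main

import (
	"errors"
	"os/exec"
)

// retrySSH re-issues an idempotent command when the failure looks like
// an SSH-level flake (ssh's own exit code 255), up to maxAttempts.
// Purely a sketch of what failure-tolerant roachprod plumbing might
// look like; nothing like this exists in roachprod today.
func retrySSH(maxAttempts int, runOnce func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = runOnce(); err == nil {
			return nil
		}
		var exitErr *exec.ExitError
		if !errors.As(err, &exitErr) || exitErr.ExitCode() != 255 {
			return err // not an SSH flake; don't retry
		}
	}
	return err
}

func main() {
	_ = retrySSH(3, func() error {
		return exec.Command("ssh", "user@host", "./distributed_semaphore").Run()
	})
}
```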

@nvanbenschoten: I'll close this issue out; there's nothing here for me to do.
