roachtest: scaledata/distributed_semaphore/nodes=3 failed #46774
Comments
@irfansharif do you mind taking a look at this? It's possible that you've already fixed this issue.
This looks like an infra flake, and would also explain the recent retry errors we've seen in the scaledata tests. @jlinder, mind teaching me why an error code of 255 is considered to be a failure in the SSH connection? I couldn't find anything googling for it.

cockroach/pkg/cmd/roachprod/errors/errors.go, lines 148 to 150 at e3a23df
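For context, a minimal sketch of the kind of classification those lines presumably perform (the helper and constant names below are hypothetical, not the actual roachprod API; only the 255 convention and the SSH_PROBLEM label come from this thread):

```go
package main

import (
	"errors"
	"os/exec"
)

// 255 is the exit status ssh itself returns when it fails (connection
// timeout, connection dropped, bad credentials, ...); any other status
// is the remote command's own exit code, passed through by ssh.
const sshExitCode = 255

// classify is a hypothetical helper: it decides whether a failed remote
// command should be reported as an SSH/infra flake or as a genuine
// failure of the command that ran on the node.
func classify(err error) string {
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == sshExitCode {
		return "SSH_PROBLEM" // ssh itself failed
	}
	return "COMMAND_PROBLEM" // the remote command failed
}
```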
@irfansharif Sure. From the ssh man page: "ssh exits with the exit status of the remote command or with 255 if an error occurred."

Thus, an exit code of 255 indicates an error from ssh itself (timeout, connection error, invalid credentials, ...). Taking a look at the logs.
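A quick way to see that behavior from Go (just a sketch; the host is a placeholder): run a command through ssh and inspect the exit status. 255 means ssh itself failed; anything else is the remote command's status passed through.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Run `false` on a placeholder remote host; `false` exits with 1.
	cmd := exec.Command("ssh", "user@host", "false")
	err := cmd.Run()

	if exitErr, ok := err.(*exec.ExitError); ok {
		switch code := exitErr.ExitCode(); code {
		case 255:
			// ssh-level failure: timeout, refused connection, bad credentials, ...
			fmt.Println("ssh itself failed (exit 255)")
		default:
			// ssh connected fine; this is the remote command's exit status (1 here).
			fmt.Printf("remote command exited with %d\n", code)
		}
	}
}
```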
I guess what I was curious about is exactly what kind of error occurred. Is there a way for the ssh process itself to emit logs for consumption? Naively I'd expect something like that to exist and be part of the build artifacts for all these roachtests, given how heavily we rely on ssh.
What the logs tell us is that it appears one of the host VMs disappeared. This is determinable from two things: 1) the SSH_PROBLEM report and 2) that the

The information about what happened gets printed to stdout / stderr by roachprod. This test appears to have captured all the output for the failed call in the file

There were two very similar issues raised yesterday. See this GitHub issue and this slack thread (where most of the conversation happened). One difference in the captured logs is that they didn't contain a log file like the one I noted above.
@jlinder thanks for the details! For my own learning again, how do “VMs disappear”? I was actually asking about a different kind of log that doesn't exist today. From that slack thread, re: “255 [...] covers connection timeouts, connection issues after a connection is established, errors with credentials, and everything else”, is there any way to have ssh tell us which of those cases actually happened?
@irfansharif Gladly! "VMs disappearing" can happen in a couple of ways: 1) the physical hardware of the computer it's on in the cloud provider's datacenter goes offline (some hardware failure in the machine that takes the machine offline, a power failure to the machine / rack / data center, a router failure in the data center, ...), or 2) something in the software of the VM gets out of control and causes the VM to stop responding (out of disk space / RAM / swap such that it locks up, all the file descriptors are in use, somehow

Telling ssh to be verbose in its output would give more information (e.g. passing -v / -vv / -vvv).

The motivation for adding better exit codes and related easy-to-search-for log comments to roachprod was specifically to 1) get a better handle on how many infra-specific vs. non-infra problems are happening with roachtest/roachprod in TeamCity and be able to easily find those problems in the logs, and 2) make it easier for engineers to understand the errors that appear in the logs. Reducing infra flakiness here is part of 1). I'm not sure what can be done to make the roachprod code more tolerant of ssh failures; perhaps that will become clearer over time.

Note: some places in the roachprod code that open ssh connections will not report
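Following up on the earlier question about getting ssh's own logs into the test artifacts, a sketch of what that could look like, assuming we're willing to pass -vvv and tee ssh's stderr into a per-invocation file (the helper name, host, and paths here are made up, not existing roachprod code):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
)

// runSSHWithLog runs a remote command with verbose ssh logging and writes
// ssh's own diagnostics (stderr under -vvv) to logPath, so the file can be
// collected alongside the other test artifacts. Hypothetical helper.
func runSSHWithLog(host, remoteCmd, logPath string) error {
	logFile, err := os.Create(logPath)
	if err != nil {
		return err
	}
	defer logFile.Close()

	cmd := exec.Command("ssh", "-vvv", host, remoteCmd)
	cmd.Stdout = os.Stdout
	// ssh writes its -vvv diagnostics to stderr; keep a copy in the log
	// file while still surfacing it on the console.
	cmd.Stderr = io.MultiWriter(os.Stderr, logFile)

	if err := cmd.Run(); err != nil {
		return fmt.Errorf("ssh to %s failed: %w", host, err)
	}
	return nil
}

func main() {
	// Example invocation; host and paths are placeholders.
	if err := runSSHWithLog("user@host", "uptime", "artifacts/ssh_node1.log"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```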
Very broadly I imagine it to look something like ensuring all our

@nvanbenschoten: I'll close this issue out, there's nothing here for me to do.
(roachtest).scaledata/distributed_semaphore/nodes=3 failed on release-20.1@6a7ca722a135e21ad04daec3895535969ba5b02c:
Artifacts: /scaledata/distributed_semaphore/nodes=3
Related:
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46604 (C-test-failure, O-roachtest, O-robot, branch-release-19.2, release-blocker)
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46602 (C-test-failure, O-roachtest, O-robot, branch-release-19.1, release-blocker)
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46455 (C-test-failure, O-roachtest, O-robot, branch-provisional_202003240059_v20.1.0-beta.4, release-blocker)
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46347 (C-test-failure, O-roachtest, O-robot, branch-provisional_202003200044_v20.1.0-beta.3, release-blocker)
roachtest: scaledata/distributed_semaphore/nodes=3 failed #46281 (C-test-failure, O-roachtest, O-robot, branch-provisional_202003181957_v20.1.0-beta.3, release-blocker)
roachtest: scaledata/distributed_semaphore/nodes=3 failed #43839 (C-test-failure, O-roachtest, O-robot, branch-master)
See this test on roachdash
powered by pkg/cmd/internal/issues