Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

persistent "connection refused" errors after upgrade #1026

Closed
tamird opened this issue Dec 19, 2016 · 22 comments
Closed

persistent "connection refused" errors after upgrade #1026

tamird opened this issue Dec 19, 2016 · 22 comments

Comments

@tamird
Copy link
Contributor

tamird commented Dec 19, 2016

After merging cockroachdb/cockroach#9697 which bumped our version of grpc from 79b7c34 to 777daa1, we began to see the following error on one of our servers:

E161219 19:33:29.500130 457 storage/queue.go:597  [n2,replicaGC,s4,r6243/3:/Table/51/1/76{29630…-30792…},@c420d70900] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused"

Note that most of the error is produced by our code, but the final fragment comes from grpc:

rpc error: code = 13 desc = connection error: desc = "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused"

This error first appeared after one of our other servers was restarted - during this time, some connection refused errors are expected, but the grpc internal reconnection logic should have cleared them after the other server became available.

The very last fragment of this error comes from net, not grpc:

write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused

and this is the strangest part - this appears to be a net.OpError wrapping a os.SyscallError, but note that the supposed syscall was write and the supported error was ECONNREFUSED, which is not an error that write should ever return, as far as I can tell.

cc @bradfitz @petermattis

@bradfitz
Copy link
Contributor

I don't know what version of Go you're talking about here. If you find a problem with the net package in Go 1.8 compared to Go 1.7, please file a bug at https://github.com/golang/go/issues/new . Or if you find something underdocumented, likewise file a bug.

I don't know what's changed in gRPC lately to know what's biting you. I don't regularly work on gRPC.

@tamird
Copy link
Contributor Author

tamird commented Dec 19, 2016 via email

@bdarnell
Copy link
Contributor

note that the supposed syscall was write and the supported error was ECONNREFUSED, which is not an error that write should ever return, as far as I can tell.

Writing to a socket can return ECONNREFUSED (on linux, but not bsd) if the connect was in non-blocking mode and you haven't consumed the error with getsockopt (or maybe this is a quirk of the python socket wrappers? I haven't tried to reproduce this in go)

>>> import socket
>>> s=socket.socket()
>>> s.setblocking(False)
>>> s.connect(('localhost', 54321))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 115] Operation now in progress
>>> s.send(b'asdf')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.error: [Errno 111] Connection refused

We've seen issues caused by write() returning unexpected errors before, in golang/go#14548.

@F21
Copy link

F21 commented Jan 12, 2017

I am also seeing something similiar, but in my case, I have only started using grpc since 708a7f9. I am currently using Go 1.8Beta2.

In my case, I have a grpc server running as a http handler for an http server using ServeHTTP(). I also have mutual TLS authentication enabled and am using a self-signed certificate for the http server.

The set up is being used in an end-to-end test.

This is what is being logged:

2017/01/12 09:18:13 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2017/01/12 09:18:13 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:14 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:16 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:19 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:22 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:29 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}
2017/01/12 09:18:41 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::1]:443: getsockopt: connection refused"; Reconnecting to {localhost:443 <nil>}

If I start the server and use time.Sleep() to wait before dialling to the server, I still get these errors, which is really strange. According to the timestamps, these happen upon the first connection. Some how, the connection does succeed and the test passes.

@bradfitz
Copy link
Contributor

Yeah, @F21, I confirm that TestServerCredsDispatch loops forever with that failure log message when using Go 1.8rc1.

@bradfitz
Copy link
Contributor

Wait, I also see this on Go 1.7.

@bradfitz
Copy link
Contributor

@MakMukhi, are you the owner of this?

@bradfitz
Copy link
Contributor

This is still happening (I'm at 5095579):

$ go test -v -run=TestServerCredsDispatch google.golang.org/grpc/test
=== RUN   TestServerCredsDispatch
2017/01/26 17:28:34 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:35 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:37 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:40 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:44 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:28:51 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
2017/01/26 17:29:01 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp [::]:60765: connect: network is unreachable"; Reconnecting to {[::]:60765 <nil>}
....

@tamird
Copy link
Contributor Author

tamird commented Jan 26, 2017

@bradfitz note that your issue is not the one I reported here; perhaps it deserves its own issue.

@bradfitz
Copy link
Contributor

If anybody knew what the issue was, it'd probably be fixed. :)

@tamird
Copy link
Contributor Author

tamird commented Jan 26, 2017

To clarify, the issue I reported here produces returned errors of the form: "transport: write tcp 10.142.0.38:37617->10.142.0.44:26257: write: connection refused" (emphasis mine).

The issue you're reporting here is different, and results in "ambient" error logging of the form: "transport: dial tcp [::]:60765: connect: network is unreachable" (emphasis again mine).

@bradfitz
Copy link
Contributor

bradfitz commented Jan 26, 2017

I'll trust you if you think they're different. I haven't looked. I forked my bug report off into #1058.

@MakMukhi
Copy link
Contributor

@tamird Can you send some more logs around the error? For instance, if any other errors were seen (http2client.notifyError etc.)
Also, some context about the connecting client would be useful too;
For instance, what backoff strategy was used since I see here cockroachdb/cockroach#12713 that the reconnect happened every second which seems odd.
Also, if I understand it right: You have a client that is connected to a server but then the server goes away(abruptly ?), and the client here keeps trying to reconnect to it. The server later comes up but the client still won't connect?
An arrow in the dark: when the said server comes up, does the ip address change and if so does the balancer's name resolver get updated to map to the right address?

@bdarnell
Copy link
Contributor

This is happening right now on one of our test clusters. The IPs have not changed; one server crashed, was down for a time, and was restarted. The logs are full of this:

W170127 02:50:04.374599 96914562 storage/intent_resolver.go:352  [n1,s1,r64/7:/Table/51/1/14{12136…-29906…}]: failed to resolve intents: failed to send RPC: sending to all
 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused
"
W170127 02:50:04.798731 96914733 storage/intent_resolver.go:352  [n1,s1,r422/1:/Table/51/1/40{39115…-57292…}]: failed to resolve intents: failed to send RPC: sending to al
l 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refuse
d"
W170127 02:50:04.848520 96914767 storage/intent_resolver.go:352  [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to al
l 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refuse
d"
W170127 02:50:04.908616 96914878 storage/intent_resolver.go:352  [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to al
l 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refuse
d"
W170127 02:50:04.996322 96915001 storage/intent_resolver.go:352  [n1,s1,r428/3:/Table/51/1/83{72888…-90988…}]: failed to resolve intents: failed to send RPC: sending to al
l 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
W170127 02:50:05.004075 96914928 storage/intent_resolver.go:352  [n1,s1,r423/3:/Table/51/1/86{26570…-44684…}]: failed to resolve intents: failed to send RPC: sending to al
l 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refuse
d"
W170127 02:50:05.042729 96915124 storage/intent_resolver.go:352  [n1,s1,r45/5:/Table/51/1/13{41272…-58971…}]: failed to resolve intents: failed to send RPC: sending to all
 3 replicas failed; last error: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.4:33844->192.168.1.10:26257: write: connection refused
"

Note that the error is write: connection refused, so we're getting ECONNREFUSED from a Write call (which isn't really supposed to happen, but I can't tell whether it's explicitly forbidden). And the local port number (33844) is constant, so it's not trying to reconnect over and over. Instead, what appears to be happening is that because the "connection refused" error is happening in an unexpected context, grpc continues to retry on the failed connection instead of closing it so it can reopen a new one.

@MakMukhi
Copy link
Contributor

MakMukhi commented Jan 28, 2017 via email

@sailat
Copy link

sailat commented Feb 16, 2017

@tamird @bdarnell are you still seeing this issue?

@bdarnell
Copy link
Contributor

Yes, it still happens. Here's a recent log snippet:

[email protected]: I170219 02:03:25.681933 693 vendor/google.golang.org/grpc/clientconn.go:866  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
[email protected]: E170219 02:07:48.257687 12830805 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r130/59:/Table/51/1/756{11808…-23002…},@c4216af500] could not CollectChecksum from replica {6 6 61}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
[email protected]: E170219 02:11:12.643325 13043929 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r2104/25:/Table/51/1/775{32005…-43245…},@c4212ed180] could not CollectChecksum from replica {6 6 26}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"
[email protected]: E170219 02:20:16.612271 13671648 storage/replica_command.go:1851  [replica consistency checker,n7,s7,r558/46:/Table/51/1/474{18512…-29719…},@c421240000] could not CollectChecksum from replica {6 6 38}: rpc error: code = 13 desc = connection error: desc = "transport: write tcp 192.168.1.10:39162->192.168.1.9:26257: write: connection refused"

We haven't managed to reduce this to a simple reproduction, but this is on a test cluster where we periodically restart random nodes from cron (when this happens, the process is killed with -9, left down for 5 minutes, then restarted).

@bradfitz
Copy link
Contributor

@bdarnell, sorry for the threadjack, but does Cockroach use streaming RPCs? I ask because I also ran into some gRPC problems but while debugging I just decided to delete gRPC-go's http2 stack and retrofit gRPC-go atop Go's native http2 support. The retrofit worked for me, but I didn't need streaming RPCs so I didn't implement them yet. Similarly, h2c support is easy but not yet done, because I didn't need it. (I just needed to do simple RPCs out to Google services over https)

@bdarnell
Copy link
Contributor

Yes, we use streaming RPCs extensively (and we need h2c support as well)

@bradfitz
Copy link
Contributor

Okay. If you're interested, you can follow along at bradfitz#3 and bradfitz#4

@bdarnell
Copy link
Contributor

I think there are two sides to this problem, one in Go's net package and one in GRPC.

On the Go side, I hypothesized in golang/go#14548 that the net poller was subject to spurious wakeups, which could result in Dial returning with nil error before the connection was completed. On Darwin, premature writes to a socket that is not yet connected would fail, so a Darwin-specific fix was put in place to ensure that the connection was complete before Dial returns. On Linux, the same spurious wakeup scenario is not a problem for connections that succeed, but it does mean that connection-related errors like ECONNREFUSED can be observed from system calls like write, where they are not otherwise possible. I'll file a Go issue for further investigation about how this should be resolved, but for now let's look at how GRPC handles this error.

In GRPC, this shifts the handling of this error from dial, where every error is a temporary connectionError (unless FailOnNonTempDialError is set, which is not the case for us), to creds.ClientHandshake, which may return a permanent error according to the result of isTemporary. (ECONNREFUSED is not temporary). Elsewhere in grpc/transport, errors from Write always turn into temporary connectionErrors, regardless of what isTemporary says.

I propose that if creds.ClientHandshake returns a net.Error and FailOnNonTempDialError is false, the connectionError should be considered temporary (if there are other more precise ways to distinguish the kinds of certificate errors that indicate a fatal misconfiguration, that would be nice, but it's not clear to me where exactly to draw the line). We could even go so far as to say that the TLS handshake is logically a part of the dial process and only return permanent errors for certificate problems if FailOnNonTempDialError is true, although that may be taking things too far.

Finally in Cockroach, we should probably be detecting connections in the "shutdown" state and A) make loud noises about them and B) retry them with an appropriate backoff.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Feb 27, 2017
Works around grpc/grpc-go#1026, which results in connections getting
permanently wedged by "connection refused" errors being returned at
the wrong time.
@tamird
Copy link
Contributor Author

tamird commented Oct 22, 2017

I believe golang/go@bf0f692 fixed this for Go1.9.

@lock lock bot locked as resolved and limited conversation to collaborators Sep 26, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants