DNS Resolver can get stuck with no connections #1795
This is a valid concern, thanks for bringing this up. This happens when there's no retry happening in the ClientConn, which could be caused by one of the following reasons:
For "Resolver returned an empty list". It could be a valid address list because the resolver may want to delete the previous returned list. To solve the re-resolve problem, there are two possible solutions:
For "All addrConns stopped for non-temporary errors". I think the solution would be to keep retrying on those connections. (Filed #1856 for this) |
Thanks
I'm having exactly the same issue with the DNS resolver. If my VPN connectivity drops for some time, it takes about 30 minutes for my client to reconnect. I'm seeing this line in the log, and after that the ClientConn state changes to
Is there a plan to fix "1. The resolver keeps retry itself"?
Please answer these questions before submitting your issue.
What version of gRPC are you using?
1.9.1
What version of Go are you using (`go version`)?
1.9
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
I'm using a RoundRobin load balancer with the DNS resolver and MaxConnectionAge set as per the discussion in #1663. All is well and good unless, for some reason, ALL the available servers are stopped and the resolver gets "no such host" for the given name for a period of time. When that happens, the DNS resolver gets stuck: there are no connections closing due to MaxConnectionAge to prompt it to re-resolve, so it will sit and wait for the entire 30-minute refresh frequency before attempting to resolve the names again.
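For reference, here is a minimal sketch of the setup described above, assuming a grpc-go version where the `dns` resolver scheme, `grpc.WithBalancerName`, and the keepalive server options are all available; the target address and the 30-second MaxConnectionAge are placeholders, not values taken from this issue:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Server side: recycle connections periodically so clients are forced to
	// reconnect and re-resolve (the 30s value is a placeholder).
	_ = grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 30 * time.Second,
	}))

	// Client side: DNS resolver target plus round-robin balancing over the
	// resolved addresses. The target is a placeholder.
	conn, err := grpc.Dial(
		"dns:///my-service.default.svc.cluster.local:50051",
		grpc.WithInsecure(),
		grpc.WithBalancerName("round_robin"), // assumes a grpc-go version exposing WithBalancerName
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```

With MaxConnectionAge set, each connection is eventually closed by the server, which normally prompts the client to reconnect and re-resolve; the problem described here is that once no connections exist at all, nothing triggers that re-resolution.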
A concrete example of this is updating the images on a Kubernetes deployment: the known connections are all closed before the Kubernetes DNS has been updated with the new pod IP addresses.
The upshot is that if for some reason all instances of a server need to be stopped at the same time, then any client using them will also need to be restarted when they come back up, unless it's acceptable to wait 30 minutes for the resolver to query DNS again, which in most cases it won't be.
It's possible that there is some client-side setting I've missed which is the root cause of this. I've tried the client keepalive configuration, but that doesn't change anything, which makes sense because in this case there aren't any connections to keep alive. The backoff config already has defaults set, and as far as I can see the context passed to Dial only applies when Dial is initially called.
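For completeness, the kind of client keepalive configuration I tried looks roughly like the sketch below. This is only illustrative, the durations are arbitrary placeholders, and as noted above it can't help here because keepalive pings require an existing connection:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Keepalive pings only help when a connection already exists; in the
	// scenario above there is nothing to ping, so this has no effect.
	conn, err := grpc.Dial(
		"dns:///my-service.default.svc.cluster.local:50051", // placeholder target
		grpc.WithInsecure(),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                30 * time.Second, // ping after this much idle time
			Timeout:             5 * time.Second,  // give up if the ping is not acked in time
			PermitWithoutStream: true,             // ping even when there are no active RPCs
		}),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}
```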
I'm able to work around this in my case with the minReadySeconds configuration on the Kubernetes deployment, but I figured it was worth bringing to your attention.