DNS Resolver can get stuck with no connections #1795

Closed
robotlovesyou opened this issue Jan 15, 2018 · 3 comments · Fixed by #2201

@robotlovesyou

robotlovesyou commented Jan 15, 2018

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.9.1

What version of Go are you using (go version)?

1.9

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I'm using a RoundRobin load balancer with the DNS resolver and MaxConnectionAge set as per the discussion in #1663. All is well and good unless ALL the available servers are stopped and the resolver gets "no such host" for the given name for a period of time. When that happens, the DNS resolver gets stuck: it has no connections closing due to MaxConnectionAge to prompt it to re-resolve, so it sits and waits for the entire 30 minute refresh interval before attempting to resolve the name again.

A concrete example of this is updating the images on a Kubernetes deployment. The known connections are all closed before the Kubernetes DNS has been updated with the new pod IP addresses.

The upshot is that if all instances of a server need to be stopped at the same time, any client using them will also need to be restarted when they come back up, unless it's OK to wait 30 minutes for the resolver to query DNS again, which in most cases it won't be.

It's possible that there is some client-side setting I've missed which is the root cause of this. I've tried the client keepalive configuration, but that doesn't change anything, which makes sense because in this case there aren't any connections to keep alive. The backoff config already has defaults set, and as far as I can see the context passed to dial only applies to the initial dial call.

I'm able to work around this in my case with the minReadySeconds configuration on the Kubernetes deployment, but I figured it was worth bringing to your attention.

@menghanl
Contributor

This is a valid concern, thanks for bringing this up.
And there's actually one more case where this could happen.

This happens when no retry is happening in the ClientConn, which could be caused by one of the following:

  1. Resolver returned an empty list
  2. All addrConns stopped retrying because of non-temporary errors

For "Resolver returned an empty list": an empty list can be a valid address list, because the resolver may want to delete the previously returned list. To solve the re-resolve problem, there are two possible solutions:

  1. The resolver keeps retrying itself (this would be the cleaner solution, because resolvers that don't pull from a server don't need it)
  2. ClientConn starts a goroutine to trigger re-resolution (this would end up as something like a retry with exponential backoff)

For "All addrConns stopped retrying because of non-temporary errors": I think the solution would be to keep retrying on those connections. (Filed #1856 for this)

@robotlovesyou
Author

Thanks

@vadimi

vadimi commented Jun 21, 2018

I'm having exactly the same issue with the DNS resolver. If my VPN connectivity drops for some time, it takes about 30 minutes for my client to reconnect. I'm seeing this line in the log, and after that the ClientConn state changes to IDLE:

INFO: 2018/06/21 00:30:25 ccResolverWrapper: sending new addresses to cc: []

When I use the passthrough scheme, reconnects happen much faster.

Is there a plan to implement option 1, where the resolver keeps retrying itself?

menghanl changed the title from "DNS Resolver with MaxConnectionAge can get stuck with no connections" to "DNS Resolver can get stuck with no connections" on Jul 2, 2018
menghanl assigned lyuxuan and unassigned menghanl on Jul 3, 2018
lock bot locked as resolved and limited conversation to collaborators on Jan 9, 2019