DNS Resolver can get stuck with no connections #1795

robotlovesyou · 2018-01-15T11:56:05Z

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.9.1

What version of Go are you using (`go version`)?

1.9

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I'm using a RoundRobin load balancer with the DNS resolver and MaxConnectionAge set as per the discussion in #1663. All is well unless for some reason ALL the available servers are stopped and the resolver gets "no such host" for the given name for a period of time. When that happens, the DNS resolver gets stuck: it has no connections closing due to MaxConnectionAge to prompt it to re-resolve, so it sits and waits for the entire 30-minute refresh period before attempting to resolve the name again.

A concrete example of this is updating the images on a Kubernetes deployment. The known connections are all closed before the Kubernetes DNS has been updated with the new pod IP addresses.

The upshot of this is that if for some reason all instances of a server need to be stopped at the same time then any client using them will also need to be restarted when they come back up, unless it's OK to wait for 30 minutes for the resolver to query the DNS again, which in most cases it won't be.

It's possible that there is some client-side setting I've missed which is the root cause of this. I've tried the client keepalive configuration, but that doesn't change anything, which makes sense because in this case there aren't any connections to keep alive. The backoff config already has defaults set, and the context in dial only applies when dial is initially called, as far as I can see.

I'm able to work around this in my case with the minReadySeconds configuration on the Kubernetes deployment, but I figured it was worth bringing to your attention.

menghanl · 2018-02-12T23:26:14Z

This is a valid concern, thanks for bringing this up.
And there's actually one more case where this could happen.

This happens when no retry is in progress in the ClientConn, which can be caused by either of the following:

1. Resolver returned an empty list
2. All addrConns stopped retrying because of non-temporary errors

For "Resolver returned an empty list": an empty list can be a valid address list, because the resolver may want to delete the previously returned list. To solve the re-resolve problem, there are two possible solutions:

1. The resolver keeps retrying by itself (This would be a cleaner solution because some resolvers don't care because they don't pull from the server)
2. ClientConn starts a goroutine to trigger it (This would end up with something like a retry with exponential backoff)

For "All addrConns stopped for non-temporary errors": I think the solution would be to keep retrying on those connections. (Filed #1856 for this)

robotlovesyou · 2018-02-21T15:30:21Z

Thanks

vadimi · 2018-06-21T13:59:19Z

I'm having exactly the same issue with the DNS resolver. If my VPN connectivity drops for some time, it takes about 30 minutes for my client to reconnect. I'm seeing this line in the log, and after that the ClientConn state changes to IDLE:

INFO: 2018/06/21 00:30:25 ccResolverWrapper: sending new addresses to cc: []

When I use the passthrough scheme, reconnects happen much faster.

Is there a plan to fix "1. The resolver keeps retry itself"?

dfawley added Type: Bug P2 labels Jan 16, 2018

dfawley assigned menghanl Jan 16, 2018

menghanl mentioned this issue Mar 26, 2018

addrConn/ClientConn cleanup #1742

Closed

6 tasks

menghanl changed the title ~~DNS Resolver with MaxConnectionAge can get stuck with no connections~~ DNS Resolver can get stuck with no connections Jul 2, 2018

menghanl mentioned this issue Jul 2, 2018

It is impossible to set a custom polling frequency in DNS resolver #1663

Closed

menghanl assigned lyuxuan and unassigned menghanl Jul 3, 2018

lyuxuan mentioned this issue Jul 3, 2018

resolver/dns: exponential retry when getting empty address list #2201

Merged

lyuxuan closed this as completed in #2201 Jul 13, 2018

lock bot locked as resolved and limited conversation to collaborators Jan 9, 2019
