Agent stuck with log "resolver returned no addresses" #425
We are facing the same issue as above:
kiam v3.6. And yes, killing the agent pods seemed to resolve the issue (for now). |
The error means that the agent pod's gRPC client is unable to find any available server pods to connect to; it uses gRPC's client-side load balancer. There are some gRPC environment variables you can add to make it report debugging information about what's going on, but the underlying problem is that there were no running server pods and the client wasn't able to find them in time. |
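For anyone unfamiliar with that setup, here is a minimal, illustrative Go sketch (not kiam's actual code) of a client that uses gRPC's DNS resolver and client-side load balancing; the target name, port and options are placeholders. When the resolver returns an empty address list, RPCs fail with exactly the "resolver returned no addresses" error above until a later re-resolution succeeds.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// "dns:///" asks gRPC to resolve the name itself and spread calls across
	// the returned addresses. If the resolver returns an empty list, RPCs fail
	// with "resolver returned no addresses" until a re-resolution succeeds.
	conn, err := grpc.DialContext(ctx, "dns:///kiam-server:443",
		grpc.WithInsecure(), // kiam itself uses TLS; skipped here for brevity
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithBlock(), // surface resolution/connection problems at dial time
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```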
We are facing this issue as well. All kiam agents became unresponsive until a restart. @pingles |
It depends where it happens. Most operations are set to retry in the event of failures (intermittent pod issues or DNS resolution, for example) up to a period of time, after which the operation is cancelled. Other issues have highlighted similar problems, and you can enable some env vars to increase the verbosity of gRPC's logging: #217 (comment). Could you try that and see what it shows, please? |
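grpc-go's standard debug logging variables are GRPC_GO_LOG_SEVERITY_LEVEL and GRPC_GO_LOG_VERBOSITY_LEVEL (whether these are exactly the variables referenced in #217 is an assumption). A sketch of adding them to the agent; the container/env layout shown is illustrative, adjust to your own manifests:

```yaml
# Sketch: enable verbose grpc-go logging on the kiam-agent container.
env:
  - name: GRPC_GO_LOG_SEVERITY_LEVEL
    value: "info"
  - name: GRPC_GO_LOG_VERBOSITY_LEVEL
    value: "99"
```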
I am also observing this problem. Here are some detailed logs. Logs from kiam-agent before the problem:
As you can see, it has started and picked up 10.121.64.56 and 10.121.10.175 as the kiam-server pod IPs. This matches reality:
I then scale down kiam-server to 0, and get these logs from the same kiam-agent:
After the SHUTDOWN message there are no more detailed logs like this. I then scale kiam-server up again; note the new pod IPs:
Port-forwarding and hitting the health endpoint on the kiam-agent produces the following HTTP response and error in the log:
Note that the pod IP in the error message is one of the original IPs from before the kiam-server scale-down. To me it looks like gRPC stops receiving or polling for updated endpoint IPs after 30 seconds of failures. Versions: |
Interesting, thanks for adding the detailed info. |
I'll try and take a look when I can but if anyone else can help I'm happy for us to take a look at PRs too. |
So I don't have a smoking gun, but I have some observations that might help zero in on it. kiam uses the DNS resolver. In grpc < 1.28.0 the behaviour seems to have been that it re-resolved DNS on a connection error and every 30 minutes (I don't have a hard reference for this, just a lot of snippets from different discussions on different sites). In the case where all of a connection's subconnections were in the SHUTDOWN state, no state transition was ever triggered to cause a re-resolution. This was a bug reported against 1.26 (grpc/grpc-go#3353). The bug was never resolved per se, but in 1.28 (grpc/grpc-go#3165) they removed DNS polling altogether and also made the re-resolution on transitions more robust. I tried updating this dependency locally (go get google.golang.org/[email protected]) and I can NOT consistently reproduce the same issue now. I've tried my reproduction routine about 5 times now, and I've only seen it once. The other 4 times the SHUTDOWN message never came and kiam-agent continued trying to connect to the old pod IPs. I also tried with the newest grpc (1.33.1) and it seems to behave better there too. Here's an annotated example:
Here I remove the kiam server pods
Here I start the kiam-server pods again, at about 16:38:00
At this point kiam-agent has been "down" for about 30-60 seconds after the kiam-server pods have started. Now kiam-agent's gRPC seems to refresh from DNS
In short, it seems that grpc 1.28 or newer at least doesn't fully hang when all kiam-servers are gone at the same time, though recovery does take some time. |
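For context, the reproduction routine referred to above boils down to roughly the following; the namespace, deployment name, replica count and wait time are assumptions about the test cluster, not commands taken from the thread:

```sh
kubectl -n kube-system scale deployment kiam-server --replicas=0   # take every server away
sleep 90                                                           # let the agent's failures accumulate
kubectl -n kube-system scale deployment kiam-server --replicas=2   # bring servers back with new pod IPs
# then port-forward to a kiam-agent and hit its health endpoint to see whether it recovered
```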
Fantastic! I had started looking at refactoring some of the server code so it was easier to test, and then putting in a test for this situation to reliably reproduce it. I'll try that and see if it does, and then look into using the updated grpc lib. Thanks for helping to figure it out!
|
* Created a Kiam Server integration test to make it easier to test
* Extracted server and gateway into builders; tidies complicated construction and makes it easier to test
* Moved the cacheSize metric from a counter to a gauge and defined it in metrics.go outside of the DefaultCache func; removes a duplicate panic in tests
* KiamGatewayBuilder adds WithMaxRetries to configure the gRPC retry behaviour (by default it doesn't retry operations) and potentially helps address #425
I've managed to get a few refactorings done (work in progress in #432). I wanted to get to a place where it's easier to write some tests and prove the behaviour. While doing that I've found that the retry behaviour in the gRPC interceptor used by the client gateway doesn't set a number of retries, so the underlying operations aren't retried. 4d7c245#diff-947f40a8d50a4ea75b9feffc9c0b859661bdf9372582240a4b2e4be20f144368R120 is the key line in question. I don't think it's the same issue you've identified, but it may be connected. I'm going to try to extend the tests to replicate the same situation you've documented before upgrading the gRPC release to confirm the fix. Not finished, but I thought it was worth sharing progress and what I've learned this evening 😄 |
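As a rough illustration of the knob a WithMaxRetries option exposes, here is a hedged sketch using the grpc_retry client interceptor; this is not kiam's exact gateway code, and the target address and values are placeholders. With this interceptor, a maximum of 0 retries (its default) means failed calls are not retried at all, which matches the behaviour described above.

```go
package main

import (
	"log"
	"time"

	grpc_retry "github.com/grpc-ecosystem/go-grpc-middleware/retry"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

func main() {
	// Allow up to 3 retries on Unavailable errors, with a linear backoff.
	retryOpts := []grpc_retry.CallOption{
		grpc_retry.WithMax(3),
		grpc_retry.WithBackoff(grpc_retry.BackoffLinear(100 * time.Millisecond)),
		grpc_retry.WithCodes(codes.Unavailable),
	}

	conn, err := grpc.Dial("dns:///kiam-server:443",
		grpc.WithInsecure(), // placeholder; real deployments use TLS credentials
		grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(retryOpts...)),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
}
```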
Adding a comment here to help track my notes.
Running the tests to try and mimic the behaviour (shutting down a server after making a request) shows the following:
|
I'm reading through the links you shared @dagvl (thanks again). This looks like a relevant one too: grpc/grpc#12295. It reads like there was a bug that was introduced, but there's also a recommendation to change the server to set a maximum connection age; once connections are closed, the names are updated. Maybe it's worth us adding some configuration to the server for that? |
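A hedged sketch of that recommendation on a plain gRPC server, using the keepalive server parameters (illustrative values only, not kiam's defaults): capping the connection age forces clients to reconnect periodically, which in turn triggers a fresh DNS resolution.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":443")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,  // close connections after this age...
		MaxConnectionAgeGrace: 30 * time.Second, // ...allowing in-flight RPCs this long to finish
	}))

	// Register services here, then serve.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve: %v", err)
	}
}
```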
* Improve region endpoint resolver: pull it out into a separate type and add more tests around the handling of gov, FIPS and Chinese regions. Additionally removes the comparison against SDK regions and instead relies on DNS resolution to verify
* Removed most of the region config setup from sts.DefaultGateway into a configBuilder; added more tests around configBuilder to confirm behaviour
* Changed the server to request server credentials with the server assume role after configuring for region; should address #368
* Regional endpoint adds a us-iso prefix to handle airgapped regions, addressing #410
* Updated version of the AWS SDK to 1.35
* Add server tests
* Refactorings and more tests
* Created a Kiam Server integration test to make it easier to test
* Extracted server and gateway into builders; tidies complicated construction and makes it easier to test
* Moved the cacheSize metric from a counter to a gauge and defined it in metrics.go outside of the DefaultCache func; removes a duplicate panic in tests
* KiamGatewayBuilder adds WithMaxRetries to configure the gRPC retry behaviour (by default it doesn't retry operations) and potentially helps address #425
So I don't forget, I think the remaining work for this issue is to try and get a failing test written (as described in pomerium/pomerium#509).
|
As an addition to this (different circumstances but I'm pretty sure it's the same thing happening), my company has recently been doing some chaos testing in which we manually partition a given AZ (by adding |
I seem to get the same issue on the cluster. Redeploying the agent.yaml fixes the issue.
Has anyone found a workaround for this one? Any recommendation is really appreciated. |
@QuinnStevens that's my expectation and the advice from the gRPC team in their project. We're close to releasing v4 (I'm going to make #449 the only thing we do prior to release), which will go out in another beta for us to test against, but it should be production-ready. There are a few breaking changes in v4 noted in an upgrade doc, but if you're keen to give it a go, the current beta would be a good option. Indeed, it'd be great to see how it behaves in your chaos experiments. |
We've updated gRPC, and the changes in #433 should make it possible to mitigate the situation, following the advice from the gRPC team: keeping the connection age short forces clients to re-poll for servers frequently. I'm going to close this issue for now; people should follow that advice to manage it. If we see other problems we can re-open. Thanks to everyone for contributing and helping us close this down. It was also reported in #217. |
We are currently running into this issue on v3.6. Just to confirm: it seems that the issue has been resolved, but only patched in v4, is that correct? |
Some of the kiam-agent pods do not return metadata and emit the logs below. We also monitored the network activity of the agent; it never makes a new DNS query.
Agent logs:
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = resolver returned no addresses","pod.ip":"100.121.119.112","time":"2020-09-16T09:51:07Z"}
To fix this we delete the kiam-agent pod.