This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Agent stuck with log "resolver returned no addresses" #425

Closed
avdhoot opened this issue Sep 16, 2020 · 20 comments

@avdhoot

avdhoot commented Sep 16, 2020

Some of the kiam-agent pods do not return metadata and emit the logs below. We also monitored the agent's network activity: it never makes a new DNS query.

Agent logs:
{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = resolver returned no addresses","pod.ip":"100.121.119.112","time":"2020-09-16T09:51:07Z"}

To fix this we delete the kiam-agent pod.

@BartoszZawadzki

We are facing the same issue as above:

kiam-agent-2vb9x kiam-agent {"addr":"100.96.1.231:44236","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = resolver returned no addresses","path":"/latest/meta-data/iam/security-credentials/","status":500,"time":"2020-10-14T12:48:06Z"}

kiam v3.6
kubernetesVersion: 1.18.9
kops 1.18.1 (git-453d7d96be)

And yes, killing agent pods seemed to resolve the issue (for now)

@pingles
Contributor

pingles commented Oct 16, 2020

The error means that the agent pod's gRPC client is unable to find any available server pods to connect to; it uses gRPC's client-side load balancer. There are some gRPC environment variables you can set to make it report debugging information about what's going on, but the underlying problem is that there were no running server pods and it wasn't able to find them in time.
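
For example, setting something like the following in the kiam-agent container's environment should surface gRPC's internal logging (these are the same variables used later in this thread; the exact values are just a suggestion):

GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=8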

@dza89

dza89 commented Oct 20, 2020

We are facing this issue as well. All kiam agents became unresponsive until a restart.
kiam v3.6
kubernetesVersion: 1.17.eks.3

@pingles
That makes sense, but shouldn't it lead to an application exit if it's 'stuck'?

@pingles
Contributor

pingles commented Oct 20, 2020

It depends where it happens. Most operations are set to retry in the event of failures (assuming intermittent pod issues or DNS resolution problems, for example) until either a time limit is reached or the operation is cancelled.

Other issues have highlighted similar problems and you can enable some env vars to increase the verbosity of gRPC's logging. #217 (comment)

Could you try that and see what it shows, please?

@dagvl

dagvl commented Oct 20, 2020

I am also observing this problem. Here are some detailed logs:

Logs from kiam-agent before problem:

➜ k logs -f kiam-agent-5n76g
{"level":"info","msg":"configuring iptables","time":"2020-10-20T21:58:12Z"}
{"level":"info","msg":"started prometheus metric listener 0.0.0.0:9620","time":"2020-10-20T21:58:12Z"}
INFO: 2020/10/20 21:58:12 parsed scheme: "dns"
INFO: 2020/10/20 21:58:12 ccResolverWrapper: sending update to cc: {[{10.121.64.56:443  <nil> 0 <nil>} {10.121.10.175:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/20 21:58:12 base.baseBalancer: got new ClientConn state:  {{[{10.121.64.56:443  <nil> 0 <nil>} {10.121.10.175:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed90, CONNECTING
INFO: 2020/10/20 21:58:12 base.baseBalancer: handle SubConn state change: 0xc0000aed70, READY

As you can see, it has started and got 10.121.64.56 and 10.121.10.175 as the kiam-server pod IPs. This matches reality:

➜ kgp -o wide | grep kiam-server
kiam-server-569cbfc89f-pc6t6                                 1/1     Running     0          3m26s   10.121.64.56     ip-10-121-105-32.eu-north-1.compute.internal    <none>           <none>
kiam-server-569cbfc89f-s2n88                                 1/1     Running     0          3m26s   10.121.10.175    ip-10-121-17-163.eu-north-1.compute.internal    <none>           <none>

I then scale down kiam-server to 0, and get these logs from the same kiam-agent:

INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:59:50 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/20 21:59:50 roundrobinPicker: newPicker called with info: {map[0xc0000aed90:{{10.121.10.175:443  <nil> 0 <nil>}}]}
WARNING: 2020/10/20 21:59:50 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: connection refused". Reconnecting...
INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 21:59:50 ccResolverWrapper: sending update to cc: {[{10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/20 21:59:50 base.baseBalancer: got new ClientConn state:  {{[{10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/20 21:59:50 base.baseBalancer: handle SubConn state change: 0xc0000aed90, SHUTDOWN
INFO: 2020/10/20 21:59:50 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/20 21:59:51 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 21:59:51 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:01 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:01 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:03 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:03 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:06 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:06 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:08 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:08 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:09 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:09 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:13 base.baseBalancer: handle SubConn state change: 0xc0000aed70, CONNECTING
INFO: 2020/10/20 22:00:13 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/20 22:00:15 grpc: addrConn.createTransport failed to connect to {10.121.64.56:443  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host". Reconnecting...
INFO: 2020/10/20 22:00:15 base.baseBalancer: handle SubConn state change: 0xc0000aed70, TRANSIENT_FAILURE
INFO: 2020/10/20 22:00:20 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/20 22:00:20 base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/20 22:00:20 base.baseBalancer: handle SubConn state change: 0xc0000aed70, SHUTDOWN

After the SHUTDOWN message, no more detailed logs like this appear.

I then scale up kiam-server again; note the new pod IPs:

➜ kgp -o wide | grep kiam-server
kiam-server-569cbfc89f-bq7nx                                 1/1     Running     0          43s     10.121.40.62     ip-10-121-17-163.eu-north-1.compute.internal    <none>           <none>
kiam-server-569cbfc89f-qz2pl                                 1/1     Running     0          43s     10.121.103.122   ip-10-121-105-32.eu-north-1.compute.internal    <none>           <none>

Port-forwarding and hitting the health endpoint on the kiam-agent produces the following HTTP response and error in the log:

➜ curl -i "http://localhost:8181/health?deep=1"
HTTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Tue, 20 Oct 2020 22:01:11 GMT
Content-Length: 148

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host"
{"addr":"127.0.0.1:50808","level":"error","method":"GET","msg":"error processing request: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 10.121.64.56:443: connect: no route to host\"","path":"/health","status":500,"time":"2020-10-20T22:01:11Z"}
{"addr":"127.0.0.1:50808","duration":5000,"headers":{"Content-Type":["text/plain; charset=utf-8"],"X-Content-Type-Options":["nosniff"]},"level":"info","method":"GET","msg":"processed request","path":"/health","status":500,"time":"2020-10-20T22:01:11Z"}

Note that the pod ip in the error message is one of the original ips from before the kiam-server scale down.

For me it looks like gRPC stops receiving or polling for updated endpoint IPs after 30 seconds of failures.

versions:
k8s: v1.15.11-eks-065dce
kiam: quay.io/uswitch/kiam:v3.6

@pingles
Contributor

pingles commented Oct 21, 2020

Interesting, thanks for adding the detailed info.

@pingles
Contributor

pingles commented Oct 21, 2020

I'll try and take a look when I can but if anyone else can help I'm happy for us to take a look at PRs too.

pingles added the bug label Oct 21, 2020
pingles added this to the v4 milestone Oct 21, 2020
@dagvl

dagvl commented Oct 21, 2020

So I don't have a smoking gun, but I have some observations that might be helpful in zeroing in.

So kiam uses the dns resolver. In grpc-go < 1.28.0 it seems the resolver re-resolved DNS on connection errors and every 30 minutes (I don't have a hard reference for this, just a lot of snippets from discussions on different sites). When all of a connection's subconnections were in the SHUTDOWN state, there was never a state transition to trigger a re-resolution. This was a bug reported against 1.26 (grpc/grpc-go#3353).

This bug was never resolved per se, but in 1.28 (grpc/grpc-go#3165) they removed DNS polling altogether and also made the re-resolution on state transitions more robust.
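
For context, this is roughly what a client wired up this way looks like in Go: a dns:/// target plus the round_robin balancer, which is what produces the resolver and balancer log lines above. This is a minimal sketch only, not kiam's actual gateway code; the target name is a placeholder and TLS is omitted (kiam itself uses mutual TLS).

package gateway

import "google.golang.org/grpc"

// dialKiamServer sketches a client using gRPC's "dns" resolver and the
// round_robin balancer: the resolver turns the service name into a set of
// pod IPs and the balancer spreads RPCs across them.
func dialKiamServer() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///kiam-server:443", // placeholder target; resolved to pod IPs by the dns scheme
		grpc.WithInsecure(),      // placeholder; a real deployment would pass TLS credentials
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin": {}}]}`),
	)
}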

I tried updating this dependency locally (go get google.golang.org/[email protected]) and I can NOT consistently reproduce the issue any more. I've tried my reproduction routine about 5 times now and have only seen it once. The other 4 times the SHUTDOWN message never came and kiam-agent continued trying to connect to the old pod IPs.

I also tried with the newest grpc (1.33.1) and it seems to behave better there also. Here's an annotated example:

INFO: 2020/10/21 16:37:05 [core] parsed scheme: "dns"
INFO: 2020/10/21 16:37:05 [core] ccResolverWrapper: sending update to cc: {[{10.121.40.62:443  <nil> 0 <nil>} {10.121.96.28:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: got new ClientConn state:  {{[{10.121.40.62:443  <nil> 0 <nil>} {10.121.96.28:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
INFO: 2020/10/21 16:37:05 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:05 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:05 [core] CPU time info is unavailable on non-linux or appengine environment.
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, READY
INFO: 2020/10/21 16:37:05 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:37:05 [core] Channel Connectivity change to READY
INFO[0000] listening :8181
INFO: 2020/10/21 16:37:05 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:37:05 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, READY
INFO: 2020/10/21 16:37:05 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337530:{{10.121.40.62:443  <nil> 0 <nil>}} 0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:37:19 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:19 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
INFO: 2020/10/21 16:37:19 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:19 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc000337590:{{10.121.96.28:443  <nil> 0 <nil>}}]}

Here I remove the kiam server pods

INFO: 2020/10/21 16:37:19 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/21 16:37:19 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:19 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:19 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:20 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:20 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:20 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:23 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:23 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:23 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:24 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:24 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:24 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:26 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:26 [roundrobin] roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/21 16:37:26 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:26 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:26 [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/21 16:37:26 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:26 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:26 [core] Channel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:27 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:27 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:27 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:28 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:28 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:28 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:30 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:30 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:30 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
WARNING: 2020/10/21 16:37:30 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:30 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:30 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:31 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:31 [core] Subchannel picks a new address "10.121.96.28:443" to connect
INFO: 2020/10/21 16:37:31 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, CONNECTING
INFO: 2020/10/21 16:37:33 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:37:33 [core] Subchannel picks a new address "10.121.40.62:443" to connect
INFO: 2020/10/21 16:37:33 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, CONNECTING
WARNING: 2020/10/21 16:37:35 [core] grpc: addrConn.createTransport failed to connect to {10.121.40.62:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: EOF". Reconnecting...
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, TRANSIENT_FAILURE
INFO: 2020/10/21 16:37:35 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to SHUTDOWN
INFO: 2020/10/21 16:37:35 [core] Subchannel Connectivity change to SHUTDOWN
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337530, SHUTDOWN
INFO: 2020/10/21 16:37:35 [balancer] base.baseBalancer: handle SubConn state change: 0xc000337590, SHUTDOWN
WARNING: 2020/10/21 16:37:35 [core] grpc: addrConn.createTransport failed to connect to {10.121.96.28:443 kiam-server.ops <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...

Here I start the kiam-server pods again, at about 16:38:00

INFO: 2020/10/21 16:38:05 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:38:05 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:38:35 [core] ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/10/21 16:38:35 [balancer] base.baseBalancer: got new ClientConn state:  {{[] <nil> <nil>} <nil>}

At this point kiam-agent has been "down" for about 30-60 seconds after the kiam-server pods have started. Now kiam-agent's gRPC client seems to refresh from DNS:

INFO: 2020/10/21 16:39:06 [core] ccResolverWrapper: sending update to cc: {[{10.121.36.179:443  <nil> 0 <nil>} {10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: got new ClientConn state:  {{[{10.121.36.179:443  <nil> 0 <nil>} {10.121.64.56:443  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel picks a new address "10.121.64.56:443" to connect
INFO: 2020/10/21 16:39:06 [core] Subchannel picks a new address "10.121.36.179:443" to connect
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001170e0, CONNECTING
INFO: 2020/10/21 16:39:06 [core] Channel Connectivity change to CONNECTING
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001171d0, CONNECTING
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001171d0, READY
INFO: 2020/10/21 16:39:06 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0001171d0:{{10.121.64.56:443  <nil> 0 <nil>}}]}
INFO: 2020/10/21 16:39:06 [core] Channel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [core] Subchannel Connectivity change to READY
INFO: 2020/10/21 16:39:06 [balancer] base.baseBalancer: handle SubConn state change: 0xc0001170e0, READY
INFO: 2020/10/21 16:39:06 [roundrobin] roundrobinPicker: newPicker called with info: {map[0xc0001170e0:{{10.121.36.179:443  <nil> 0 <nil>}} 0xc0001171d0:{{10.121.64.56:443  <nil> 0 <nil>}}]}

In short, it seems that grpc 1.28 or newer at least doesn't fully hang when all kiam-servers are gone at the same time, though recovery does take some time.

@pingles
Contributor

pingles commented Oct 22, 2020 via email

pingles added a commit that referenced this issue Oct 22, 2020
* Created a Kiam Server integration test to make it easier to test
* Extracted server and gateway into builders, tidies complicated construction and makes it easier to test
* Move the cacheSize metric from a counter to a gauge and define in metrics.go and outside of DefaultCache func, removes duplicate panic in tests
* KiamGatewayBuilder adds WithMaxRetries to configure the gRPC retry behaviour (by default, doesn't retry operations) and potentially helps address #425
@pingles
Contributor

pingles commented Oct 22, 2020

I've managed to get a few refactorings done (work in progress in #432). I wanted to get to a place where it's easier to write some tests and prove the behaviour.

While doing that I've found that the retry behaviour in the gRPC interceptor used by the client gateway doesn't set a number of retries, so the underlying operations aren't retried. 4d7c245#diff-947f40a8d50a4ea75b9feffc9c0b859661bdf9372582240a4b2e4be20f144368R120 is the line in question.
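
To illustrate what "doesn't set a number of retries" means in practice, here's a rough sketch (not kiam's actual code) of attaching the go-grpc-middleware retry interceptor with a maximum retry count; without a WithMax-style option the interceptor defaults to performing no retries:

package gateway

import (
	"time"

	grpc_retry "github.com/grpc-ecosystem/go-grpc-middleware/retry"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// retryDialOption is a hypothetical helper showing how a maximum retry count
// could be wired into the client's unary interceptor.
func retryDialOption(maxRetries uint) grpc.DialOption {
	return grpc.WithUnaryInterceptor(grpc_retry.UnaryClientInterceptor(
		grpc_retry.WithMax(maxRetries), // e.g. 3; the default of 0 means calls are not retried
		grpc_retry.WithCodes(codes.Unavailable),
		grpc_retry.WithBackoff(grpc_retry.BackoffLinear(100*time.Millisecond)),
	))
}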

I don't think it's the same issue you've identified, but it may be connected.

I'm going to try and see if I can extend the tests to replicate the situation you've documented before upgrading the gRPC release, to confirm the upgrade fixes it.

Not finished but thought it was worth sharing progress and what I've learned this evening 😄

@pingles
Contributor

pingles commented Oct 22, 2020

Adding a comment here to help track my notes.

% GRPC_GO_LOG_SEVERITY_LEVEL=info GRPC_GO_LOG_VERBOSITY_LEVEL=8 go test github.com/uswitch/kiam/pkg/... -run TestRetriesUntilServerAvailable

Running the tests to try and mimic the behaviour (shutting down a server after making a request) shows the following:

time="2020-10-22T22:26:31+01:00" level=info msg="started cache controller"
time="2020-10-22T22:26:31+01:00" level=info msg="started namespace cache controller"
WARNING: 2020/10/22 22:26:31 Adjusting keepalive ping interval to minimum period of 10s
INFO: 2020/10/22 22:26:31 parsed scheme: "dns"
time="2020-10-22T22:26:31+01:00" level=info msg="started cache controller"
INFO: 2020/10/22 22:26:31 ccResolverWrapper: sending update to cc: {[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/22 22:26:31 base.baseBalancer: got new ClientConn state:  {{[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
INFO: 2020/10/22 22:26:31 CPU time info is unavailable on non-linux or appengine environment.
WARNING: 2020/10/22 22:26:31 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:31 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
time="2020-10-22T22:26:31+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg=listening
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, READY
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[0xc0003023b0:{{127.0.0.1:8899  <nil> 0 <nil>}}]}
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:26:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/22 22:26:32 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
time="2020-10-22T22:26:32+01:00" level=info msg="started cache controller"
WARNING: 2020/10/22 22:26:32 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc0003023b0, TRANSIENT_FAILURE
time="2020-10-22T22:26:32+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg="started cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg="started namespace cache controller"
time="2020-10-22T22:26:32+01:00" level=info msg=listening
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
INFO: 2020/10/22 22:26:32 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/10/22 22:26:32 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:32 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:33 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:26:33 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:26:33 base.baseBalancer: handle SubConn state change: 0xc0003023b0, READY
INFO: 2020/10/22 22:26:33 roundrobinPicker: newPicker called with info: {map[0xc0003023b0:{{127.0.0.1:8899  <nil> 0 <nil>}}]}
INFO: 2020/10/22 22:26:34 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:34 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:34 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:36 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:36 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:36 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:40 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:40 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:40 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:46 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:46 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:46 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:26:58 base.baseBalancer: handle SubConn state change: 0xc000302450, CONNECTING
WARNING: 2020/10/22 22:26:58 grpc: addrConn.createTransport failed to connect to {[::1]:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp [::1]:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:26:58 base.baseBalancer: handle SubConn state change: 0xc000302450, TRANSIENT_FAILURE
INFO: 2020/10/22 22:27:01 ccResolverWrapper: sending update to cc: {[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/10/22 22:27:01 base.baseBalancer: got new ClientConn state:  {{[{127.0.0.1:8899  <nil> 0 <nil>} {[::1]:8899  <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/10/22 22:27:02 base.baseBalancer: handle SubConn state change: 0xc0003023b0, CONNECTING
INFO: 2020/10/22 22:27:02 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/10/22 22:27:02 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/10/22 22:27:02 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
WARNING: 2020/10/22 22:27:02 grpc: addrConn.createTransport failed to connect to {127.0.0.1:8899  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8899: connect: connection refused". Reconnecting...
INFO: 2020/10/22 22:27:02 base.baseBalancer: handle SubConn state change: 0xc0003023b0, TRANSIENT_FAILURE
--- FAIL: TestRetriesUntilServerAvailable (30.62s)
    server_integration_test.go:85: didn't complete in 30 seconds

@pingles
Contributor

pingles commented Oct 22, 2020

I'm reading through the links you shared @dagvl (thanks again). This looks like a relevant one also: grpc/grpc#12295

It reads like there was a bug that was introduced, but there's also a recommendation to have the server set a maximum connection age; once connections are closed, clients re-resolve the server names. Maybe it's worth us adding some configuration to the server for that?

@pingles
Contributor

pingles commented Oct 22, 2020

pingles added a commit that referenced this issue Oct 23, 2020
* Improve region endpoint resolver
Pull out into separate type and add some more tests about handling of gov, fips and chinese regions.
Additionally removes comparison against SDK regions and instead relies on DNS resolution to verify.
* Removed most of the region config setup from sts.DefaultGateway into a configBuilder, added more tests around configBuilder to confirm behaviour
* Changed server to request server credentials with the server assume role after configuring for region, should address #368
* Regional endpoint adds a us-iso prefix to handle airgapped regions addressing #410
* Updated version of AWS SDK to 1.35
* Add server tests
* Refactorings and more tests
* Created a Kiam Server integration test to make it easier to test
* Extracted server and gateway into builders, tidies complicated construction and makes it easier to test
* Move the cacheSize metric from a counter to a gauge and define in metrics.go and outside of DefaultCache func, removes duplicate panic in tests
* KiamGatewayBuilder adds WithMaxRetries to configure the gRPC retry behaviour (by default, doesn't retry operations) and potentially helps address #425
@pingles
Contributor

pingles commented Nov 12, 2020

So I don't forget, I think the remaining work for this issue is to try and get a failing test written (server_integration_test.go was the starting point) with the current v1.27 gRPC release and see if it improves with v1.28. Additionally, the gRPC advice is to configure the keep-alive setting.

As described in pomerium/pomerium#509

As of v0.26, go-grpc no longer polls DNS regularly for endpoint updates. Discovery is only performed when a client detects a state change. This is triggered during error conditions, but routine scaling operations in an otherwise functioning environment may leave endpoints underutilized. To ensure we occasionally re-discover endpoints even when no connection errors have been encountered, the recommended pattern is to set server side options which gracefully terminate client connections at regular intervals.
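
As a sketch of that pattern in Go (durations are illustrative only, not kiam defaults), the server can be given a keepalive policy that gracefully recycles client connections:

package server

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server that closes client connections after a
// maximum age, forcing clients to re-resolve DNS and discover new
// kiam-server pods even when no errors occur.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      2 * time.Minute,  // recycle connections roughly every two minutes
		MaxConnectionAgeGrace: 30 * time.Second, // give in-flight RPCs time to finish
	}))
}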

@pingles
Contributor

pingles commented Nov 12, 2020

pingles mentioned this issue Dec 1, 2020
@QuantumBJump

As an addition to this (different circumstances but I'm pretty sure it's the same thing happening), my company has recently been doing some chaos testing in which we manually partition a given AZ (by adding DENY * to the NetACLs), and this same problem seems to occur for all kiam-agent pods on instances in the affected AZ. I'm assuming this is caused similarly by the lack of a SHUTDOWN message being received, and would be fixed by configuring the keep-alive?

@vishalmaheswari

I seem to get the same issue on the cluster. Redeploying agent.yaml fixes the issue.

{"level":"warning","msg":"error finding role for pod: rpc error: code = Unavailable desc = resolver returned no addresses","pod.ip":"10.202.24.186","time":"2020-12-16T04:07:49Z"}

Has anyone found a workaround for this one? Any recommendation is really appreciated.

@pingles
Contributor

pingles commented Dec 16, 2020

@QuinnStevens that's my expectation and the advice from the gRPC team in their project.

We're close to releasing v4 (I'm going to make #449 the only thing we do prior to release), which will go out in another beta for us to test against but should be production-ready. There are a few breaking changes in v4, noted in an upgrade doc, but if you're keen to give it a go, the current beta would be a good option. Indeed, it'd be great to see how it behaves in your chaos experiments.

@pingles
Contributor

pingles commented Dec 16, 2020

We've updated gRPC, and the changes in #433 should make it possible to mitigate the situation according to the advice from the gRPC team: setting a shorter connection age forces clients to re-resolve the servers more frequently.

I'm going to close this issue for now; people should follow that advice to manage it. If we see other problems we can re-open.

Thanks to everyone for contributing and helping us close this down. It was also reported in #217.

pingles closed this as completed Dec 16, 2020
@limianwang

limianwang commented Jun 22, 2021

We are currently running into this issue on v3.6. Just to confirm: it seems the issue has been resolved, but the fix is only in v4, is that correct?
