
envoy pods permanently unready #3192

Closed
derekperkins opened this issue Dec 9, 2020 · 49 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor.

Comments

@derekperkins

What steps did you take and what happened:

Individual pods in the envoy daemonset start failing readiness checks for no apparent reason and never recover. I'm running a pretty vanilla contour 1.10 setup from the example folder. The only extra flag I added was --root-namespaces=projectcontour. Deleting the pod seems to resolve the problem. I'm running envoy as a daemonset across 36 nodes, and it seems like it takes about 48 hours until another pod becomes unresponsive.

Events:
  Type     Reason     Age                    From     Message
  ----     ------     ----                   ----     -------
  Warning  Unhealthy  21s (x31456 over 35h)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503

What did you expect to happen:

I know that there isn't a livenessProbe configured by default, but I would expect that somewhere before 31k failed readinessProbes that the pod would restart or otherwise heal itself.
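For reference, a probe along these lines on the envoy container would give the kubelet a way to restart a stuck pod (purely illustrative: path and port are assumed to match the existing readiness probe, this is not part of the upstream example YAML, and restarting an Envoy that is deliberately draining may be undesirable):

```yaml
# Hypothetical livenessProbe for the envoy container in the daemonset;
# not part of the upstream example manifests.
livenessProbe:
  httpGet:
    path: /ready
    port: 8002
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 6   # roughly a minute of sustained failures before restart
```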

Anything else you would like to add:

Here are the only warning logs I see from Envoy; there are no errors, and this happens occasionally without halting execution.

(actual requests that are routed)
...
[2020-12-09 16:28:08.530][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamEndpoints gRPC config stream closed: 13, 
[2020-12-09 16:28:09.670][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 13, 
[2020-12-09 16:28:09.670][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamListeners gRPC config stream closed: 13, 
[2020-12-09 16:28:09.670][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 13, 
[2020-12-09 16:28:09.670][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamRoutes gRPC config stream closed: 13, 
[2020-12-09 16:28:09.670][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamRoutes gRPC config stream closed: 13, 
[2020-12-09 16:28:21.112][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamRoutes gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:21.697][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamListeners gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:21.697][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamSecrets gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:21.697][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamClusters gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:21.697][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamRoutes gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:21.697][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamEndpoints gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: local reset
[2020-12-09 16:28:24.582][1][info][upstream] [source/common/upstream/cds_api_impl.cc:64] cds: add 1 cluster(s), remove 2 cluster(s)
...
(actual requests that are routed)

Environment:

  • Contour version: 1.10.0 (using docker.io/projectcontour/contour:v1.10.0, not docker.io/projectcontour/contour:main)
  • Kubernetes version: (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T18:49:28Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.14-gke.400", GitCommit:"b00b302d1576bfe28ae50661585cc516fda2227e", GitTreeState:"clean", BuildDate:"2020-11-19T09:22:49Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes installer & version: GKE
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): cos-containerd
$ kubectl -n projectcontour describe po envoy-zmhqt
Name:         envoy-zmhqt
Namespace:    projectcontour
Priority:     0
Node:         gke-my-node/10.10.10.10
Start Time:   Mon, 07 Dec 2020 15:14:54 -0700
Labels:       app=envoy
              controller-revision-hash=66ddb6b8dc
              pod-template-generation=1
Annotations:  prometheus.io/path: /stats/prometheus
              prometheus.io/port: 8002
              prometheus.io/scrape: true
Status:       Running
IP:           10.48.73.200
IPs:
  IP:           10.48.73.200
Controlled By:  DaemonSet/envoy
Init Containers:
  envoy-initconfig:
    Container ID:  containerd://a3fd5a562750d47589737a004d588b2e52fca91fb2ebb044e897823575b74230
    Image:         docker.io/projectcontour/contour:v1.10.0
    Image ID:      docker.io/projectcontour/contour@sha256:ab519ad7c8e8caeb263469ed2141e9833098ff407a49fb4e5c43732ac7062fdf
    Port:          <none>
    Host Port:     <none>
    Command:
      contour
    Args:
      bootstrap
      /config/envoy.json
      --xds-address=contour
      --xds-port=8001
      --xds-resource-version=v3
      --resources-dir=/config/resources
      --envoy-cafile=/certs/ca.crt
      --envoy-cert-file=/certs/tls.crt
      --envoy-key-file=/certs/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 07 Dec 2020 15:14:55 -0700
      Finished:     Mon, 07 Dec 2020 15:14:56 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
    Mounts:
      /certs from envoycert (ro)
      /config from envoy-config (rw)
Containers:
  shutdown-manager:
    Container ID:  containerd://035aafca8974d810e9e0a515ce2f9c2fbd013503a12d3306549e67f8d60c12ab
    Image:         docker.io/projectcontour/contour:v1.10.0
    Image ID:      docker.io/projectcontour/contour@sha256:ab519ad7c8e8caeb263469ed2141e9833098ff407a49fb4e5c43732ac7062fdf
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/contour
    Args:
      envoy
      shutdown-manager
    State:          Running
      Started:      Wed, 09 Dec 2020 09:35:29 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 09 Dec 2020 09:33:42 -0700
      Finished:     Wed, 09 Dec 2020 09:35:29 -0700
    Ready:          True
    Restart Count:  7
    Liveness:       http-get http://:8090/healthz delay=3s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:         <none>
  envoy:
    Container ID:  containerd://283e3eb92444a6d382e0a292b4d57d88719c48936781809dd70e2adc7a3fdf84
    Image:         docker.io/envoyproxy/envoy:v1.16.0
    Image ID:      docker.io/envoyproxy/envoy@sha256:9e72bbba48041223ccf79ba81754b1bd84a67c6a1db8a9dbff77ea6fc1cb04ea
    Ports:         80/TCP, 443/TCP
    Host Ports:    80/TCP, 443/TCP
    Command:
      envoy
    Args:
      -c
      /config/envoy.json
      --service-cluster $(CONTOUR_NAMESPACE)
      --service-node $(ENVOY_POD_NAME)
      --log-level info
    State:          Running
      Started:      Mon, 07 Dec 2020 15:14:58 -0700
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://:8002/ready delay=3s timeout=1s period=4s #success=1 #failure=3
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
      ENVOY_POD_NAME:     envoy-zmhqt (v1:metadata.name)
    Mounts:
      /certs from envoycert (rw)
      /config from envoy-config (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  envoy-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  envoycert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  envoycert
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
@derekperkins derekperkins added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Dec 9, 2020
@stevesloka
Member

Are your Contour pod(s) healthy? Are there any logs in there? The readiness probe is looking at a built-in Envoy status endpoint, so it seems like either they've lost connection to Contour somehow or there's some sort of configuration they've encountered that caused them to hit an error condition.

@derekperkins
Author

The contour pods are healthy and have been for weeks since I upgraded to 1.10. My logs have this line repeated thousands of times, with no other logs interspersed.

time="2020-12-09T08:22:47Z" level=info msg="handling v3 xDS resource request" connection=29 context=xds node_id=envoy-zmhqt node_version=v1.16.0 resource_names="[workspaces--api/api/rest]" response_nonce=68298 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=68298

The HTTPProxy referenced in that log line is a simple delegated proxy handling /, and it serves content correctly. Here is my proxy/cert configuration.

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: contour.mydomain.com
  namespace: projectcontour
spec:
  virtualhost:
    fqdn: contour.mydomain.com
    tls:
      secretName: contour.mydomain.com
    corsPolicy:
        allowCredentials: true
        allowOrigin:
          - "*" # allows any origin
        allowMethods:
          - GET
          - POST
          - PUT
          - PATCH
          - DELETE
          - OPTIONS
        allowHeaders:
          - authorization
          - accept
          - content-type
          - accept-encoding
        exposeHeaders:
          - Content-Length
          - Content-Range
        maxAge: "60m"
  includes:
  # Includes the /service2 path from service2 in the same namespace
  - name: api
    namespace: workspaces--api
    conditions:
    - prefix: /

---

apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: api
  namespace: workspaces--api
spec:
  routes:
    - services:
        - name: api
          port: 8080

---

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: contour.mydomain.com
  namespace: projectcontour
spec:
  # Secret names are always required.
  secretName: contour.mydomain.com
  duration: 2160h # 90d
  renewBefore: 360h # 15d
  subject:
    organizations:
    - nozzle
  # At least one of a DNS Name, URI, or IP address is required.
  dnsNames:
  - contour.mydomain.com
  # Issuer references are always required.
  issuerRef:
    name: letsencrypt-prod
    # We can reference ClusterIssuers by changing the kind here.
    # The default value is Issuer (i.e. a locally namespaced Issuer)
    kind: ClusterIssuer
    # This is optional since cert-manager will default to this value however
    # if you are using an external issuer, change this to that issuer group.
    group: cert-manager.io

@stevesloka
Member

Could you port-forward to one of the unhealthy Envoys on port 9001 the next time this happens, then look at the configuration page? Typically there's information near the bottom about errors and the like if the config is somehow invalid.

I'm also curious what status code Envoy's /ready endpoint on port 8002 returns.

@derekperkins
Author

Interesting, http://localhost:9001/ready is returning a 503 with body DRAINING, but I can't see any other indicators of errors. I'm not sure what would be causing the draining state. There are ~3 dozen other pods on the same node running cleanly without being newly scheduled, so I don't think there was any sort of cluster / node health event.

I dumped the /server_info and /clusters endpoints into a gist if that is useful at all.
https://gist.github.com/derekperkins/ba673c6b95802b0086242da23dfb3585

I thought I had posted this before, and it was probably inferred, but Contour does mark my HTTPProxy config as valid.

$ kubectl get httpproxy -A -o wide
NAMESPACE         NAME                   FQDN                   TLS SECRET             STATUS   STATUS DESCRIPTION
projectcontour    contour.mydomain.com   contour.mydomain.com   contour.mydomain.com   valid    Valid HTTPProxy
workspaces--api   api                                                                  valid    Valid HTTPProxy

@stevesloka
Member

Interesting, draining would tell me that the pod is getting shut down. What is the state of the pod when this happens? 1/2 ready?

Could you share the output of the /config_dump redacting any bits that may be private to your environment. At the bottom of that page there's typically good error information if something isn't right which might help out.

@derekperkins
Author

I upgraded to Contour 1.11.0 and Envoy 1.16.2 wondering if that would solve anything, but after a couple days, I'm back to a draining envoy pod. Yes, 1/2 pods are showing ready.

Here's the config_dump data.
https://gist.github.com/derekperkins/33ff2c1bac01879490a081c63feed54a

@youngnick
Member

youngnick commented Dec 23, 2020

So, the DRAINING state should only be caused by the pod attempting to shut down - there's a container that runs a Contour shutdown manager for Envoy, and a preStop hook that tells Envoy to go into draining mode.

A few questions for you:

  • Do your nodes get rotated regularly, or is there some other reason why the kubelet might try to shut down the Envoy pods?
  • Did you start with the 1.10 install, or did you start with an earlier install and upgrade? I ask because we recently made some changes to how the shutdown manager works, and if the Envoy daemonset YAML hasn't been updated, they might not be working properly. That could explain why the preStopHook fired, and put the pod into DRAINING, but I thought there was a timeout where the pod would be forcibly killed if the hook did not complete. @stevesloka, any thoughts on something like this?

(I should note that most of the maintainers are on leave at the moment, so we may be a little slow in getting back to you on this over the holiday season.)

@derekperkins
Author

  • We don't rotate all that often. We're running on GKE with auto-repair on, but we manually trigger node upgrades. The current node with the draining envoy pod has been running for 6 weeks. I see a couple restarts on other daemonsets on the node, but nothing that would indicate that anything has happened to the node, and there aren't any events on the node (not sure what the lifetime of those events is)
  • We may have started with 1.9, I'm not 100% sure. I do know that the envoy daemonset has been deployed with the latest 1.11 yaml. All pods show creation dates of 3 days ago when I upgraded.

preStopHooks are the worst to debug since the logs just disappear into nothing.

No worries on the timeline. This isn't an urgent issue, but one that seems worth resolving.

@youngnick
Member

Checking back in on this since I'm back from leave - thanks for putting this one into our "Needs Investigation" queue @skriss.

Looks to me like we need to figure out why the Envoy is going into DRAINING. We will probably need to go through the Envoy logs for this one at some point, but we may be able to get something out of the shutdown manager (I can't remember how much logging is available there). That would let us find out whether the shutdown manager is being called to put Envoy into DRAINING or if it's something else.

@derekperkins
Author

I don't see any logs coming out of the shutdown manager. I added the envoy logs to my gist, they're pretty small
https://gist.github.com/derekperkins/ba673c6b95802b0086242da23dfb3585#file-envoy-logs

It may or may not be relevant, but we're not running any traffic through contour yet, so the requests that are in the logs are just random internet spam. Maybe the containers aren't responding well to minimal traffic?

@McShauno

I'm running into the same issue. We're running Contour on 40 nodes, and just a couple of them go into this same state, with readiness probes failing with a 503 response.

Let me know if I can help in any way.

@derekperkins
Author

@McShauno Can you post your infrastructure / version details?

@youngnick
Member

Thanks for the logs @derekperkins, it does seem really weird that the Envoys are flipping into Draining, and it looks like in the logs you posted, it's only a few minutes after starting that something changes (when the config streams are cut).

I'm wondering if something could be causing the gRPC connection to drop, and that's forcing the drain from the Envoy side?

In any case, I just noticed that the Contour shutdown manager had 7 restarts, and had exited with an error. Given that's what triggers Envoy to flip into draining, it's probably worth investigating more there. If nothing else, we can put debug logging into our next release, so you can try and get some shutdown-manager logs.

It may or may not be relevant, but we're not running any traffic through contour yet, so the requests that are in the logs are just random internet spam. Maybe the containers aren't responding well to minimal traffic?

It's very possible, yes. I've seen things before where no traffic and no changes meant no xDS traffic, so connections would drop. Some time ago, we added gRPC keepalives for the xDS connection, so it's probably not that, but it may be related to something similar.

@derekperkins
Author

We rolled contour into production yesterday after some final debugging and re-rolling envoy a few times, so all pods are currently healthy. We'll see if having some traffic impacts pods getting into this draining state. We also implemented external auth, so that may cause different behavior as well.

Happy to roll out shutdown-manager debug or whatever else makes sense to help find the root cause.

@youngnick
Member

youngnick commented Jan 29, 2021

Thanks @derekperkins, keep us posted, and I'll see what we can do about debug info for the shutdown-manager.

@stevesloka
Member

stevesloka commented Jan 29, 2021

The shutdown-manager logs quite a bit about what's going on, so if those logs are empty there are only a few possible explanations.

Could you look at the logs of the last restarted shutdown-manager container? If there are logs there, that would tell us what happened.

Since there are restart events on the shutdown-manager container, it could be that it failed its liveness probe and was restarted by the kubelet. If that's the case, I think the preStop hook would still fire and the shutdown sequence would start, but since a new container is spun up, there wouldn't be any logs in the currently running one.

I'd be curious about the logs of the restarted container, and maybe also try taking out the liveness probe on the shutdown-manager? I'm curious why it might be failing, but this would also help with understanding whether the shutdown-manager is actually the root cause.

@stevesloka
Member

@derekperkins also do you have a single log line that the shutdown-manager was started? It logs when it's first started like this:

$ kubectl logs -f -n projectcontour envoy-r8zbc -c shutdown-manager                                                           
time="2021-01-28T21:34:05Z" level=info msg="started envoy shutdown manager" context=shutdown-manager

@derekperkins
Author

Yes, I have that log for all the shutdown-manager containers I've checked. Looking at past logs, the only thing I'm seeing besides that in shutdown-manager is
level=info msg="file /ok does not exist; checking again in 1s" context=shutdownReadyHandler

@stevesloka
Member

Ahh ok, so that confirms the shutdown sequence did happen. I suspect it's because the kubelet killed the pod after a failed liveness probe - the container somehow failed that probe and received the preStop hook, but for the wrong reason.

I'd remove that probe and see if the problem comes back.

@stevesloka
Member

I posted in #3286 (comment) about some of the logging issues, but something to keep in mind is that the logs from the preStop hook container exec won't be visible in the kubectl logs command of the shutdown manager.

We'll need to see if there's a better way to get at them (other than events), but for now I'd be interested in whether the issues go away if you take out the liveness probe. Or, if the issue happens again, if you could pull the events from your cluster and see whether they tell us anything interesting.

@McShauno

@McShauno Can you post your infrastructure / version details?

Apologies for the delayed response.

We are running PKS/TKGI 1.9.1 on 50 nodes with Contour v1.11.0, Kubernetes v1.18.8+vmware.1.

@fjudith

fjudith commented Feb 27, 2021

same issue on kubeadm 1.20.4 vanilla

level=info msg="file /ok does not exist; checking again in 1s" context=shutdownReadyHandler

@youngnick
Member

@McShauno, @fjudith, @derekperkins, there have been a few Contour versions since we last heard anything on this issue, have you seen these problems recently?

@youngnick youngnick self-assigned this May 19, 2021
@McShauno

@McShauno, @fjudith, @derekperkins, there have been a few Contour versions since we last heard anything on this issue, have you seen these problems recently?

@youngnick I will upgrade and let you know within the next couple of days.

@derekperkins
Author

I can't remember exactly when it disappeared, but this has been solved for a while for us.

@youngnick
Member

Thanks for the updates @derekperkins and @McShauno. I'll wait for a few days for a check-in from @McShauno before closing this one, but looks like it's fixed somehow.

@McShauno

@youngnick

Actually, it seems we might still be having the same problem. I upgraded to the latest version about 48h ago and here is the current pod list:

[screenshot: kubectl get pods output for the envoy daemonset]

You'll notice there is one pod towards the bottom with 1/2. Here are the details on that pod:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/network-status: |-
      [
        {
          "is_passthrough_net": false,
          "interface": "eth0",
          "vlan_id": XXX,
          "default": true,
          "attachment_id": "e83d1b62-d89a-4f1b-81e4-72aa1af89f36",
          "wait_for_sync": false,
          "ip": "XXX.XX.XX.XX/24",
          "name": "cluster-wide-default",
          "gateway_ip": "X.X.X.X",
          "mac": "XXX:XX:XX:00:XX:XX"
        }
      ]
    prometheus.io/path: /stats/prometheus
    prometheus.io/port: "8002"
    prometheus.io/scrape: "true"
  creationTimestamp: "2021-05-25T13:53:07Z"
  generateName: envoy-
  labels:
    app: envoy
    controller-revision-hash: 775bbbdddf
    pod-template-generation: "1"
  name: envoy-pskgj
  namespace: projectcontour
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: envoy
    uid: 12e4a755-0d85-46e5-a89f-288c257a72a1
  resourceVersion: "747713495"
  selfLink: /api/v1/namespaces/projectcontour/pods/envoy-pskgj
  uid: 52914e35-8284-430f-91db-5c533f487513
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - e3ffe8d5-bb95-40ec-8fa7-b4182ec8b497
  automountServiceAccountToken: false
  containers:
  - args:
    - envoy
    - shutdown-manager
    command:
    - /bin/contour
    image: docker.io/projectcontour/contour:v1.15.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/contour
          - envoy
          - shutdown
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8090
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: shutdown-manager
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  - args:
    - -c
    - /config/envoy.json
    - --service-cluster $(CONTOUR_NAMESPACE)
    - --service-node $(ENVOY_POD_NAME)
    - --log-level info
    command:
    - envoy
    env:
    - name: CONTOUR_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: ENVOY_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: docker.io/envoyproxy/envoy:v1.18.2
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        httpGet:
          path: /shutdown
          port: 8090
          scheme: HTTP
    name: envoy
    ports:
    - containerPort: 8080
      hostPort: 80
      name: http
      protocol: TCP
    - containerPort: 8443
      hostPort: 443
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /ready
        port: 8002
        scheme: HTTP
      initialDelaySeconds: 3
      periodSeconds: 4
      successThreshold: 1
      timeoutSeconds: 1
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /config
      name: envoy-config
      readOnly: true
    - mountPath: /certs
      name: envoycert
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - args:
    - bootstrap
    - /config/envoy.json
    - --xds-address=contour
    - --xds-port=8001
    - --xds-resource-version=v3
    - --resources-dir=/config/resources
    - --envoy-cafile=/certs/ca.crt
    - --envoy-cert-file=/certs/tls.crt
    - --envoy-key-file=/certs/tls.key
    command:
    - contour
    env:
    - name: CONTOUR_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    image: docker.io/projectcontour/contour:v1.15.0
    imagePullPolicy: IfNotPresent
    name: envoy-initconfig
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /config
      name: envoy-config
    - mountPath: /certs
      name: envoycert
      readOnly: true
  nodeName: e3ffe8d5-bb95-40ec-8fa7-b4182ec8b497
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: envoy
  serviceAccountName: envoy
  terminationGracePeriodSeconds: 300
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - emptyDir: {}
    name: envoy-config
  - name: envoycert
    secret:
      defaultMode: 420
      secretName: envoycert
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T13:53:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T18:35:18Z"
    message: 'containers with unready status: [envoy]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T18:35:18Z"
    message: 'containers with unready status: [envoy]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-05-25T13:53:07Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://d9527b546cad0b1b910c43b8b6ae6dcd40a7c87f15c2a2406955ecacf77f2cd9
    image: envoyproxy/envoy:v1.18.2
    imageID: docker-pullable://envoyproxy/envoy@sha256:e8b37c1d75787dd1e712ff389b0d37337dc8a174a63bed9c34ba73359dc67da7
    lastState: {}
    name: envoy
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-05-25T13:53:27Z"
  - containerID: docker://f72285b7ecf6365c45ac10e1440438f308d5fe77ea4392459dbd4716d1d80dd0
    image: projectcontour/contour:v1.15.0
    imageID: docker-pullable://projectcontour/contour@sha256:6ce4fa337e4fa5d8588e2b7e78082c75815c48e4bd0db5c3f5f4615ddaeaf3fc
    lastState:
      terminated:
        containerID: docker://e504fcaac0ce59a0b2a4c02a7d489d641c72d4debf0a694804c05a8607cce57d
        exitCode: 2
        finishedAt: "2021-05-25T18:37:57Z"
        reason: Error
        startedAt: "2021-05-25T13:53:21Z"
    name: shutdown-manager
    ready: true
    restartCount: 1
    started: true
    state:
      running:
        startedAt: "2021-05-25T18:37:59Z"
  hostIP: XXX.XX.XX.X
  initContainerStatuses:
  - containerID: docker://ae19705a968d0e0a432f570b3e5d16c6b36b82fd131de90c384c31423230e5c8
    image: projectcontour/contour:v1.15.0
    imageID: docker-pullable://projectcontour/contour@sha256:6ce4fa337e4fa5d8588e2b7e78082c75815c48e4bd0db5c3f5f4615ddaeaf3fc
    lastState: {}
    name: envoy-initconfig
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://ae19705a968d0e0a432f570b3e5d16c6b36b82fd131de90c384c31423230e5c8
        exitCode: 0
        finishedAt: "2021-05-25T13:53:20Z"
        reason: Completed
        startedAt: "2021-05-25T13:53:20Z"
  phase: Running
  podIP: XXX.XX.XX.X
  podIPs:
  - ip: XXX.XX.XX.X
  qosClass: BestEffort
  startTime: "2021-05-25T13:53:07Z"

This is the same issue we have been seeing in the past, and it's also one of the reasons we haven't started using Contour for all apps that need ingress.

@youngnick
Copy link
Member

Just to confirm @McShauno, that one Envoy pod is responding to the /ready endpoint with a 503 and DRAINING still? If so, then there's still something we need to find out about what's triggering the shutdown manager to tell Envoy to start draining connections.

Just as a reminder, the normal flow is:

  • Envoy pod starts, is unhealthy
  • Envoy pod connects to Contour for configuration, will open ports once valid config is received from Contour. Once that happens, it will flip to healthy.
  • Envoy pod will run as normal
  • When the kubelet asks the pod to shut down, the preStop hooks will fire, and the shutdown manager container will POST to /healthcheck/fail on the Envoy container to flip Envoy into draining mode.
  • The shutdown manager will monitor the open connections using Prometheus metrics until either all the connections have drained or an interval has passed, at which time it will write a file /ok in the shared volume. This file's presence then tells the shutdown manager that it can let the preStop hook finish.

So there are a few places that this could go wrong. In order for us to find the place, we're going to need to go pretty deep on the pod that has the problem.

Could you forward the admin interface of the pod, and check the configuration dump? (Some help is at https://projectcontour.io/docs/v1.16.0/troubleshooting/envoy-admin-interface/ if you need it). Also, if you could check the logs for the containers, including Envoy and the shutdown manager, that would be helpful.

@Legion2
Copy link

Legion2 commented May 30, 2021

I'm currently investigating envoy pod container restart alerts from our cluster and discovered this issue. I also see many time="2021-05-30T15:05:44Z" level=info msg="file /ok does not exist; checking again in 1s" context=shutdownReadyHandler lines in the logs. I think this is because the envoy container was restarted but the shutdown-manager didn't recognize it, so the shutdown-manager still waits for the file /ok even though the envoy container was already restarted and is running "normally" (the readiness probe still fails sometimes).

@youngnick
Copy link
Member

hmm, that is a good point, and it would make sense that the shutdown-manager is not resilient against something like that. Do you have any idea how the envoy container was restarted without the whole pod being restarted?

@calshankar
Copy link

calshankar commented May 31, 2021

@youngnick I can confirm the same behavior. I'm able to reproduce the issue in a multi-node kind cluster. To me it looks like a sync issue whenever the pod restarts, either due to a docker restart or a machine reboot. Let me know if you need any further details.

I'm currently investigating envoy pod container restart alerts from our cluster and discovered this issue. I also see many time="2021-05-30T15:05:44Z" level=info msg="file /ok does not exist; checking again in 1s" context=shutdownReadyHandler lines in the logs. I think this is because the envoy container was restarted but the shutdown-manager didn't recognize it, so the shutdown-manager still waits for the file /ok even though the envoy container was already restarted and is running "normally" (the readiness probe still fails sometimes).

Environment:

kind version: (use kind version): v0.11.0
Kubernetes version: (use kubectl version): 1.21
Docker version: (use docker info): attached
OS (e.g. from /etc/os-release): MANJARO Kernel -> 5.11.18-1

Docker info...

Server Version: 20.10.6
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2

@youngnick
Copy link
Member

Thanks for that @calshankar, I will try and repro myself as well.

@Legion2
Copy link

Legion2 commented May 31, 2021

@youngnick when the liveness probe of a container fails, only that container is restarted; the pod is not recreated for such an event.

@McShauno
Copy link

McShauno commented Jun 1, 2021

Thanks @youngnick. I will get this information together this week.

Just to confirm @McShauno, that one Envoy pod is responding to the /ready endpoint with a 503 and DRAINING still? If so, then there's still something we need to find out about what's triggering the shutdown manager to tell Envoy to start draining connections.

Just as a reminder, the normal flow is:

  • Envoy pod starts, is unhealthy
  • Envoy pod connects to Contour for configuration, will open ports once valid config is received from Contour. Once that happens, it will flip to healthy.
  • Envoy pod will run as normal
  • When the kubelet asks the pod to shut down, the preStop hooks will fire, and the shutdown manager container will POST to /healthcheck/fail on the Envoy container to flip Envoy into draining mode.
  • The shutdown manager will monitor the open connections using Prometheus metrics until either all the connections have drained or an interval has passed, at which time it will write a file /ok in the shared volume. This file's presence then tells the shutdown manager that it can let the preStop hook finish.

So there are a few places that this could go wrong. In order for us to find the place, we're going to need to go pretty deep on the pod that has the problem.

Could you forward the admin interface of the pod, and check the configuration dump? (Some help is at https://projectcontour.io/docs/v1.16.0/troubleshooting/envoy-admin-interface/ if you need it). Also, if you could check the logs for the containers, including Envoy and the shutdown manager, that would be helpful.

@stevesloka
Copy link
Member

I can confirm the same behavior. I'm able to reproduce the issue in a multi-node kind cluster.

@calshankar could you share your notes on how to reproduce in kind?

@calshankar
Copy link

calshankar commented Jun 1, 2021

@stevesloka @youngnick this is not an issue with Contour as such: kubernetes-sigs/kind#2045. I'm not sure whether this is fixed or documented on the landing page for multi-node clusters.

Unfortunately kind does not provide/ensure a stable IP for the control plane because it is not behind DNS. Ideally it should use DNS at least for the control-plane components. This issue can crop up whenever docker or the machine reboots. 🙏 Don't restart/reboot or re-create the kind cluster 😄. Not an issue from my side.

@sharpjs
Copy link

sharpjs commented Jul 28, 2021

I think I might have this behaviour in an Azure Kubernetes Service cluster using k8s 1.19.9. I'll try upgrading k8s to see if that changes anything. Diags follow, a few IPs/names redacted.

EDIT: 15 hours after upgrading this AKS cluster to k8s 1.20.7, all of the Envoy pods are healthy with 0 restarts. I'll update again if that changes.

Things that might be useful:

  • The problem has always been with the first node in the pool, ...-vmss000000.
  • Envoy spits out a bunch of Lua-related error messages.
  • I have enableExternalNameService: true because my application needs ExternalName services, and the CVE does not apply to my situation.
kubectl get all
NAME                                READY   STATUS      RESTARTS   AGE
pod/contour-certgen-v1.18.0-gxss8   0/1     Completed   0          67m
pod/contour-f6b8fffd7-25m57         1/1     Running     0          66m
pod/contour-f6b8fffd7-zvfjn         1/1     Running     0          66m
pod/envoy-fpgh9                     2/2     Running     0          47m
pod/envoy-h6p6h                     2/2     Running     0          54m
pod/envoy-hthht                     2/2     Running     0          58m
pod/envoy-lswxr                     2/2     Running     0          57m
pod/envoy-qzqg2                     2/2     Running     0          55m
pod/envoy-v5d7n                     2/2     Running     0          48m
pod/envoy-vts86                     2/2     Running     0          51m
pod/envoy-w79j5                     2/2     Running     0          50m
pod/envoy-wh4p8                     2/2     Running     0          53m
pod/envoy-wqjk4                     1/2     Running     2          60m

NAME              TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                      AGE
service/contour   ClusterIP      10.0.233.185   <none>          8001/TCP                     64d
service/envoy     LoadBalancer   10.0.251.42    xx.xxx.xx.xxx   80:32726/TCP,443:30794/TCP   64d

NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/envoy   10        10        9       10           9           <none>          64d

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/contour   2/2     2            2           64d

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/contour-f6b8fffd7   2         2         2       66m

NAME                                COMPLETIONS   DURATION   AGE
job.batch/contour-certgen-v1.18.0   1/1           6s         67m
kubectl describe pod/envoy-wqjk4
Name:         envoy-wqjk4
Namespace:    projectcontour
Priority:     0
Node:         aks-lnodes0-19708131-vmss000000/10.240.0.4
Start Time:   Wed, 28 Jul 2021 14:05:00 -0700
Labels:       app=envoy
              controller-revision-hash=76f9cc5cbf
              pod-template-generation=5
Annotations:  prometheus.io/path: /stats/prometheus
              prometheus.io/port: 8002
              prometheus.io/scrape: true
Status:       Running
IP:           10.240.0.21
IPs:
  IP:           10.240.0.21
Controlled By:  DaemonSet/envoy
Init Containers:
  envoy-initconfig:
    Container ID:  containerd://14a5b639395a350dd897c8ed9d9950033852e9fdc924a680abfa0b5e5eb7da90
    Image:         docker.io/projectcontour/contour:v1.18.0
    Image ID:      docker.io/projectcontour/contour@sha256:8f3eea380422283e29a788d71712cb208bb07b57668029b6edda71fa7e050cb0
    Port:          <none>
    Host Port:     <none>
    Command:
      contour
    Args:
      bootstrap
      /config/envoy.json
      --xds-address=contour
      --xds-port=8001
      --xds-resource-version=v3
      --resources-dir=/config/resources
      --envoy-cafile=/certs/ca.crt
      --envoy-cert-file=/certs/tls.crt
      --envoy-key-file=/certs/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 28 Jul 2021 14:05:05 -0700
      Finished:     Wed, 28 Jul 2021 14:05:05 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
    Mounts:
      /certs from envoycert (ro)
      /config from envoy-config (rw)
Containers:
  shutdown-manager:
    Container ID:  containerd://48d20bb1ce17c802870b9be4eae1e5bc6d35c60e77c25dd7c1b2c50b29e8574b
    Image:         docker.io/projectcontour/contour:v1.18.0
    Image ID:      docker.io/projectcontour/contour@sha256:8f3eea380422283e29a788d71712cb208bb07b57668029b6edda71fa7e050cb0
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/contour
    Args:
      envoy
      shutdown-manager
    State:          Running
      Started:      Wed, 28 Jul 2021 14:55:47 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 28 Jul 2021 14:45:42 -0700
      Finished:     Wed, 28 Jul 2021 14:55:47 -0700
    Ready:          True
    Restart Count:  2
    Liveness:       http-get http://:8090/healthz delay=3s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:         <none>
  envoy:
    Container ID:  containerd://9123103eca4765dcdef558606b61b5571195868cc0bb0292df63f4b4e8cb09f4
    Image:         docker.io/envoyproxy/envoy:v1.19.0
    Image ID:      docker.io/envoyproxy/envoy@sha256:ec7228053c7e99bf481901960b9074528be407ede2363b6152fb93a1eee872cf
    Ports:         8080/TCP, 8443/TCP
    Host Ports:    80/TCP, 443/TCP
    Command:
      envoy
    Args:
      -c
      /config/envoy.json
      --service-cluster $(CONTOUR_NAMESPACE)
      --service-node $(ENVOY_POD_NAME)
      --log-level info
    State:          Running
      Started:      Wed, 28 Jul 2021 14:05:12 -0700
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://:8002/ready delay=3s timeout=1s period=4s #success=1 #failure=3
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
      ENVOY_POD_NAME:     envoy-wqjk4 (v1:metadata.name)
    Mounts:
      /certs from envoycert (ro)
      /config from envoy-config (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  envoy-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  envoycert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  envoycert
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  57m                    default-scheduler  Successfully assigned projectcontour/envoy-wqjk4 to aks-lnodes0-19708131-vmss000000
  Normal   Pulling    57m                    kubelet            Pulling image "docker.io/projectcontour/contour:v1.18.0"
  Normal   Created    57m                    kubelet            Created container envoy-initconfig
  Normal   Pulled     57m                    kubelet            Successfully pulled image "docker.io/projectcontour/contour:v1.18.0" in 2.422063916s
  Normal   Pulled     57m                    kubelet            Successfully pulled image "docker.io/projectcontour/contour:v1.18.0" in 541.468544ms
  Normal   Started    57m                    kubelet            Started container envoy-initconfig
  Normal   Pulling    57m                    kubelet            Pulling image "docker.io/projectcontour/contour:v1.18.0"
  Normal   Pulling    57m                    kubelet            Pulling image "docker.io/envoyproxy/envoy:v1.19.0"
  Normal   Created    57m                    kubelet            Created container shutdown-manager
  Normal   Started    57m                    kubelet            Started container shutdown-manager
  Normal   Pulled     57m                    kubelet            Successfully pulled image "docker.io/envoyproxy/envoy:v1.19.0" in 6.245411577s
  Normal   Created    57m                    kubelet            Created container envoy
  Normal   Started    57m                    kubelet            Started container envoy
  Warning  Unhealthy  18m (x6 over 48m)      kubelet            Liveness probe failed: Get "http://10.240.0.21:8090/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  18m (x25 over 48m)     kubelet            Readiness probe failed: Get "http://10.240.0.21:8002/ready": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  2m36s (x219 over 17m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
kubectl logs pod/envoy-wqjk4 -c shutdown-manager
time="2021-07-28T21:55:47Z" level=info msg="started envoy shutdown manager" context=shutdown-manager
kubectl logs pod/envoy-wqjk4 -c envoy
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:338] initializing epoch 0 (base id=0, hot restart version=11.104)
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:340] statically linked extensions:
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.compression.compressor: envoy.compression.brotli.compressor, envoy.compression.gzip.compressor
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.internal_redirect_predicates: envoy.internal_redirect_predicates.allow_listed_routes, envoy.internal_redirect_predicates.previous_routes, envoy.internal_redirect_predicates.safe_cross_scheme
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.http.stateful_header_formatters: preserve_case
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.dubbo_proxy.filters: envoy.filters.dubbo.router
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.resolvers: envoy.ip
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.http.original_ip_detection: envoy.http.original_ip_detection.custom_header, envoy.http.original_ip_detection.xff
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.upstream_options: envoy.extensions.upstreams.http.v3.HttpProtocolOptions, envoy.upstreams.http.http_protocol_options
[2021-07-28 21:05:12.680][1][info][main] [source/server/server.cc:342]   envoy.quic.proof_source: envoy.quic.proof_source.filter_chain
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.matching.action: composite-action, skip
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.clusters: envoy.cluster.eds, envoy.cluster.logical_dns, envoy.cluster.original_dst, envoy.cluster.static, envoy.cluster.strict_dns, envoy.clusters.aggregate, envoy.clusters.dynamic_forward_proxy, envoy.clusters.redis
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.matching.input_matchers: envoy.matching.matchers.consistent_hashing, envoy.matching.matchers.ip
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.transport_sockets.upstream: envoy.transport_sockets.alts, envoy.transport_sockets.quic, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.starttls, envoy.transport_sockets.tap, envoy.transport_sockets.tls, envoy.transport_sockets.upstream_proxy_protocol, raw_buffer, starttls, tls
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.filters.network: envoy.client_ssl_auth, envoy.echo, envoy.ext_authz, envoy.filters.network.client_ssl_auth, envoy.filters.network.connection_limit, envoy.filters.network.direct_response, envoy.filters.network.dubbo_proxy, envoy.filters.network.echo, envoy.filters.network.ext_authz, envoy.filters.network.http_connection_manager, envoy.filters.network.kafka_broker, envoy.filters.network.local_ratelimit, envoy.filters.network.mongo_proxy, envoy.filters.network.mysql_proxy, envoy.filters.network.postgres_proxy, envoy.filters.network.ratelimit, envoy.filters.network.rbac, envoy.filters.network.redis_proxy, envoy.filters.network.rocketmq_proxy, envoy.filters.network.sni_cluster, envoy.filters.network.sni_dynamic_forward_proxy, envoy.filters.network.tcp_proxy, envoy.filters.network.thrift_proxy, envoy.filters.network.wasm, envoy.filters.network.zookeeper_proxy, envoy.http_connection_manager, envoy.mongo_proxy, envoy.ratelimit, envoy.redis_proxy, envoy.tcp_proxy
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.formatter: envoy.formatter.req_without_query
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.grpc_credentials: envoy.grpc_credentials.aws_iam, envoy.grpc_credentials.default, envoy.grpc_credentials.file_based_metadata
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.matching.common_inputs: envoy.matching.common_inputs.environment_variable
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.dubbo_proxy.protocols: dubbo
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.retry_priorities: envoy.retry_priorities.previous_priorities
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.bootstrap: envoy.bootstrap.wasm, envoy.extensions.network.socket_interface.default_socket_interface
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.transport_sockets.downstream: envoy.transport_sockets.alts, envoy.transport_sockets.quic, envoy.transport_sockets.raw_buffer, envoy.transport_sockets.starttls, envoy.transport_sockets.tap, envoy.transport_sockets.tls, raw_buffer, starttls, tls
[2021-07-28 21:05:12.681][1][info][main] [source/server/server.cc:342]   envoy.health_checkers: envoy.health_checkers.redis
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.access_loggers: envoy.access_loggers.file, envoy.access_loggers.http_grpc, envoy.access_loggers.open_telemetry, envoy.access_loggers.stderr, envoy.access_loggers.stdout, envoy.access_loggers.tcp_grpc, envoy.access_loggers.wasm, envoy.file_access_log, envoy.http_grpc_access_log, envoy.open_telemetry_access_log, envoy.stderr_access_log, envoy.stdout_access_log, envoy.tcp_grpc_access_log, envoy.wasm_access_log
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.upstreams: envoy.filters.connection_pools.tcp.generic
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.guarddog_actions: envoy.watchdog.abort_action, envoy.watchdog.profile_action
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.filters.udp_listener: envoy.filters.udp.dns_filter, envoy.filters.udp_listener.udp_proxy
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.http.cache: envoy.extensions.http.cache.simple
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.matching.http.input: request-headers, request-trailers, response-headers, response-trailers
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.retry_host_predicates: envoy.retry_host_predicates.omit_canary_hosts, envoy.retry_host_predicates.omit_host_metadata, envoy.retry_host_predicates.previous_hosts
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.wasm.runtime: envoy.wasm.runtime.null, envoy.wasm.runtime.v8
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.tls.cert_validator: envoy.tls.cert_validator.default, envoy.tls.cert_validator.spiffe
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.compression.decompressor: envoy.compression.brotli.decompressor, envoy.compression.gzip.decompressor
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.thrift_proxy.transports: auto, framed, header, unframed
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.quic.server.crypto_stream: envoy.quic.crypto_stream.server.quiche
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.dubbo_proxy.serializers: dubbo.hessian2
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.rate_limit_descriptors: envoy.rate_limit_descriptors.expr
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.tracers: envoy.dynamic.ot, envoy.lightstep, envoy.tracers.datadog, envoy.tracers.dynamic_ot, envoy.tracers.lightstep, envoy.tracers.opencensus, envoy.tracers.skywalking, envoy.tracers.xray, envoy.tracers.zipkin, envoy.zipkin
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.dubbo_proxy.route_matchers: default
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.resource_monitors: envoy.resource_monitors.fixed_heap, envoy.resource_monitors.injected_resource
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.stats_sinks: envoy.dog_statsd, envoy.graphite_statsd, envoy.metrics_service, envoy.stat_sinks.dog_statsd, envoy.stat_sinks.graphite_statsd, envoy.stat_sinks.hystrix, envoy.stat_sinks.metrics_service, envoy.stat_sinks.statsd, envoy.stat_sinks.wasm, envoy.statsd
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.filters.listener: envoy.filters.listener.http_inspector, envoy.filters.listener.original_dst, envoy.filters.listener.original_src, envoy.filters.listener.proxy_protocol, envoy.filters.listener.tls_inspector, envoy.listener.http_inspector, envoy.listener.original_dst, envoy.listener.original_src, envoy.listener.proxy_protocol, envoy.listener.tls_inspector
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.thrift_proxy.protocols: auto, binary, binary/non-strict, compact, twitter
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.filters.http: envoy.bandwidth_limit, envoy.buffer, envoy.cors, envoy.csrf, envoy.ext_authz, envoy.ext_proc, envoy.fault, envoy.filters.http.adaptive_concurrency, envoy.filters.http.admission_control, envoy.filters.http.alternate_protocols_cache, envoy.filters.http.aws_lambda, envoy.filters.http.aws_request_signing, envoy.filters.http.bandwidth_limit, envoy.filters.http.buffer, envoy.filters.http.cache, envoy.filters.http.cdn_loop, envoy.filters.http.composite, envoy.filters.http.compressor, envoy.filters.http.cors, envoy.filters.http.csrf, envoy.filters.http.decompressor, envoy.filters.http.dynamic_forward_proxy, envoy.filters.http.dynamo, envoy.filters.http.ext_authz, envoy.filters.http.ext_proc, envoy.filters.http.fault, envoy.filters.http.grpc_http1_bridge, envoy.filters.http.grpc_http1_reverse_bridge, envoy.filters.http.grpc_json_transcoder, envoy.filters.http.grpc_stats, envoy.filters.http.grpc_web, envoy.filters.http.header_to_metadata, envoy.filters.http.health_check, envoy.filters.http.ip_tagging, envoy.filters.http.jwt_authn, envoy.filters.http.local_ratelimit, envoy.filters.http.lua, envoy.filters.http.oauth2, envoy.filters.http.on_demand, envoy.filters.http.original_src, envoy.filters.http.ratelimit, envoy.filters.http.rbac, envoy.filters.http.router, envoy.filters.http.set_metadata, envoy.filters.http.squash, envoy.filters.http.tap, envoy.filters.http.wasm, envoy.grpc_http1_bridge, envoy.grpc_json_transcoder, envoy.grpc_web, envoy.health_check, envoy.http_dynamo_filter, envoy.ip_tagging, envoy.local_rate_limit, envoy.lua, envoy.rate_limit, envoy.router, envoy.squash, match-wrapper
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.thrift_proxy.filters: envoy.filters.thrift.rate_limit, envoy.filters.thrift.router
[2021-07-28 21:05:12.682][1][info][main] [source/server/server.cc:342]   envoy.request_id: envoy.request_id.uuid
[2021-07-28 21:05:12.688][1][info][main] [source/server/server.cc:358] HTTP header map info:
[2021-07-28 21:05:12.689][1][info][main] [source/server/server.cc:361]   request header map: 632 bytes: :authority,:method,:path,:protocol,:scheme,accept,accept-encoding,access-control-request-method,authentication,authorization,cache-control,cdn-loop,connection,content-encoding,content-length,content-type,expect,grpc-accept-encoding,grpc-timeout,if-match,if-modified-since,if-none-match,if-range,if-unmodified-since,keep-alive,origin,pragma,proxy-connection,referer,te,transfer-encoding,upgrade,user-agent,via,x-client-trace-id,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-downstream-service-cluster,x-envoy-downstream-service-node,x-envoy-expected-rq-timeout-ms,x-envoy-external-address,x-envoy-force-trace,x-envoy-hedge-on-per-try-timeout,x-envoy-internal,x-envoy-ip-tags,x-envoy-max-retries,x-envoy-original-path,x-envoy-original-url,x-envoy-retriable-header-names,x-envoy-retriable-status-codes,x-envoy-retry-grpc-on,x-envoy-retry-on,x-envoy-upstream-alt-stat-name,x-envoy-upstream-rq-per-try-timeout-ms,x-envoy-upstream-rq-timeout-alt-response,x-envoy-upstream-rq-timeout-ms,x-forwarded-client-cert,x-forwarded-for,x-forwarded-proto,x-ot-span-context,x-request-id
[2021-07-28 21:05:12.690][1][info][main] [source/server/server.cc:361]   request trailer map: 136 bytes:
[2021-07-28 21:05:12.690][1][info][main] [source/server/server.cc:361]   response header map: 432 bytes: :status,access-control-allow-credentials,access-control-allow-headers,access-control-allow-methods,access-control-allow-origin,access-control-expose-headers,access-control-max-age,age,cache-control,connection,content-encoding,content-length,content-type,date,etag,expires,grpc-message,grpc-status,keep-alive,last-modified,location,proxy-connection,server,transfer-encoding,upgrade,vary,via,x-envoy-attempt-count,x-envoy-decorator-operation,x-envoy-degraded,x-envoy-immediate-health-check-fail,x-envoy-ratelimited,x-envoy-upstream-canary,x-envoy-upstream-healthchecked-cluster,x-envoy-upstream-service-time,x-request-id
[2021-07-28 21:05:12.690][1][info][main] [source/server/server.cc:361]   response trailer map: 160 bytes: grpc-message,grpc-status
[2021-07-28 21:05:12.691][1][info][admin] [source/server/admin/admin.cc:132] admin address: 127.0.0.1:9001
[2021-07-28 21:05:12.691][1][info][main] [source/server/server.cc:707] runtime: {}
[2021-07-28 21:05:12.692][1][info][config] [source/server/configuration_impl.cc:127] loading tracing configuration
[2021-07-28 21:05:12.692][1][info][config] [source/server/configuration_impl.cc:87] loading 0 static secret(s)
[2021-07-28 21:05:12.692][1][info][config] [source/server/configuration_impl.cc:93] loading 2 cluster(s)
[2021-07-28 21:05:12.694][1][info][config] [source/server/configuration_impl.cc:97] loading 0 listener(s)
[2021-07-28 21:05:12.694][1][info][config] [source/server/configuration_impl.cc:109] loading stats configuration
[2021-07-28 21:05:12.695][1][info][main] [source/server/server.cc:804] starting main dispatch loop
[2021-07-28 21:05:12.697][1][info][runtime] [source/common/runtime/runtime_impl.cc:449] RTDS has finished initialization
[2021-07-28 21:05:12.697][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:202] cm init: initializing cds
[2021-07-28 21:05:12.698][1][warning][main] [source/server/server.cc:682] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
[2021-07-28 21:05:12.713][1][info][upstream] [source/common/upstream/cds_api_helper.cc:28] cds: add 122 cluster(s), remove 2 cluster(s)
[2021-07-28 21:05:12.788][1][info][upstream] [source/common/upstream/cds_api_helper.cc:65] cds: added/updated 122 cluster(s), skipped 0 unmodified cluster(s)
[2021-07-28 21:05:12.847][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:182] cm init: initializing secondary clusters
[2021-07-28 21:05:12.914][1][info][upstream] [source/common/upstream/cluster_manager_impl.cc:206] cm init: all clusters initialized
[2021-07-28 21:05:12.914][1][info][main] [source/server/server.cc:785] all clusters initialized. initializing init manager
[2021-07-28 21:05:12.935][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'ingress_http'
[2021-07-28 21:05:12.940][1][info][lua] [source/extensions/filters/http/lua/lua_filter.cc:188] envoy_on_response() function not found. Lua filter will not hook responses.
[2021-07-28 21:05:12.942][1][info][lua] [source/extensions/filters/http/lua/lua_filter.cc:188] envoy_on_response() function not found. Lua filter will not hook responses.
[... the "envoy_on_response() function not found" line repeats 35 more times between 21:05:12.943 and 21:05:13.024 ...]
[2021-07-28 21:05:13.052][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'ingress_https'
[2021-07-28 21:05:13.057][1][info][upstream] [source/server/lds_api.cc:78] lds: add/update listener 'stats-health'
[2021-07-28 21:05:13.104][1][info][config] [source/server/listener_manager_impl.cc:834] all dependencies initialized. starting workers
[2021-07-28T21:16:31.830Z] "GET /api/health/ HTTP/1.1" 200 - 0 145 9 9 "xx.xxx.xxx.x" "-" "c064c5ee-12c6-4ebc-8f28-bfd82515fb40" "redacted.example.com" "10.240.1.37:80"
[2021-07-28T21:17:03.298Z] "GET /api/health/ HTTP/1.1" 200 - 0 145 5 4 "xx.xxx.xxx.x" "-" "7d99f999-1557-4a0d-984c-3ddd783ada86" "redacted.example.com" "10.240.1.37:80"
[2021-07-28T21:17:35.283Z] "GET /api/health/ HTTP/1.1" 200 - 0 146 6 6 "xx.xxx.xxx.x" "-" "454085e0-9994-49ef-935e-3d2d0bdd0776" "redacted.example.com" "10.240.1.37:80"
[2021-07-28T21:18:06.830Z] "GET /api/health/ HTTP/1.1" 200 - 0 146 6 5 "xx.xxx.xxx.x" "-" "89522b76-fcf6-4451-a96d-a5e7f7fb8bfc" "redacted.example.com" "10.240.1.37:80"
[2021-07-28 21:20:13.085][1][info][main] [source/server/drain_manager_impl.cc:171] shutting down parent after drain
contour-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: contour
  namespace: projectcontour
data:
  contour.yaml: |
    # Example file:
    # https://github.com/projectcontour/contour/blob/v1.18.0/examples/contour/01-contour-config.yaml
    #
    # Documentation:
    # https://projectcontour.io/docs/v1.18.0/configuration/#configuration-file

    # -----------------------------------------------------------------------------
    # Customizations

    # Contour as of v1.17.1 disables ExternalName services by default because
    # they can present a security risk.  The risk does not arise in our case,
    # and in fact we need ExternalName services.  So turn them back on.  For
    # more information, see the Contour advisory notice:
    # https://github.com/projectcontour/contour/security/advisories/GHSA-5ph6-qq5x-7jwc
    enableExternalNameService: true

    # -----------------------------------------------------------------------------
    # Below this line is the default config file content, with comments removed.

    disablePermitInsecure: false
    tls:
      fallback-certificate:
      envoy-client-certificate:
    accesslog-format: envoy

@brunda27anaji

I have encountered the same issue on our multi-node kops cluster on AWS. It's a 4-node cluster. Please find attached the screenshot, where 3 of the 4 envoy pods are in the 1/2 ready state.

Screenshot 2021-07-29 at 16 29 13

The envoy containers of these pods are all failing with the same error reported by @derekperkins for this issue, with the event message "Readiness probe failed: HTTP probe failed with statuscode: 503".
The shutdown-manager container only shows its startup message:
kubectl logs pod/envoy-vgzdw -n projectcontour -c shutdown-manager
time="2021-07-09T10:26:10Z" level=info msg="started envoy shutdown manager" context=shutdown-manager

I could get the configuration dump from the admin interface, but cannot share it since it contains a lot of information that would need security clearance from my organisation.

Not sure if there is a fix for this? I could only think of possible causes from the discussion above.
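For anyone else debugging this state, one way to confirm what the kubelet is seeing is to query Envoy's readiness listener directly. A sketch assuming the stock example deployment (where the stats-health listener serves `/ready` on port 8002, as in the pod description later in this thread), a placeholder pod name, and `curl` being present in the envoy image:

```shell
# Hit the same endpoint the readiness probe uses. A healthy Envoy
# answers 200; one that has been told to drain answers 503 with the
# body "DRAINING" and will not recover without a container restart.
kubectl -n projectcontour exec envoy-vgzdw -c envoy -- \
  curl -s -o - -w '\nHTTP %{http_code}\n' http://localhost:8002/ready
```

If the body says DRAINING, the drain sequence (normally triggered by the shutdown-manager's preStop hook) ran while the pod was expected to keep serving.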

@sharpjs

sharpjs commented Jul 29, 2021

Update: 15 hours after upgrading my AKS cluster to k8s 1.20.7, all of the Envoy pods are healthy with 0 restarts. Perhaps that fixed it for me.

@PauloLuna

I encountered the same problem when I renewed the certificates of a cluster managed by kubeadm. After rerunning the contour-certgen-master job, the problem was solved.
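The steps @PauloLuna describes can be sketched roughly as follows; this is a hedged outline assuming the quickstart manifest layout (the job name comes from the comment above, and the workload names from the example manifests):

```shell
# Delete the completed certgen job so re-applying the manifest recreates it
kubectl -n projectcontour delete job contour-certgen-master

# Re-apply the manifest; the recreated job regenerates the xDS TLS secrets
kubectl apply -f https://projectcontour.io/quickstart/contour.yaml

# Restart the workloads so they pick up the fresh certificates
kubectl -n projectcontour rollout restart deployment/contour daemonset/envoy
```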

@youngnick
Member

Looking at this issue, it seems like this problem is basically the symptom that something unrecoverable is wrong with your Contour install. From what I can see the initial problems are all resolved now as well. Are any watchers of this issue still having outstanding problems that haven't been resolved?

@Legion2

Legion2 commented Aug 11, 2021

@youngnick were my comments #3192 (comment) #3192 (comment) addressed? I did not see them referenced anywhere.

@sharpjs

sharpjs commented Aug 12, 2021

@youngnick Since upgrading Kubernetes in my cluster to 1.20.x, I have seen a total of two Envoy restarts, as opposed to thousands before.

My Contour/Envoy install is directly from kubectl apply -fing the Contour manifests, no customization. I suspect the problem was some race between the shutdown-manager and envoy containers.

@youngnick
Member

@Legion2 thanks, I was having trouble tracking everyone's problems since we're kinda all smooshed together here. Give me a sec to review.

@youngnick
Member

@Legion2 I actually think that this issue is not the right place to work on your problem, since it doesn't sound like you have the problem of Envoy pods being stuck at unready? If not, we should move your discussion to a new issue, I think.

In that issue, we'll need a (redacted) copy of your Envoy YAML, so we can try and reproduce. Looking at our example YAMLs, I'm not sure how the situation you describe could arise, but there is a possibility we may need to handle the additional case in the shutdown manager.

I apologise for not requesting this earlier and having your comments get lost in the shuffle, but we should be able to get onto them with a new issue.

I'll give this another day or so for others to respond, but if no one else has any "Envoy is stuck as unready" problems, I'll close this one out.

@Legion2

Legion2 commented Aug 12, 2021

We added resource requests to the containers, which reduced the health probe failures and container restarts. I will open a new issue if we face increased error rates again.
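For reference, resource requests like the ones @Legion2 describes can be added to both containers in the envoy DaemonSet pod spec (the container names match the pod description later in this thread); the values below are illustrative only and should be tuned to your traffic:

```yaml
# Illustrative fragment of the envoy DaemonSet pod spec; values are examples.
containers:
- name: envoy
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
- name: shutdown-manager
  resources:
    requests:
      cpu: 25m
      memory: 50Mi
```

Without requests the pods run at BestEffort QoS and are first in line for CPU starvation and eviction under node pressure, which can surface as readiness probe timeouts.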

@youngnick
Member

Thanks very much @Legion2, I'll close this one out then.

If anyone else finds this issue, please log a fresh issue, and link back to here if you think it seems similar.

@phaniraj75

@youngnick @stevesloka we are facing the same issue whenever we upgrade the CNI plugin. The readiness probe of the envoy pod responds with status code 503 and the body "DRAINING".
One thing we observed when this happens is that the restart counts of the containers in the envoy pod always mismatch: the shutdown-manager container has a higher restart count than the envoy container.

Contour version: v1.20.1

The installation done using: kubectl apply -f https://projectcontour.io/quickstart/contour.yaml

Following logs are repeatedly printed:

[2022-04-06 03:37:23.500][18][debug][router] [source/common/router/router.cc:1320] [C46933][S12387350822765833839] upstream headers complete: end_stream=false
[2022-04-06 03:37:23.500][18][debug][http] [source/common/http/conn_manager_impl.cc:1400] [C46933][S12387350822765833839] drain closing connection
[2022-04-06 03:37:23.500][18][debug][http] [source/common/http/conn_manager_impl.cc:1472] [C46933][S12387350822765833839] encoding headers via codec (end_stream=false):
':status', '503'
'content-type', 'text/plain; charset=UTF-8'
'cache-control', 'no-cache, max-age=0'
'x-content-type-options', 'nosniff'
'date', 'Wed, 06 Apr 2022 03:37:23 GMT'
'server', 'envoy'
'x-envoy-upstream-service-time', '1'
'connection', 'close'

[2022-04-06 03:37:23.500][18][debug][client] [source/common/http/codec_client.cc:129] [C6] response complete
[2022-04-06 03:37:23.500][18][debug][pool] [source/common/http/http1/conn_pool.cc:53] [C6] response complete
[2022-04-06 03:37:23.500][18][debug][pool] [source/common/conn_pool/conn_pool_base.cc:210] [C6] destroying stream: 0 remaining
[2022-04-06 03:37:23.500][18][debug][connection] [source/common/network/connection_impl.cc:648] [C46933] remote close
[2022-04-06 03:37:23.500][18][debug][connection] [source/common/network/connection_impl.cc:249] [C46933] closing socket: 0
[2022-04-06 03:37:23.500][18][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C46933] adding to cleanup list
[2022-04-06 03:37:25.133][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:130] dns resolution for contour failed with c-ares status 11
[2022-04-06 03:37:25.134][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:236] dns resolution for contour completed with status 0
[2022-04-06 03:37:25.134][1][debug][upstream] [source/common/upstream/upstream_impl.cc:256] transport socket match, socket default selected for host with address 10.108.
[2022-04-06 03:37:25.135][1][debug][upstream] [source/common/upstream/strict_dns_cluster.cc:178] DNS refresh rate reset for contour, refresh rate 5000 ms

Shutdown-manager hook logs

time="2022-04-04T05:12:08Z" level=info msg="started envoy shutdown manager" context=shutdown-manager

Status of envoy pods:

Name:         envoy-qknm9
Namespace:    projectcontour
Priority:     0
Node:         irina-staf-debug-6.englab.juniper.net/192.169.105.6
Start Time:   Thu, 31 Mar 2022 10:15:43 -0700
Labels:       app=envoy
              controller-revision-hash=5b5b964765
              pod-template-generation=4
Annotations:  kube-manager.juniper.net/vm-uuid: efeed61d-d084-4521-87ca-2b8789e39324
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 8002
              prometheus.io/scrape: true
Status:       Running
IP:           10.108.69.4
IPs:
  IP:           10.108.69.4
  IP:           fd85:ee78:d8a6:8608::1:504
Controlled By:  DaemonSet/envoy
Init Containers:
  envoy-initconfig:
    Container ID:  cri-o://7af2eee55a5e954055f70fcec97e2e8a1f5c82023a8867fe6b5cea8283427187
    Image:         ghcr.io/projectcontour/contour:v1.20.1
    Image ID:      ghcr.io/projectcontour/contour@sha256:2fd5abb2cb5feb91f1de1cfc7abd637c3eff90e77896633176735b0b0419a308
    Port:          <none>
    Host Port:     <none>
    Command:
      contour
    Args:
      bootstrap
      /config/envoy.json
      --xds-address=contour
      --xds-port=8001
      --xds-resource-version=v3
      --resources-dir=/config/resources
      --envoy-cafile=/certs/ca.crt
      --envoy-cert-file=/certs/tls.crt
      --envoy-key-file=/certs/tls.key
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 31 Mar 2022 10:16:16 -0700
      Finished:     Thu, 31 Mar 2022 10:16:16 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
    Mounts:
      /certs from envoycert (ro)
      /config from envoy-config (rw)
Containers:
  shutdown-manager:
    Container ID:  cri-o://21cfc6c35766574d83318d7774c9b60bb6c527aad93df0ce14e69cdfb8b3cf98
    Image:         ghcr.io/projectcontour/contour:v1.20.1
    Image ID:      ghcr.io/projectcontour/contour@sha256:2fd5abb2cb5feb91f1de1cfc7abd637c3eff90e77896633176735b0b0419a308
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/contour
    Args:
      envoy
      shutdown-manager
    State:          Running
      Started:      Sun, 03 Apr 2022 22:12:08 -0700
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 31 Mar 2022 10:16:17 -0700
      Finished:     Sun, 03 Apr 2022 22:12:08 -0700
    Ready:          True
    Restart Count:  1
    Liveness:       http-get http://:8090/healthz delay=3s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /admin from envoy-admin (rw)
  envoy:
    Container ID:  cri-o://fdd88d44123af71da424bbf6e4d37f9d3732d5298b1dfcca45df935315266102
    Image:         docker.io/envoyproxy/envoy:v1.21.1
    Image ID:      docker.io/envoyproxy/envoy@sha256:9e6ea1a08e8b72fc51b16261fe8033f7618b670b0c1c18e5b092191850c0747a
    Ports:         8080/TCP, 8443/TCP
    Host Ports:    80/TCP, 443/TCP
    Command:
      envoy
    Args:
      -c
      /config/envoy.json
      --service-cluster $(CONTOUR_NAMESPACE)
      --service-node $(ENVOY_POD_NAME)
      --log-level debug
      --drain-time-s 60
    State:          Running
      Started:      Thu, 31 Mar 2022 10:16:17 -0700
    Ready:          False
    Restart Count:  0
    Readiness:      http-get http://:8002/ready delay=3s timeout=1s period=10s #success=1 #failure=3
    Environment:
      CONTOUR_NAMESPACE:  projectcontour (v1:metadata.namespace)
      ENVOY_POD_NAME:     envoy-qknm9 (v1:metadata.name)
    Mounts:
      /admin from envoy-admin (rw)
      /certs from envoycert (ro)
      /config from envoy-config (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  envoy-admin:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  envoy-config:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  envoycert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  envoycert
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason            Age                       From     Message
  ----     ------            ----                      ----     -------
  Warning  DNSConfigForming  5m27s (x6237 over 5d10h)  kubelet  Search Line limits were exceeded, some search paths have been omitted, the applied search line is: projectcontour.svc.cluster.local svc.cluster.local cluster.local contrail.juniper.net englab.juniper.net juniper.net
  Warning  Unhealthy         18s (x16798 over 46h)     kubelet  Readiness probe failed: HTTP probe failed with statuscode: 503
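A hedged sketch of confirming and recovering from the state described above, assuming the stock quickstart layout and using the pod name from the description (the jsonpath output names are hypothetical until you run it against your cluster):

```shell
# Compare restart counts of the two containers; a shutdown-manager
# restart while envoy stays unready matches the mismatch reported above
kubectl -n projectcontour get pod envoy-qknm9 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'

# An Envoy stuck in DRAINING never becomes ready again; deleting the pod
# lets the DaemonSet recreate it with a fresh envoy/shutdown-manager pair
kubectl -n projectcontour delete pod envoy-qknm9
```

Note that deleting the pod triggers the normal drain sequence via the preStop hook, so expect in-flight connections to be drained before the replacement pod comes up.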

@youngnick
Member

Thanks for the report and the logs @phaniraj75. Would you please be able to log this as a separate issue though? That way we can avoid updating everyone involved in this train.
