Observing "no healthy upstream" for new deployments until ambassador pods restarted #3324
Comments
After downgrading Ambassador to 1.11.2 the issue is not reproducible, so it looks like a regression in 1.12.x.
We are also seeing a similar (same?) problem with Ambassador 1.12.1 deployed into our nonprod environment.
In researching the issue, it looks like 1.12.1 switched to using EDS. Is it possible that the EDS service is not reflecting cluster changes?
Also facing a similar issue. 1.12.0 was my first use of Ambassador and I thought I had misconfigured something. Rolling back to 1.11.2 resolved the issue.
I started monitoring the
Can you post the Mapping resources for which you are experiencing this behavior?
Here is an example of a Mapping:
I believe if you drop the … That said, this is a bug, because we used to allow that, and (a) we shouldn't disallow it without a deprecation period, and (b) we should also be logging it as an error.
Thanks! That appears to be exactly my issue. Re-reading the documentation, I see that the DNS name is not recommended, only that it might work. Why did I not notice this previously? I have tested this change successfully with a few mapping files.
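For anyone auditing many Mapping files for this, here is a minimal sketch of a check for `service` values that carry an explicit scheme prefix, one of the forms reported later in this thread as problematic. All names here are illustrative, not from this thread, and checking `https://` in addition to the reported `http://` is my own assumption:

```python
def has_scheme_prefix(service: str) -> bool:
    """True if a Mapping `service` value starts with an explicit scheme.

    The thread only mentions "http://"; also checking "https://" is an
    assumption, not something confirmed by the maintainers.
    """
    return service.startswith(("http://", "https://"))


def strip_scheme(service: str) -> str:
    """Return the `service` value with any scheme prefix removed."""
    for prefix in ("http://", "https://"):
        if service.startswith(prefix):
            return service[len(prefix):]
    return service
```

For example, `strip_scheme("http://my-svc.default:8080")` returns `my-svc.default:8080` (a hypothetical service name).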
@rhs, thanks for the workaround. However, the problem here is that we were specifically told by someone from the Dataware team to use the suffix
We observed very similar behavior, except that the "no healthy upstream" error went away once the Mapping was re-loaded. We tried removing the "http://" prefix from the service name, as was suggested in Slack for a similar situation, and it seemed to do the trick. But it is unclear what the underlying cause is, and what the correct way of specifying Mappings is to avoid this sort of scenario.
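For reference, a hedged sketch of what this change looks like in a Mapping. The resource name, namespace, prefix, and port are illustrative, not taken from this thread:

```yaml
apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: my-app            # illustrative name
  namespace: default
spec:
  prefix: /my-app/
  # Form reported as problematic on 1.12.x:
  #   service: http://my-app.default:8080
  # Form that reportedly worked after dropping the scheme prefix:
  service: my-app.default:8080
```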
This is fixed in Ambassador 1.13.0, which is now available.
Confirmed. Thank you guys for the prompt fix!
We are noticing this behavior on 1.13.5 still. Please advise.
@wissam-launchtrip can you go into a bit more detail? Are you seeing this exact issue or something similar? Anything that can help us verify the report and reproduce the issue would help toward a possible fix 👍
No actually it's a different issue. |
Description of the problem
I am facing a very strange problem. Our IT department wants us to migrate our application testing pipeline to a new cluster. After deploying Ambassador with Helm (originally 1.12.0), I tested the deployments of our applications: all the deployments succeeded, but on access to each application I consistently got the error "no healthy upstream" (the same deployments work in the old cluster).
At some point I learned about the 1.12.1 release and upgraded Ambassador with "helm upgrade". After that, all the previously broken application deployments started working without any additional changes, but every new deployment hit the same "no healthy upstream" error. Eventually Ambassador was upgraded to 1.12.2 with the same effect: the old broken deployments started working without changes, and every new deployment returned "no healthy upstream".
Investigating connectivity confirmed that the application is reachable with curl from the Ambassador pod, both via the app's Service and via the pod directly. External requests to the application, however, always ended with "no healthy upstream".
Now, if the Ambassador pod is killed (the replica count was reduced to 1 to simplify log analysis) and the Deployment/ReplicaSet replaces it with a new pod, the issue is resolved: all the broken deployments start working (tested 3 times).
Details on the current deployment:
Is there something that I might be missing during the deployment of Ambassador?
Expected behavior
All new application deployments start working without needing to restart the Ambassador pods.
Versions:
Additional context
None. I am not sure if it is a bug or not. I would appreciate any workaround for our environment.