Allow customizing initial_fetch_timeout in the envoy sidecar for Consul Service Mesh #17283

komapa · 2023-05-09T21:41:03Z

Please see istio/istio#31825. AWS also does the "right" thing here: it defaults this to 0, with the option to modify it in the rare case that different behavior is desired: https://docs.aws.amazon.com/app-mesh/latest/userguide/envoy-config.html

Feature Description

We are running into a pretty unpleasant problem: the Envoy sidecar reaches the default 15s initial_fetch_timeout, then continues starting up and responds with LIVE on the /ready endpoint even though it has NOT loaded all upstreams for all clusters from Consul.

We believe Consul should default initial_fetch_timeout to 0, because starting the Envoy sidecar proxy with incorrect configuration is much worse than not starting at all (which we can handle much more easily).
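
For readers unfamiliar with the setting: `initial_fetch_timeout` is a field on Envoy's `ConfigSource` message, and Envoy documents a value of 0 as disabling the timeout (the default is 15s). The sketch below shows roughly where the field would sit in a bootstrap `dynamic_resources` fragment when using ADS. It is an illustration only: the helper names `ads_config_source` and `dynamic_resources` are hypothetical, the fragment omits `ads_config` and everything else a real bootstrap needs, and it is not how Consul actually generates its bootstrap.

```python
import json


def ads_config_source(initial_fetch_timeout_seconds: int = 0) -> dict:
    """Build a ConfigSource fragment that fetches resources over ADS.

    Hypothetical helper. A timeout of 0 asks Envoy to wait indefinitely
    for the first xDS response instead of giving up after the default 15s.
    """
    return {
        "ads": {},
        "resource_api_version": "V3",
        # protobuf Duration values are written as "<seconds>s" in JSON/YAML
        "initial_fetch_timeout": f"{initial_fetch_timeout_seconds}s",
    }


def dynamic_resources(initial_fetch_timeout_seconds: int = 0) -> dict:
    """Apply the same timeout to both the CDS and LDS config sources."""
    return {
        "cds_config": ads_config_source(initial_fetch_timeout_seconds),
        "lds_config": ads_config_source(initial_fetch_timeout_seconds),
    }


if __name__ == "__main__":
    # Print the fragment as JSON, which Envoy also accepts for bootstrap files.
    print(json.dumps({"dynamic_resources": dynamic_resources(0)}, indent=2))
```

With the timeout at 0, initialization would block until the first CDS/LDS responses arrive, which is the "do not start rather than start broken" behavior requested above.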

Use Case(s)

Not having a broken service mesh :)

luckymike · 2023-05-09T22:19:54Z

To provide a little more color on why this is important: when envoy starts in this state, it continuously returns 503s for the upstreams that failed to populate, and the only solution is to restart the sidecar container (or kill the instance entirely).
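
One way to spot this state from outside the proxy is to ask Envoy's admin interface which clusters actually have hosts. The sketch below is a rough illustration under several assumptions: that the admin endpoint `/clusters?format=json` returns a `cluster_statuses` list whose entries carry `name` and (when populated) `host_statuses`, that the admin listener is reachable at `localhost:19000`, and that upstream cluster names begin with the service name. The function names and the sample cluster names are hypothetical.

```python
import json
import urllib.request
from typing import List


def missing_upstreams(clusters_json: dict, expected_prefixes: List[str]) -> List[str]:
    """Return the expected prefixes that match no cluster with at least one host."""
    populated = [
        status.get("name", "")
        for status in clusters_json.get("cluster_statuses", [])
        if status.get("host_statuses")  # absent or empty means no endpoints loaded
    ]
    return [
        prefix
        for prefix in expected_prefixes
        if not any(name.startswith(prefix) for name in populated)
    ]


def fetch_clusters(admin_url: str = "http://localhost:19000") -> dict:
    """Fetch cluster status from the Envoy admin API (address is an assumption)."""
    with urllib.request.urlopen(f"{admin_url}/clusters?format=json", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical upstream names; replace with the services this sidecar depends on.
    missing = missing_upstreams(fetch_clusters(), ["db.", "cache."])
    print("upstreams without hosts:", missing)
```

A check like this could back a stricter readiness probe than `/ready` alone, since `/ready` reports LIVE even when some upstreams never populated.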

david-yu · 2023-05-11T17:46:51Z

Hi @komapa @luckymike from reviewing those links you provided it does seem like the best thing to do for default config is set this to 0. Is there a chance though that initial_fetch_timeout would ever need to be configured to something that is not 0 dynamically?

komapa · 2023-05-12T03:59:42Z

> Hi @komapa @luckymike from reviewing those links you provided it does seem like the best thing to do for default config is set this to 0. Is there a chance though that initial_fetch_timeout would ever need to be configured to something that is not 0 dynamically?

Thank you for picking this ticket up, @david-yu. I cannot think of a case in our setup where that would be needed, but we obviously do not represent all users :) If it is not terribly difficult to make it an option, I would advise you to do so.

david-yu · 2023-05-15T17:53:36Z

Hi @komapa We just merged a PR that sets initial_fetch_timeout to 0 by default which should be released in 1.14.x and 1.15.x later this week. As far as customizing that option, we'll wait for further feedback before applying the flags to do so on Consul and Consul K8s. We will leave this issue open since we've only applied a more reasonable default setting but have not implemented the setting of arbitrary values for initial_fetch_timeout.

david-yu · 2023-06-01T21:30:27Z

Hi @komapa Unfortunately we'll need to roll this fix back on 1.14.x and 1.15.x in the interim as we've discovered that our implementation causes issues on Ingress, Terminating and Mesh Gateways based on further testing. We're hoping to re-release this feature again in the future.

komapa · 2023-06-16T03:53:19Z

> Hi @komapa Unfortunately we'll need to roll this fix back on 1.14.x and 1.15.x in the interim as we've discovered that our implementation causes issues on Ingress, Terminating and Mesh Gateways based on further testing. We're hoping to re-release this feature again in the future.

That is very unfortunate. Do you have any public details on what the issue is with the listed software? Also, instead of reverting, can we make it configurable, so that we can set it to zero just for the sidecars?

Thank you!

david-yu · 2023-06-16T23:41:11Z

Out of curiosity @komapa do you use any terminating or mesh gateways in your environment? We need to do more investigation to understand how to enable this. It's a lot trickier than we thought.

komapa · 2023-06-20T16:30:47Z

We do not actively use terminating gateway functionality, and we have never used any mesh gateways in our setup. We did upgrade our work-in-progress Kubernetes clusters, and we do see there that the ingress gateways on 1.15.3 seem to be having problems, which I can take a closer look at if needed.

How can we help so you can help us? :)

komapa · 2023-06-29T17:36:53Z

Bump

DanStough · 2023-07-03T16:22:40Z

Hi @komapa 👋. I'm working on a permanent fix now that I am pretty confident will be in the next set of patch releases. Thanks for working with us while we get this sorted out.

The original changes should have been reverted for 1.15.3, so it might be unrelated if you're having problems with ingress gateways. Would be curious to know the issues if you don't mind reporting here or opening a new issue.

komapa · 2023-07-20T17:35:15Z

Thank you for fixing this. Greatly appreciated! I will report the ingress gateway problem if it happens again.

david-yu · 2023-07-27T03:04:11Z

Will go ahead and close, as we do not currently plan to make this customizable. For folks who find this issue: please open a new issue if you are looking to customize the initial_fetch_timeout config for Envoy.

DanStough self-assigned this May 11, 2023

This was referenced May 11, 2023

fix(connect envoy): set initial_fetch_timeout to wait for initial xDS… #17317

Merged

fix: set initial_fetch_timeout to wait for initial xDS… hashicorp/consul-dataplane#104

Merged

DanStough mentioned this issue May 15, 2023

Backport of fix(connect envoy): set initial_fetch_timeout to wait for initial xDS into release/1.14 #17372

Merged

This was referenced Jul 5, 2023

[OSS] Fix initial_fetch_timeout to wait for all xDS resources #18024

Merged

fix(connect): set initial_fetch_time to wait indefinitely hashicorp/consul-dataplane#140

Merged

david-yu closed this as completed Jul 27, 2023
