
xDS updates are not ordered #1178

Closed
phylake opened this issue Jun 19, 2019 · 14 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@phylake

phylake commented Jun 19, 2019

I don't see any evidence in Contour's source that xDS updates are ordered correctly

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#eventual-consistency-considerations

Without explicit ordering, the order in which updates are sent depends on whichever goroutine happens to be scheduled first. That race condition matches the behavior we see: a Contour restart doesn't always cause Envoy 404s.

This is an issue for us because we can't restart or upgrade Contour without causing downtime. Clients see 404s in the window between when the ingress_http and ingress_https listeners are torn down and when the restarted Contour has responded to all of the DiscoveryRequests.

We're running a split-pod Deployment. Did you test restarting Contour when it's not in the same pod as Envoy? If so, is this expected behavior?

@davecheney
Contributor

Thank you for raising this issue. I'm sorry this is a problem for you. We are working to make the split Envoy and Contour deployment the default by the 0.15 milestone (#881).

To implement ordering we would probably have to switch to ADS. I know you've asked for this before (sorry, I can't find the issue at the moment). My answer is the same as before: implementing ADS is not on the current roadmap.

Can you please tell me more about the sequence of events you're seeing? When you say ingress_http and ingress_https are torn down, that is not something I would expect to happen unless all ingress documents were removed.

@davecheney davecheney added the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Jun 20, 2019
@davecheney
Contributor

For the five 👍's: I encourage you to join the next community meeting

https://projectcontour.io/community/

so we can talk about the issue in a higher-bandwidth forum.

I'm not saying no to ADS; I just want to understand the issues you're having. If that means I need to rescope ADS and put it on the roadmap, so be it, but I need your help to understand the issues you are experiencing.

@phylake
Author

phylake commented Jun 20, 2019

From the Envoy docs:

It’s challenging to provide the above guarantees on sequencing to avoid traffic drop when management servers are distributed

Challenging, but not impossible. Also, we're running a single Contour (because of the various other protocol issues) that handles 100+ Envoys, and that setup could be your recommended split-pod Deployment (i.e. replicas: 1). The quoted comment is specific to distributed management servers (plural), implying a singleton management server wouldn't face the same challenges.

To implement ordering we would probably have to switch to ADS

That would probably be the best approach, but some coordinated locking across the individual streams could work too. From what I see in the logs, LDS/CDS/RDS have no resource hints, so you can count on only one DiscoveryRequest of each. EDS is variable, but the set of clusters is known to Contour, so it seems feasible.

Running this sequence in a loop seems relatively easy (compared to switching to ADS):
`CDS → EDS (variable but known) → LDS → RDS`

I've built this type of "ticketing system" using Go channels before. It's not trivial but not very difficult either.
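A minimal sketch of that ticketing idea in Go — all names here (`ticketer`, `runSequence`) are hypothetical illustrations, not anything in Contour's codebase. Each per-type stream handler blocks until its turn, sends its response, then releases the next type in the sequence:

```go
package main

import "fmt"

// xdsOrder lists the response types in the sequence the Envoy docs
// recommend for make-before-break updates: clusters, then the endpoints
// for those clusters, then listeners, then routes.
var xdsOrder = []string{"CDS", "EDS", "LDS", "RDS"}

// ticketer hands out one "go" ticket per xDS type. A stream handler
// blocks on its ticket, sends its DiscoveryResponse, then releases the
// ticket for the next type in the sequence.
type ticketer struct {
	tickets map[string]chan struct{}
}

func newTicketer() *ticketer {
	t := &ticketer{tickets: make(map[string]chan struct{})}
	for _, typ := range xdsOrder {
		t.tickets[typ] = make(chan struct{}, 1)
	}
	t.tickets[xdsOrder[0]] <- struct{}{} // the first type may send immediately
	return t
}

// wait blocks until it is typ's turn to respond.
func (t *ticketer) wait(typ string) { <-t.tickets[typ] }

// done marks typ finished and unblocks the next type in the order.
func (t *ticketer) done(typ string) {
	for i, cur := range xdsOrder {
		if cur == typ && i+1 < len(xdsOrder) {
			t.tickets[xdsOrder[i+1]] <- struct{}{}
		}
	}
}

// runSequence simulates four concurrent per-type stream handlers racing
// to respond, and returns the order the responses were actually sent in.
func runSequence() []string {
	tk := newTicketer()
	sent := make(chan string, len(xdsOrder))
	for _, typ := range xdsOrder {
		typ := typ // capture loop variable for the goroutine
		go func() {
			tk.wait(typ)
			sent <- typ // stand-in for sending the DiscoveryResponse
			tk.done(typ)
		}()
	}
	out := make([]string, 0, len(xdsOrder))
	for range xdsOrder {
		out = append(out, <-sent)
	}
	return out
}

func main() {
	fmt.Println(runSequence()) // always [CDS EDS LDS RDS]
}
```

Even though the four goroutines race, the tickets force the responses out in the fixed order; a real implementation would also need to re-seed the loop on each update cycle.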

I know you've asked for this before (sorry I cannot find the issue at the moment)

No worries. I previously asked about incremental xDS #1058

@davecheney davecheney removed the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Oct 10, 2019
@davecheney davecheney added this to the Backlog milestone Oct 28, 2019
jpeach added a commit to jpeach/contour that referenced this issue Nov 12, 2019
Since the Contour GRPC server already multiplexes Envoy requests
using the resource type URL, we can support ADS simply by enabling
it in the bootstrap configuration.

This fixes projectcontour#1178.

Signed-off-by: James Peach <[email protected]>
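For reference, enabling ADS on the Envoy side is a bootstrap-level change along these lines. This is only a sketch of the shape of that configuration; the `contour` cluster name is an assumption about the local bootstrap, not Contour's shipped config:

```yaml
# Hypothetical Envoy bootstrap fragment: route all xDS over one ADS stream.
dynamic_resources:
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: contour   # assumed name of the management-server cluster
  cds_config:
    ads: {}   # fetch clusters over the single ADS stream
  lds_config:
    ads: {}   # fetch listeners over the same stream
```

Because every resource type shares one gRPC stream, the management server can sequence responses (CDS before EDS, LDS before RDS) instead of racing across independent streams.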
jpeach added a commit to jpeach/contour that referenced this issue Nov 12, 2019
Since the Contour GRPC server already multiplexes Envoy requests
using the resource type URL, we can support ADS simply by enabling
it in the bootstrap configuration.

This fixes projectcontour#1178.
This fixes projectcontour#1286.

Signed-off-by: James Peach <[email protected]>
jpeach added a commit to jpeach/contour that referenced this issue Dec 1, 2019
@jpeach
Contributor

jpeach commented Dec 17, 2019

I generated cluster churn by concurrently adding and deleting HTTPProxy objects. While this was happening, I sent traffic to a different service in the same cluster that I wasn't modifying. The target service never dropped any requests. Restarting Contour didn't cause any dropped requests either.

It's perfectly possible that there is a scenario where updating the Envoy config could cause dropped requests, but I haven't reproduced it yet.

@jpeach
Contributor

jpeach commented Dec 20, 2019

Tested a new scenario. The theory is that when a new envoy pod starts up, it will start taking requests before it has been fully programmed by Contour. If that's the case, we would expect that when we add new envoy pods, there is a chance the new envoy would serve an error.

I experimented with this in a 5 node cluster running envoy as a deployment. When I scale the deployment from 1 to 5 replicas, I never see any dropped requests. When I scale down, however, it is possible for connections to be refused, which could be addressed by improving the graceful drain of the envoy pods.
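One way to improve that graceful drain is a preStop hook that tells Envoy to start failing its health check, so load balancers stop sending new connections before the pod is deleted. This is a sketch, not what Contour ultimately shipped; the admin port 9001 and the 30-second drain window are assumptions to adjust for your bootstrap:

```yaml
# Hypothetical preStop hook for the envoy container. Assumes the Envoy
# admin interface listens on localhost:9001.
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        # POST /healthcheck/fail makes Envoy fail its health check, then
        # we wait for in-flight connections to drain before termination.
        - curl -s -XPOST http://localhost:9001/healthcheck/fail && sleep 30
```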

@lrouquette

I think it'll be harder to reproduce this during an Envoy restart than during a Contour restart. On Envoy start, the requests are naturally ordered: for example, Envoy won't ask for endpoints for a cluster it doesn't know about. That is, it won't even create a gRPC stream for EDS until it receives CDS (the resource hint is the cluster name). Note that RDS is scoped to the listener (e.g. ingress_http or ingress_https), so it's possible for RDS to include routes for clusters unknown to Envoy, and ordering may matter in that case.

Now, on Contour start (with a pre-configured Envoy), Envoy will re-connect its gRPC streams in an unordered fashion. If resources have changed between the disconnect and the reconnect, then bad things may happen if the responses are not ordered. This was definitely more prevalent before this fix, because a single Contour-initiated endpoint update would affect all of them (now it only affects the very one that changed). But you still have the issue of RDS including route configs for clusters that Envoy doesn't know about.

@jpeach
Contributor

jpeach commented Dec 23, 2019

So if there are changes while Contour isn't running, then you can get user-visible errors. However, as Contour comes back up, those errors could transiently change (e.g. 404 → 503 → 200). Is that your observation?

@lrouquette

I'm just saying that's one way to reproduce it. The root of the problem is that Contour receives unordered updates from k8s (which is expected) but should order the xDS responses, and that is not happening.

@jpeach
Contributor

jpeach commented Jan 2, 2020 via email

@lrouquette

Yeah, I hear ya :-) It's definitely not easy to reproduce without explicitly coding the disorder (which won't work for your test). We've seen this issue more often on bigger clusters, say 50 envoy nodes. Perhaps you can try that.

Happy to see ADS in the picture though, that's awesome!

@jpeach
Contributor

jpeach commented Jan 5, 2020 via email

@lrouquette

We worked around it, so I'll have to go back and see if I can repro. I'll let you know.

jpeach added a commit to jpeach/contour that referenced this issue Jan 16, 2020
jpeach added a commit to jpeach/contour that referenced this issue Jan 20, 2020
@skriss skriss removed this from the Backlog milestone Jul 25, 2022

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 25, 2024