xDS updates are not ordered #1178
Thank you for raising this issue. I'm sorry this is a problem for you. We are working to make the split Envoy and Contour deployment the default by the 0.15 milestone (#881). To implement ordering we would probably have to switch to ADS. I know you've asked for this before (sorry, I cannot find the issue at the moment). My answer is the same as previously: implementing ADS is not on the current roadmap. Can you please tell me more about the sequence of events you're seeing? When you say …
For the five 👍's, I encourage you to join the next community meeting (https://projectcontour.io/community/) so we can talk about the issue in a higher-bandwidth format. I'm not saying no to ADS, I just want to understand the issues you're having. If that means I need to rescope ADS and put it on the roadmap, so be it, but I need your help to understand the issues you are experiencing.
From the Envoy docs:
Challenging but not impossible. Also, we're running a single Contour because of the various other protocol issues. It's handling 100+ Envoys, so that could be your recommended split-pod Deployment (i.e. …
That would probably be the best approach, but some coordinated locking on the individual streams could work too. From what I see in the logs, LDS/CDS/RDS have no resource hints, so you can count on only one stream of each. Running this sequence in a loop seems relatively easy (compared to switching to ADS). I've built this type of "ticketing system" using Go channels before. It's not trivial, but not very difficult either.
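A minimal sketch of that channel-based ticketing idea, assuming one goroutine per xDS stream and the make-before-break order from the Envoy docs. The names and structure here are hypothetical, not Contour's actual code:

```go
package main

import "fmt"

// Make-before-break order from the Envoy eventual-consistency docs:
// clusters before endpoints, listeners before routes.
var order = []string{"CDS", "EDS", "LDS", "RDS"}

type update struct{ generation int }

func main() {
	// One ticket channel per xDS type. A stream goroutine may only send
	// its DiscoveryResponse after receiving a ticket, and it reports on
	// done so the next type in the sequence can be released.
	tickets := make(map[string]chan update, len(order))
	done := make(chan string)

	for _, t := range order {
		ch := make(chan update)
		tickets[t] = ch
		go func(typ string, in <-chan update) {
			for u := range in {
				// A real server would write the DiscoveryResponse
				// to its gRPC stream here.
				fmt.Printf("gen %d: sent %s response\n", u.generation, typ)
				done <- typ
			}
		}(t, ch)
	}

	// For each configuration update, hand out tickets in the fixed order,
	// waiting for each response to be sent before releasing the next type.
	for gen := 1; gen <= 2; gen++ {
		for _, t := range order {
			tickets[t] <- update{generation: gen}
			<-done
		}
	}
}
```

The point is only that the per-stream goroutines block on a shared sequencer instead of racing each other; a real implementation would also have to handle streams connecting and disconnecting mid-sequence.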
No worries. I previously asked about incremental xDS in #1058.
Since the Contour GRPC server already multiplexes Envoy requests using the resource type URL, we can support ADS simply by enabling it in the bootstrap configuration. This fixes projectcontour#1178. Signed-off-by: James Peach <[email protected]>
Since the Contour GRPC server already multiplexes Envoy requests using the resource type URL, we can support ADS simply by enabling it in the bootstrap configuration. This fixes projectcontour#1178. This fixes projectcontour#1286. Signed-off-by: James Peach <[email protected]>
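For reference, switching Envoy to ADS is indeed mostly a bootstrap change. Below is a hedged sketch of the relevant (v2-era) bootstrap fragment, assuming Contour's gRPC endpoint is reachable as a static cluster named contour on its default xDS port 8001; the real bootstrap is generated by Contour itself:

```yaml
dynamic_resources:
  # Send CDS/EDS/LDS/RDS over a single aggregated (ADS) stream, which
  # lets the management server control the relative order of responses.
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: contour
  cds_config:
    ads: {}
  lds_config:
    ads: {}

static_resources:
  clusters:
    - name: contour
      type: STRICT_DNS
      connect_timeout: 5s
      http2_protocol_options: {}   # xDS is gRPC, so HTTP/2 is required
      load_assignment:
        cluster_name: contour
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: contour
                      port_value: 8001
```

With ads_config in place, Envoy multiplexes all four resource types onto one stream, so the server can apply the make-before-break ordering itself.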
I generated cluster churn by concurrently adding and deleting HTTPProxy objects. While this was happening, I sent traffic to a different service in the same cluster that I wasn't modifying. The target service never dropped any requests. Restarting Contour didn't cause any dropped requests either. It's perfectly possible that there is a scenario where updating the Envoy config could cause dropped requests, but I haven't reproduced it yet.
Tested a new scenario. The theory is that when a new Envoy pod starts up, it will start taking requests before it has been fully programmed by Contour. If this is the case, we would expect that when we add new Envoy pods, there is a chance that a new Envoy would serve an error. I experimented with this in a 5-node cluster running Envoy as a Deployment. When I scale the Deployment from 1 to 5 replicas, I never see any dropped requests. When I scale down, however, it is possible for connections to be refused, which could be addressed by improving the graceful drain of the Envoy pods.
I think it'll be harder to reproduce this during an Envoy restart than a Contour restart. On Envoy start, the requests are naturally ordered: for example, Envoy won't ask for endpoints for a cluster it doesn't know about. That is to say, it won't even create a gRPC stream for EDS until it receives CDS (the hint is the cluster name). Note that RDS is scoped to the listener (e.g. ingress_http or ingress_https), so it's possible for RDS to include routes for clusters unknown to Envoy (and ordering may matter in this case).

Now, on Contour start (with a pre-configured Envoy), Envoy will reconnect its gRPC streams in an unordered fashion. If resources have changed between the disconnect and the reconnect, then bad things may happen if the responses are not ordered. This was definitely more prevalent before this fix, because a single Contour-initiated endpoint update would affect all of the streams (now it only affects the one that changed). But you still have the issue of RDS including route configs for clusters that Envoy doesn't know about.
So if there are changes while Contour isn't running, then you can get user-visible errors. However, as Contour comes up, those errors could transiently change (e.g. 404 → 503 → 200). Is that your observation?
I'm just saying that is one way to reproduce it. The root of the problem is that Contour receives unordered updates from k8s (which is expected) and should order the xDS responses (which is not happening).
On Jan 3, 2020, at 3:59 AM, Laurent Rouquette wrote:
I'm just saying that is one way to reproduce it. The root of the problem is that Contour receives unordered updates from k8s (which is expected) and should order the xDS responses (which is not happening).
Yup, I'm trying to reproduce the problem so that I can verify whether ADS fixes it. I've tried a few things but have not had any success yet, so if you have any suggestions I'd love to try those too :)
Yeah, I hear ya :-) It's definitely not easy without explicitly coding the disorder (which won't work for your test). We've seen this issue more often on bigger clusters, say 50 Envoy nodes. Perhaps you can try that. Happy to see ADS in the picture though, that's awesome!
On Jan 4, 2020, at 9:12 AM, Laurent Rouquette wrote:
Happy to see ADS in the picture though, that's awesome!
If this reproduces in your deployment, I have an ADS latch you could test.
We worked around it, so I'll have to go back and see if I can repro. I'll let you know.
The Contour project currently lacks enough contributors to adequately respond to all Issues. This bot triages Issues according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack.
I don't see any evidence in Contour's source that xDS updates are ordered correctly.
https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#eventual-consistency-considerations
Without ordering, the update order is subject to whichever goroutine is scheduled first. This race condition matches the behavior we see: a Contour restart doesn't always cause Envoy 404s.
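To make that race concrete, here is a toy illustration (not Contour code) of why independent per-stream goroutines deliver responses in scheduler-dependent order:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	rand.Seed(time.Now().UnixNano())

	// Each xDS type is served on its own gRPC stream by its own goroutine.
	// The random sleeps stand in for scheduling and network jitter.
	received := make(chan string)
	for _, typ := range []string{"CDS", "EDS", "LDS", "RDS"} {
		go func(t string) {
			time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
			received <- t
		}(typ)
	}

	// The arrival order varies from run to run, so Envoy can apply RDS
	// routes that reference clusters CDS hasn't delivered yet.
	for i := 0; i < 4; i++ {
		fmt.Println("Envoy received:", <-received)
	}
}
```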
This is an issue for us because we can't restart or upgrade Contour without causing downtime. Clients see 404s for the period after the ingress_http and ingress_https listeners are torn down, until the restarted Contour has responded to all the DiscoveryRequests.

We're running a split-pod Deployment. Did you test restarting Contour when it's not in the same pod as Envoy? If so, is this expected behavior?