
xDS updates are not ordered #1178

Closed
phylake opened this issue Jun 19, 2019 · 14 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@phylake

phylake commented Jun 19, 2019

I don't see any evidence in Contour's source that xDS updates are ordered correctly

https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#eventual-consistency-considerations

Without explicit ordering, the order in which updates are sent depends on whichever goroutine happens to be scheduled first. That race condition matches the behavior we see: a Contour restart doesn't always cause Envoy 404s.

This is an issue for us because we can't restart or upgrade Contour without causing downtime. Clients see 404s in the window between when the ingress_http and ingress_https listeners are torn down and when the restarted Contour has responded to all of the DiscoveryRequests.

We're running a split-pod Deployment. Did you test restarting Contour when it's not in the same pod as Envoy? If so, is this expected behavior?

@davecheney
Contributor

Thank you for raising this issue. I'm sorry this is a problem for you. We are working to make the split Envoy and Contour deployment the default by the 0.15 milestone (#881).

To implement ordering we would probably have to switch to ADS. I know you've asked for this before (sorry, I can't find the issue at the moment). My answer is the same as before: implementing ADS is not on the current roadmap.

Can you please tell me more about the sequence of events you're seeing? When you say ingress_http and ingress_https are torn down, that is not something I would expect to happen unless all ingress documents were removed.

@davecheney davecheney added the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Jun 20, 2019
@davecheney
Contributor

For the five 👍's: I encourage you to join the next community meeting

https://projectcontour.io/community/

so we can talk about the issue in a higher-bandwidth forum.

I'm not saying no to ADS; I just want to understand the issues you're having. If that means I need to rescope ADS and put it on the roadmap, so be it, but I need your help to understand the issues you are experiencing.

@phylake
Author

phylake commented Jun 20, 2019

From the Envoy docs:

It’s challenging to provide the above guarantees on sequencing to avoid traffic drop when management servers are distributed

Challenging, but not impossible. Also, we're running a single Contour (because of the various other protocol issues) that handles 100+ Envoys, and that setup could be your recommended split-pod Deployment (i.e. replicas: 1). The quoted comment is specific to distributed management servers (plural), implying a singleton management server wouldn't face the same challenges.

To implement ordering we would probably have to switch to ADS

That would probably be the best approach, but some coordinated locking across the individual streams could work too. From what I see in the logs, LDS/CDS/RDS have no resource hints, so you can count on only one DiscoveryRequest of each. EDS is variable, but the set of clusters is known to Contour, so it seems feasible.

Running this sequence in a loop seems relatively easy (compared to switching to ADS):
`CDS → EDS (variable but known) → LDS → RDS`

I've built this type of "ticketing system" using Go channels before. It's not trivial but not very difficult either.
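A minimal sketch of that ticketing idea in Go — all names here (`ticketer`, `runSequence`) are hypothetical illustrations, not anything in Contour's codebase. Each per-type stream handler blocks until its turn, sends its response, then releases the next type in the sequence:

```go
package main

import "fmt"

// xdsOrder lists the response types in the sequence the Envoy docs
// recommend for make-before-break updates: clusters, then the endpoints
// for those clusters, then listeners, then routes.
var xdsOrder = []string{"CDS", "EDS", "LDS", "RDS"}

// ticketer hands out one "go" ticket per xDS type. A stream handler
// blocks on its ticket, sends its DiscoveryResponse, then releases the
// ticket for the next type in the sequence.
type ticketer struct {
	tickets map[string]chan struct{}
}

func newTicketer() *ticketer {
	t := &ticketer{tickets: make(map[string]chan struct{})}
	for _, typ := range xdsOrder {
		t.tickets[typ] = make(chan struct{}, 1)
	}
	t.tickets[xdsOrder[0]] <- struct{}{} // the first type may send immediately
	return t
}

// wait blocks until it is typ's turn to respond.
func (t *ticketer) wait(typ string) { <-t.tickets[typ] }

// done marks typ finished and unblocks the next type in the order.
func (t *ticketer) done(typ string) {
	for i, cur := range xdsOrder {
		if cur == typ && i+1 < len(xdsOrder) {
			t.tickets[xdsOrder[i+1]] <- struct{}{}
		}
	}
}

// runSequence simulates four concurrent per-type stream handlers racing
// to respond, and returns the order the responses were actually sent in.
func runSequence() []string {
	tk := newTicketer()
	sent := make(chan string, len(xdsOrder))
	for _, typ := range xdsOrder {
		typ := typ // capture loop variable for the goroutine
		go func() {
			tk.wait(typ)
			sent <- typ // stand-in for sending the DiscoveryResponse
			tk.done(typ)
		}()
	}
	out := make([]string, 0, len(xdsOrder))
	for range xdsOrder {
		out = append(out, <-sent)
	}
	return out
}

func main() {
	fmt.Println(runSequence()) // always [CDS EDS LDS RDS]
}
```

Even though the four goroutines race, the tickets force the responses out in the fixed order; a real implementation would also need to re-seed the loop on each update cycle.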

I know you've asked for this before (sorry I cannot find the issue at the moment)

No worries. I previously asked about incremental xDS #1058

@davecheney davecheney removed the blocked/needs-info Categorizes the issue or PR as blocked because there is insufficient information to advance it. label Oct 10, 2019
@davecheney davecheney added this to the Backlog milestone Oct 28, 2019
jpeach added a commit to jpeach/contour that referenced this issue Nov 12, 2019
Since the Contour GRPC server already multiplexes Envoy requests
using the resource type URL, we can support ADS simply by enabling
it in the bootstrap configuration.

This fixes projectcontour#1178.

Signed-off-by: James Peach <[email protected]>
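For reference, enabling ADS on the Envoy side is a bootstrap-level change along these lines. This is only a sketch of the shape of that configuration; the `contour` cluster name is an assumption about the local bootstrap, not Contour's shipped config:

```yaml
# Hypothetical Envoy bootstrap fragment: route all xDS over one ADS stream.
dynamic_resources:
  ads_config:
    api_type: GRPC
    grpc_services:
      - envoy_grpc:
          cluster_name: contour   # assumed name of the management-server cluster
  cds_config:
    ads: {}   # fetch clusters over the single ADS stream
  lds_config:
    ads: {}   # fetch listeners over the same stream
```

Because every resource type shares one gRPC stream, the management server can sequence responses (CDS before EDS, LDS before RDS) instead of racing across independent streams.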
jpeach added a commit to jpeach/contour that referenced this issue Nov 12, 2019
Since the Contour GRPC server already multiplexes Envoy requests
using the resource type URL, we can support ADS simply by enabling
it in the bootstrap configuration.

This fixes projectcontour#1178.
This fixes projectcontour#1286.

Signed-off-by: James Peach <[email protected]>
jpeach added a commit to jpeach/contour that referenced this issue Dec 1, 2019
@jpeach
Contributor

jpeach commented Dec 17, 2019

I generated cluster churn by concurrently adding and deleting HTTPProxy objects. While this was happening, I sent traffic to a different service in the same cluster that I wasn't modifying. The target service never dropped any requests. Restarting Contour didn't cause any dropped requests either.

It's perfectly possible that there is a scenario where updating the Envoy config could cause dropped requests, but I haven't reproduced it yet.

@jpeach
Contributor

jpeach commented Dec 20, 2019

Tested a new scenario. The theory is that when a new envoy pod starts up, it will start taking requests before it has been fully programmed by Contour. If that's the case, we would expect that when we add new envoy pods, there is a chance the new envoy would serve an error.

I experimented with this in a 5 node cluster running envoy as a deployment. When I scale the deployment from 1 to 5 replicas, I never see any dropped requests. When I scale down, however, it is possible for connections to be refused, which could be addressed by improving the graceful drain of the envoy pods.
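One way to improve that graceful drain is a preStop hook that tells Envoy to start failing its health check, so load balancers stop sending new connections before the pod is deleted. This is a sketch, not what Contour ultimately shipped; the admin port 9001 and the 30-second drain window are assumptions to adjust for your bootstrap:

```yaml
# Hypothetical preStop hook for the envoy container. Assumes the Envoy
# admin interface listens on localhost:9001.
lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        # POST /healthcheck/fail makes Envoy fail its health check, then
        # we wait for in-flight connections to drain before termination.
        - curl -s -XPOST http://localhost:9001/healthcheck/fail && sleep 30
```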

@lrouquette

I think it'll be harder to reproduce this during an Envoy restart than during a Contour restart. On Envoy start, the requests are naturally ordered: for example, Envoy won't ask for endpoints for a cluster it doesn't know about. That is, it won't even create a gRPC stream for EDS until it receives CDS (the resource hint is the cluster name). Note that RDS is scoped to the listener (e.g. ingress_http or ingress_https), so it's possible for RDS to include routes for clusters unknown to Envoy, and ordering may matter in that case.

Now, on Contour start (with a pre-configured Envoy), Envoy will re-connect its gRPC streams in an unordered fashion. If resources have changed between the disconnect and the reconnect, then bad things may happen if the responses are not ordered. This was definitely more prevalent before this fix, because a single Contour-initiated endpoint update would affect all of them (now it only affects the very one that changed). But you still have the issue of RDS including route configs for clusters that Envoy doesn't know about.

@jpeach
Contributor

jpeach commented Dec 23, 2019

So if there are changes while Contour isn't running, then you can get user-visible errors. However, as Contour comes back up, those errors could transiently change (e.g. 404 → 503 → 200). Is that your observation?

@lrouquette

I'm just saying that's one way to reproduce it. The root of the problem is that Contour receives unordered updates from k8s (which is expected) but should order the xDS responses, and that is not happening.

@jpeach
Contributor

jpeach commented Jan 2, 2020 via email

@lrouquette

Yeah, I hear ya :-) It's definitely not easy to reproduce without explicitly coding the disorder (which won't work for your test). We've seen this issue more often on bigger clusters, say 50 envoy nodes. Perhaps you can try that.

Happy to see ADS in the picture though, that's awesome!

@jpeach
Contributor

jpeach commented Jan 5, 2020 via email

@lrouquette

We worked around it, so I'll have to go back and see if I can repro. I'll let you know.

jpeach added a commit to jpeach/contour that referenced this issue Jan 16, 2020
jpeach added a commit to jpeach/contour that referenced this issue Jan 20, 2020
@skriss skriss removed this from the Backlog milestone Jul 25, 2022

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 23, 2024

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 25, 2024