
Wait to discover all K8s resources before sending xDS responses #1280

Closed
2 of 3 tasks
phylake opened this issue Jul 29, 2019 · 7 comments · Fixed by #5672
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

phylake commented Jul 29, 2019

As far as I can tell Contour will send whatever resources it has whenever xDS DiscoveryRequests come in. This can cause xDS resources to be torn down in Envoy on a Contour restart. Contour should do a synchronous (read: eager and blocking) lookup of its CRDs, Ingresses, Services, Endpoints, Secrets, etc. on startup before responding to DiscoveryRequests.

This is causing a variety of issues depending on the resource. Examples include:

  1. Missing EDS causes 503s
  2. Missing RDS causes 404s
  3. Missing SDS causes SSL handshake errors
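Roughly what that gate could look like (a hypothetical sketch, not Contour's actual code): the xDS stream handlers simply refuse to answer until an initial, blocking discovery pass has completed.

```go
package xds

import "context"

// Hypothetical sketch, not Contour's implementation: an xDS server that holds
// every DiscoveryRequest until an initial, blocking discovery pass over CRDs,
// Ingresses, Services, Endpoints and Secrets has populated its cache.
type xdsServer struct {
	ready <-chan struct{} // closed once the initial discovery pass finishes
}

// waitUntilReady blocks an incoming stream until startup discovery is done,
// so a freshly restarted Contour never answers with a partial resource set.
func (s *xdsServer) waitUntilReady(ctx context.Context) error {
	select {
	case <-s.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```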

/cc @bryanlatten @lrouquette


Tasks

  • Wait for k8s informer caches to sync
  • Wait for dag to be fully built before starting gRPC server


phylake commented Jul 31, 2019

This is different from #1178, which is about ordering. Even if #1178 is fixed, Contour will still send only a subset of resources, subject to K8s informer timing, which causes 404s in Envoy when Contour restarts because Contour hasn't yet discovered all the resources in K8s that it previously programmed Envoy with.

Example for CDS:

Contour1 is the first Contour instance
Contour2 is the second Contour instance (i.e. the one that starts after Contour1 dies)
Cluster1 is some upstream service
Cluster2 is some other upstream service

Contour1 discovers Cluster1 and responds to CDS with it. It later discovers Cluster2 and responds to CDS with Cluster1 and Cluster2. We'll consider Envoy "fully programmed" at this point for the purposes of this example. Contour1 is then terminated for some reason (upgrade, node death, etc.).

Contour2 discovers Cluster1 and responds to CDS with it. Note that Cluster2 is missing which means Envoy will remove it. From the docs:

LDS/CDS resource hints will always be empty and it is expected that the management server will provide the complete state of the LDS/CDS resources in each response. An absent Listener or Cluster will be deleted.

Envoy will also remove the corresponding EDS and RDS.

When a Listener or Cluster is deleted, its corresponding EDS and RDS resources are also deleted inside the Envoy instance.

Whatever upstream Cluster2 described (and its corresponding resources) now returns 404s until Contour2 discovers Cluster2 and programs Envoy with Cluster1 and Cluster2. Assuming #1178 is fixed, EDS and RDS will then arrive in order per the xDS protocol spec.

This problem isn't exclusive to CDS. It's a general problem that needs to be addressed for all K8s resources that can end up in xDS responses.
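To make the delete concrete, here's a toy illustration (not Envoy's code) of the state-of-the-world semantics: any cluster absent from the latest CDS response is treated as removed.

```go
package main

import "fmt"

func main() {
	// What Contour1 had programmed Envoy with before it died.
	previous := []string{"Cluster1", "Cluster2"}

	// Contour2's first CDS response: it has only discovered Cluster1 so far.
	latest := map[string]bool{"Cluster1": true}

	// State-of-the-world semantics: anything missing from the latest response
	// is deleted, taking its EDS and RDS state with it.
	for _, name := range previous {
		if !latest[name] {
			fmt.Printf("%s is deleted; traffic for it now fails until it is re-discovered\n", name)
		}
	}
}
```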

@bryanlatten

Seemingly related: #1286

@stevesloka

Yup, we do wait for the caches to sync (https://github.com/heptio/contour/blob/master/cmd/contour/serve.go#L383), but we're not blocking the gRPC/xDS server on that sync.

@stevesloka stevesloka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 7, 2019
@lrouquette

Sure, you're syncing on an empty cache (the SharedInformer Store is empty), so you're not going to wait very long. We implemented a synchronous discovery of the resources, but that has some ramifications preventing me from filing a PR as-is. I think what needs to happen here is something like this:

  1. Start the SharedInformer
  2. Wait for it to Sync
  3. Wait for the cache to sync
  4. Then, and only then, start the gRPC server

I'll see if I can come up with something
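A rough sketch of that ordering with client-go (illustrative wiring, assuming an in-cluster config; not how Contour structures it today):

```go
package main

import (
	"log"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	stop := make(chan struct{})
	factory := informers.NewSharedInformerFactory(client, 0)

	// Register the informers before Start, otherwise their stores stay empty
	// and there is nothing meaningful to wait on.
	svc := factory.Core().V1().Services().Informer()
	ep := factory.Core().V1().Endpoints().Informer()
	sec := factory.Core().V1().Secrets().Informer()

	// 1. Start the SharedInformers.
	factory.Start(stop)

	// 2-3. Block until the initial LIST for each informer has landed in the cache.
	if !cache.WaitForCacheSync(stop, svc.HasSynced, ep.HasSynced, sec.HasSynced) {
		log.Fatal("informer caches never synced")
	}

	// 4. Then, and only then, start the gRPC server and answer xDS requests.
	startGRPCServer()
}

func startGRPCServer() { /* register the xDS services and call Serve() */ }
```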

@davecheney davecheney added the blocked Blocked waiting on a dependency label Aug 19, 2019
@davecheney davecheney added this to the 1.0.0-beta.1 milestone Aug 19, 2019
@davecheney

Moving tentatively to the next milestone. The plan is to not bring up the gRPC server until the shared informer has synced and we have populated the gRPC cache at least once.

@stevesloka

In #1765 we added some logic to have the caches sync before starting the gRPC server; however, it's possible that the final DAG still isn't built by then. Leaving this as the remaining open task.
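One way to close that remaining gap, sketched with hypothetical names (this is not the #1765 code): have the rebuild loop close a channel after the first complete DAG build, and hold off serving gRPC until then.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// firstBuild is closed by the DAG rebuild loop after its first complete build,
// i.e. once the xDS caches reflect everything the synced informers delivered.
// Hypothetical wiring, for illustration only.
var firstBuild = make(chan struct{})

func serveXDS(addr string) error {
	<-firstBuild // don't accept xDS streams against a half-built DAG

	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	// register the CDS/EDS/LDS/RDS/SDS handlers here before serving
	return srv.Serve(lis)
}

func main() {
	go func() {
		// stand-in for the real rebuild loop; it would close firstBuild after
		// the DAG has been built once from the fully synced caches.
		close(firstBuild)
	}()
	if err := serveXDS("127.0.0.1:8001"); err != nil {
		log.Fatal(err)
	}
}
```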

@stevesloka stevesloka modified the milestones: 1.0.0-rc.2, Backlog Oct 24, 2019
@stevesloka stevesloka removed their assignment Oct 24, 2019
@skriss skriss removed this from the Backlog milestone Jul 25, 2022
@sunjayBhatia sunjayBhatia added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 26, 2022
@m-yosefpor

This issue is seriously affecting us: we have many HTTPProxy resources, and every Contour restart causes several minutes of downtime!

$ oc get httpproxy -A | wc -l
3407
