
Wait to discover all K8s resources before sending xDS responses #1280

Closed
2 of 3 tasks
phylake opened this issue Jul 29, 2019 · 7 comments · Fixed by #5672
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

phylake commented Jul 29, 2019

As far as I can tell Contour will send whatever resources it has whenever xDS DiscoveryRequests come in. This can cause xDS resources to be torn down in Envoy on a Contour restart. Contour should do a synchronous (read: eager and blocking) lookup of its CRDs, Ingresses, Services, Endpoints, Secrets, etc. on startup before responding to DiscoveryRequests.

This is causing a variety of issues depending on the resource. Examples include:

  1. Missing EDS causes 503s
  2. Missing RDS causes 404s
  3. Missing SDS causes SSL handshake errors
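Roughly what that gate could look like (a hypothetical sketch, not Contour's actual code): the xDS stream handlers simply refuse to answer until an initial, blocking discovery pass has completed.

```go
package xds

import "context"

// Hypothetical sketch, not Contour's implementation: an xDS server that holds
// every DiscoveryRequest until an initial, blocking discovery pass over CRDs,
// Ingresses, Services, Endpoints and Secrets has populated its cache.
type xdsServer struct {
	ready <-chan struct{} // closed once the initial discovery pass finishes
}

// waitUntilReady blocks an incoming stream until startup discovery is done,
// so a freshly restarted Contour never answers with a partial resource set.
func (s *xdsServer) waitUntilReady(ctx context.Context) error {
	select {
	case <-s.ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```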

/cc @bryanlatten @lrouquette


Tasks

  • Wait for k8s informer caches to sync
  • Wait for dag to be fully built before starting gRPC server


phylake commented Jul 31, 2019

This is different from #1178, which is about ordering. Even if #1178 is fixed, Contour will still send only a subset of resources, subject to K8s informer timing, which causes 404s in Envoy when Contour restarts because Contour hasn't yet discovered all the resources in K8s that it previously programmed Envoy with.

Example for CDS:

Contour1 is the first Contour instance
Contour2 is the second Contour instance (i.e. the one that starts after Contour1 dies)
Cluster1 is some upstream service
Cluster2 is some other upstream service

Contour1 discovers Cluster1 and responds to CDS with it. It later discovers Cluster2 and responds to CDS with Cluster1 and Cluster2. We'll consider Envoy "fully programmed" at this point for the purposes of this example. Contour1 is then terminated for some reason (upgrade, node death, etc.).

Contour2 discovers Cluster1 and responds to CDS with it. Note that Cluster2 is missing which means Envoy will remove it. From the docs:

LDS/CDS resource hints will always be empty and it is expected that the management server will provide the complete state of the LDS/CDS resources in each response. An absent Listener or Cluster will be deleted.

Envoy will also remove the corresponding EDS and RDS.

When a Listener or Cluster is deleted, its corresponding EDS and RDS resources are also deleted inside the Envoy instance.

Whatever upstream Cluster2 described (and its corresponding resources) now returns 404s until Contour2 discovers Cluster2 and programs Envoy with Cluster1 and Cluster2. Assuming #1178 is fixed, EDS and RDS will then arrive in order per the xDS protocol spec.

This problem isn't exclusive to CDS. It's a general problem that needs to be addressed for all K8s resources that can end up in xDS responses.
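To make the delete concrete, here's a toy illustration (not Envoy's code) of the state-of-the-world semantics: any cluster absent from the latest CDS response is treated as removed.

```go
package main

import "fmt"

func main() {
	// What Contour1 had programmed Envoy with before it died.
	previous := []string{"Cluster1", "Cluster2"}

	// Contour2's first CDS response: it has only discovered Cluster1 so far.
	latest := map[string]bool{"Cluster1": true}

	// State-of-the-world semantics: anything missing from the latest response
	// is deleted, taking its EDS and RDS state with it.
	for _, name := range previous {
		if !latest[name] {
			fmt.Printf("%s is deleted; traffic for it now fails until it is re-discovered\n", name)
		}
	}
}
```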

@bryanlatten

Seemingly related: #1286

@stevesloka

Yup, we do wait for the caches to sync (https://github.com/heptio/contour/blob/master/cmd/contour/serve.go#L383), but we're not blocking the gRPC/xDS server on that sync.

@stevesloka stevesloka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 7, 2019
@lrouquette

Sure, you're syncing on an empty cache (the SharedInformer Store is empty), so you're not going to wait very long. We implemented a synchronous discovery of the resources, but that has some ramifications preventing me from filing a PR as-is. I think what needs to happen here is something like this:

  1. Start the SharedInformer
  2. Wait for it to Sync
  3. Wait for the cache to sync
  4. Then, and only then, start the gRPC server

I'll see if I can come up with something
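A rough sketch of that ordering with client-go (illustrative wiring, assuming an in-cluster config; not how Contour structures it today):

```go
package main

import (
	"log"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	stop := make(chan struct{})
	factory := informers.NewSharedInformerFactory(client, 0)

	// Register the informers before Start, otherwise their stores stay empty
	// and there is nothing meaningful to wait on.
	svc := factory.Core().V1().Services().Informer()
	ep := factory.Core().V1().Endpoints().Informer()
	sec := factory.Core().V1().Secrets().Informer()

	// 1. Start the SharedInformers.
	factory.Start(stop)

	// 2-3. Block until the initial LIST for each informer has landed in the cache.
	if !cache.WaitForCacheSync(stop, svc.HasSynced, ep.HasSynced, sec.HasSynced) {
		log.Fatal("informer caches never synced")
	}

	// 4. Then, and only then, start the gRPC server and answer xDS requests.
	startGRPCServer()
}

func startGRPCServer() { /* register the xDS services and call Serve() */ }
```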

@davecheney davecheney added the blocked Blocked waiting on a dependency label Aug 19, 2019
@davecheney davecheney added this to the 1.0.0-beta.1 milestone Aug 19, 2019
@davecheney

Moving tentatively to the next milestone. The plan is to not bring up the gRPC server until the shared informer has synced and we have populated the gRPC cache at least once.

@stevesloka

In #1765 we added some logic to have the caches sync before starting the gRPC server; however, it's possible that the final DAG still isn't built by then. Leaving this as the remaining open task.
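One way to close that remaining gap, sketched with hypothetical names (this is not the #1765 code): have the rebuild loop close a channel after the first complete DAG build, and hold off serving gRPC until then.

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
)

// firstBuild is closed by the DAG rebuild loop after its first complete build,
// i.e. once the xDS caches reflect everything the synced informers delivered.
// Hypothetical wiring, for illustration only.
var firstBuild = make(chan struct{})

func serveXDS(addr string) error {
	<-firstBuild // don't accept xDS streams against a half-built DAG

	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	srv := grpc.NewServer()
	// register the CDS/EDS/LDS/RDS/SDS handlers here before serving
	return srv.Serve(lis)
}

func main() {
	go func() {
		// stand-in for the real rebuild loop; it would close firstBuild after
		// the DAG has been built once from the fully synced caches.
		close(firstBuild)
	}()
	if err := serveXDS("127.0.0.1:8001"); err != nil {
		log.Fatal(err)
	}
}
```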

@stevesloka stevesloka modified the milestones: 1.0.0-rc.2, Backlog Oct 24, 2019
@stevesloka stevesloka removed their assignment Oct 24, 2019
@skriss skriss removed this from the Backlog milestone Jul 25, 2022
@sunjayBhatia sunjayBhatia added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Sep 26, 2022
@m-yosefpor

This issue is seriously affecting us: we have many HTTPProxy resources, and every Contour restart causes several minutes of downtime!

$ oc get httpproxy -A | wc -l
3407
