
Cache sync latency #5550

Closed
therealak12 opened this issue Jul 12, 2023 · 6 comments · Fixed by #5672
Labels
kind/bug Categorizes issue or PR as related to a bug.

@therealak12
Contributor

therealak12 commented Jul 12, 2023

What steps did you take and what happened:

  1. We have a cluster with around 3000 httpproxy objects. In addition, the connection between etcd and the disk used as its storage is slow.
  2. When the latency escalates, the controller manager process terminates with a "leader election lost" error message.
  3. After it restarts, the new process marks nearly all httpproxy objects as invalid, with error messages such as "secret for fallback certificate not found" or "target service not found".

What did you expect to happen:
The controller should wait for the cache sync and then start validating httpproxy objects.

Anything else you would like to add:
I'm not sure if the cache sync latency is the issue. Another possibility is the latency in reconciling the TLSCertificateDelegation resource.

Environment:

  • Contour version: v1.25.0
  • Kubernetes version (kubectl version): v1.23.3
  • Kubernetes installer & version: okd4 (installation process)
  • Cloud provider or hardware configuration:
    CPU: 24 (Skylake, IBRS)
    Arch: x86_64
    Memory: 62Gi
  • OS (e.g. from /etc/os-release): CentOS 7
@therealak12 therealak12 added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Jul 12, 2023
@github-actions

Hey @therealak12! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Contour. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace

@therealak12
Contributor Author

therealak12 commented Aug 8, 2023

Some more details on this issue:

Contour maintains an additional cache, separate from controller-runtime's cache. As far as I understand, this cache is filled gradually as the OnAdd, OnUpdate, and OnDelete events from controller-runtime (or, more precisely, client-go) are fired. Here are some important points in the code:

The challenge is that the eventHandler added to the manager implements only the LeaderElectionRunnable interface. IMHO, it should also implement the hasCache interface so that the manager can detect when the cache is synced.
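
For illustration, a minimal sketch of that idea, assuming the manager's internal hasCache check looks for a GetCache method; the names and wiring below are illustrative, not Contour's actual code:

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// eventHandler is an illustrative stand-in for Contour's event handler.
type eventHandler struct {
	cache cache.Cache // the manager's cache, injected at construction time
}

// Start satisfies manager.Runnable: process events until the context is cancelled.
func (e *eventHandler) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// NeedLeaderElection satisfies manager.LeaderElectionRunnable.
func (e *eventHandler) NeedLeaderElection() bool { return true }

// GetCache exposes the cache this runnable depends on. With such a method the
// manager could treat the handler as cache-backed and delay Start until the
// cache has synced, instead of starting it against a half-filled view.
func (e *eventHandler) GetCache() cache.Cache { return e.cache }
```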

In our cluster, which utilizes a slow-disk etcd, the sequence of events is as follows:

  1. The WaitForCacheSync function is called to ensure the controller-runtime cache is synced.
  2. The operator tries to find a delegated secret here in the DAG's internal cache, which is not necessarily in sync with the most recent etcd state.
  3. Consequently, the httpproxy objects are marked as invalid one by one with a SecretNotValid condition and a "secret not found" error message.

A straightforward solution is to implement a WaitForCacheSync method for the internal cache so that controller-runtime can ensure it is synced. I'm not 100% sure how to implement this method, but I'm trying to figure it out. I'd really appreciate any help with this, or any other ideas on how to solve the issue.
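
One possible shape for such a method, sketched under the assumption that the internal cache can learn when its initial fill is complete (all names below are illustrative, not Contour's actual API):

```go
package sketch

import "context"

// internalCache is an illustrative stand-in for the DAG's internal cache.
type internalCache struct {
	initialSyncDone chan struct{} // closed once the initial set of objects has been handled
}

// WaitForCacheSync blocks until the initial sync completes or ctx is cancelled,
// mirroring the semantics of controller-runtime's cache WaitForCacheSync.
func (c *internalCache) WaitForCacheSync(ctx context.Context) bool {
	select {
	case <-c.initialSyncDone:
		return true
	case <-ctx.Done():
		return false
	}
}
```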

@therealak12
Contributor Author

The client-go informers initially list their specific resources. Later, the HasSynced function checks whether the items from that initial list have all been popped.

The same approach is possible for Contour's DAG internal cache, though it is not trivial. We would have to list the resources we are interested in and pass them to the contourHandler. The handler itself should then pop each object from the list after handling it. A new HasSynced method can be introduced that checks whether all of the list items have been popped.
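
A rough sketch of that bookkeeping, with illustrative names rather than Contour's actual types: the handler is seeded with the keys returned by the initial List, removes each key as the corresponding object is handled, and HasSynced reports whether the seed set is empty.

```go
package sketch

import "sync"

// syncTracker records which objects from the initial List are still unhandled.
type syncTracker struct {
	mu      sync.Mutex
	pending map[string]struct{} // keys from the initial List that have not been handled yet
}

// Seed stores the keys of the objects returned by the initial List call.
func (t *syncTracker) Seed(keys []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending = make(map[string]struct{}, len(keys))
	for _, k := range keys {
		t.pending[k] = struct{}{}
	}
}

// MarkHandled pops a key once the handler has processed the corresponding object.
func (t *syncTracker) MarkHandled(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, key)
}

// HasSynced reports whether every object from the initial list has been handled.
func (t *syncTracker) HasSynced() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.pending) == 0
}
```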

As a workaround, we could make HoldoffDelay and HoldoffDelayMax configurable so that clusters with slow etcd can set them to higher values and ensure the DAG is built only after the full cache sync.
Increasing these values might add latency to the reconciliation of ingress objects, but that is preferable to the downtime caused by invalidated ingress objects.
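
If that workaround were pursued, the configuration could look roughly like this; the field names and defaults below are placeholders, not Contour's actual configuration surface:

```go
package sketch

import "time"

// EventHandlerConfig is an illustrative configuration block for the holdoff workaround.
type EventHandlerConfig struct {
	// HoldoffDelay is the quiet period after an event before the DAG is rebuilt.
	HoldoffDelay time.Duration
	// HoldoffMaxDelay caps how long a rebuild can be deferred while events keep arriving.
	HoldoffMaxDelay time.Duration
}

// withDefaults fills in placeholder defaults for unset fields.
func (c EventHandlerConfig) withDefaults() EventHandlerConfig {
	if c.HoldoffDelay == 0 {
		c.HoldoffDelay = 100 * time.Millisecond
	}
	if c.HoldoffMaxDelay == 0 {
		c.HoldoffMaxDelay = 500 * time.Millisecond
	}
	return c
}
```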

This is just my interpretation of Contour's logic. If any of the maintainers could confirm these suggestions, I'm happy to implement them. Thanks in advance.

@tsaarni
Member

tsaarni commented Aug 9, 2023

I'm not familiar with this, so take the following with a grain of salt, but some time ago I saw that client-go and controller-runtime added a new parameter, isInInitialList bool, to OnAdd(). When I added a debug print for it here, it appeared to be true for objects added during the initial synchronization and false for ones added later.

I did not attempt to find out more, to prove this is the case, or to see whether there are other ways. Anyhow, it seemed a bit backward to me, since from that boolean alone you still do not know whether a given OnAdd(isInInitialList=true) call is the last one there will be, other than maybe by having a timeout that gets extended by a short period at every OnAdd(isInInitialList=true) call.
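
For reference, a minimal sketch of observing that flag with client-go's detailed handler funcs; only the client-go types are real, the rest is illustrative:

```go
package sketch

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// newDebugHandler prints the isInInitialList flag for every add notification.
func newDebugHandler() cache.ResourceEventHandler {
	return cache.ResourceEventHandlerDetailedFuncs{
		// AddFunc receives isInInitialList: true for objects delivered as part
		// of the informer's initial list, false for objects added afterwards.
		// As noted above, the flag alone does not announce which initial-list
		// delivery is the last one.
		AddFunc: func(obj interface{}, isInInitialList bool) {
			fmt.Printf("OnAdd(isInInitialList=%v): %T\n", isInInitialList, obj)
		},
	}
}
```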

@therealak12
Contributor Author

Many thanks!
This isInInitialList flag aligns perfectly with the requirement!
I'll investigate the existing implementations that use this boolean and try to implement the same for the DAG's internal cache (if you agree).

@tsaarni
Member

tsaarni commented Aug 9, 2023

I'll investigate the existing implementations that use this boolean and try to implement the same for the DAG's internal cache (if you agree).

It would be absolutely fantastic if you do! 😄 🎉

@skriss skriss added this to the 1.27.0 milestone Aug 21, 2023
@skriss skriss added this to Contour Aug 21, 2023
@skriss skriss moved this to In Progress in Contour Aug 21, 2023
@skriss skriss removed the lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. label Oct 4, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Contour Oct 10, 2023