
Cache sync latency #5550

Closed
therealak12 opened this issue Jul 12, 2023 · 6 comments · Fixed by #5672
Labels
kind/bug Categorizes issue or PR as related to a bug.

@therealak12
Contributor

therealak12 commented Jul 12, 2023

What steps did you take and what happened:

  1. We have a cluster with around 3000 httpproxy objects. In addition, the connection between etcd and the disk used as its storage is slow.
  2. When the latency escalates, the controller manager process terminates with a "leader election lost" error message.
  3. After it restarts, the new process marks nearly all httpproxy objects as invalid, with error messages such as "secret for fallback certificate not found" or "target service not found".

What did you expect to happen:
The controller should wait for the cache sync and then start validating httpproxy objects.

Anything else you would like to add:
I'm not sure if the cache sync latency is the issue. Another possibility is the latency in reconciling the TLSCertificateDelegation resource.

Environment:

  • Contour version: v1.25.0
  • Kubernetes version (kubectl version): v1.23.3
  • Kubernetes installer & version: okd4 (installation process)
  • Cloud provider or hardware configuration:
    CPU: 24 (Skylake, IBRS)
    Arch: x86_64
    Memory: 62Gi
  • OS (e.g. from /etc/os-release): CentOS 7
@therealak12 therealak12 added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Jul 12, 2023
@github-actions

Hey @therealak12! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Contour. You can also join us on our mailing list and in our channel in the Kubernetes Slack Workspace

@therealak12
Contributor Author

therealak12 commented Aug 8, 2023

Some more details on this issue:

Contour maintains an additional cache, separate from controller-runtime's cache. As far as I understand, this cache is filled gradually as the OnAdd, OnUpdate, and OnDelete events from controller-runtime (or, more precisely, client-go) are fired. Here are some important points in the code:

The challenge is that the eventHandler added to the manager implements only the LeaderElectionRunnable interface. IMHO, it should also implement the hasCache interface so that the manager can detect when the cache is synced.
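
For illustration, a minimal sketch of that idea, assuming the manager's internal hasCache check looks for a GetCache method; the names and wiring below are illustrative, not Contour's actual code:

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// eventHandler is an illustrative stand-in for Contour's event handler.
type eventHandler struct {
	cache cache.Cache // the manager's cache, injected at construction time
}

// Start satisfies manager.Runnable: process events until the context is cancelled.
func (e *eventHandler) Start(ctx context.Context) error {
	<-ctx.Done()
	return nil
}

// NeedLeaderElection satisfies manager.LeaderElectionRunnable.
func (e *eventHandler) NeedLeaderElection() bool { return true }

// GetCache exposes the cache this runnable depends on. With such a method the
// manager could treat the handler as cache-backed and delay Start until the
// cache has synced, instead of starting it against a half-filled view.
func (e *eventHandler) GetCache() cache.Cache { return e.cache }
```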

In our cluster, which utilizes a slow-disk etcd, the sequence of events is as follows:

  1. The WaitForCacheSync function is called to ensure the controller-runtime cache is synced.
  2. The operator tries to find a delegated secret here in the DAG's internal cache, which is not necessarily in sync with the most recent etcd state.
  3. Consequently, the httpproxy objects are marked as invalid one by one with a SecretNotValid condition and a "secret not found" error message.

A straightforward solution is to implement a WaitForCacheSync method for the internal cache so that controller-runtime can ensure it is synced. I'm not 100% sure how to implement this method, but I'm trying to figure it out. I'd really appreciate any help with this, or any other ideas on how to solve the issue.
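
One possible shape for such a method, sketched under the assumption that the internal cache can learn when its initial fill is complete (all names below are illustrative, not Contour's actual API):

```go
package sketch

import "context"

// internalCache is an illustrative stand-in for the DAG's internal cache.
type internalCache struct {
	initialSyncDone chan struct{} // closed once the initial set of objects has been handled
}

// WaitForCacheSync blocks until the initial sync completes or ctx is cancelled,
// mirroring the semantics of controller-runtime's cache WaitForCacheSync.
func (c *internalCache) WaitForCacheSync(ctx context.Context) bool {
	select {
	case <-c.initialSyncDone:
		return true
	case <-ctx.Done():
		return false
	}
}
```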

@therealak12
Contributor Author

The client-go informers initially list their specific resources. Later, the HasSynced function checks whether the items from that initial list have all been popped.

The same approach is possible for Contour's DAG internal cache, though it is not trivial. We would have to list the resources we are interested in and pass them to the contourHandler. The handler itself should then pop each object from the list after handling it. A new HasSynced method can be introduced that checks whether all of the list items have been popped.
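
A rough sketch of that bookkeeping, with illustrative names rather than Contour's actual types: the handler is seeded with the keys returned by the initial List, removes each key as the corresponding object is handled, and HasSynced reports whether the seed set is empty.

```go
package sketch

import "sync"

// syncTracker records which objects from the initial List are still unhandled.
type syncTracker struct {
	mu      sync.Mutex
	pending map[string]struct{} // keys from the initial List that have not been handled yet
}

// Seed stores the keys of the objects returned by the initial List call.
func (t *syncTracker) Seed(keys []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending = make(map[string]struct{}, len(keys))
	for _, k := range keys {
		t.pending[k] = struct{}{}
	}
}

// MarkHandled pops a key once the handler has processed the corresponding object.
func (t *syncTracker) MarkHandled(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.pending, key)
}

// HasSynced reports whether every object from the initial list has been handled.
func (t *syncTracker) HasSynced() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.pending) == 0
}
```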

As a workaround, we could make HoldoffDelay and HoldoffDelayMax configurable so that clusters with slow etcd can set them to higher values and ensure the DAG is built only after the full cache sync.
Increasing these values might add latency to the reconciliation of ingress objects, but that is preferable to the downtime caused by invalidated ingress objects.
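
If that workaround were pursued, the configuration could look roughly like this; the field names and defaults below are placeholders, not Contour's actual configuration surface:

```go
package sketch

import "time"

// EventHandlerConfig is an illustrative configuration block for the holdoff workaround.
type EventHandlerConfig struct {
	// HoldoffDelay is the quiet period after an event before the DAG is rebuilt.
	HoldoffDelay time.Duration
	// HoldoffMaxDelay caps how long a rebuild can be deferred while events keep arriving.
	HoldoffMaxDelay time.Duration
}

// withDefaults fills in placeholder defaults for unset fields.
func (c EventHandlerConfig) withDefaults() EventHandlerConfig {
	if c.HoldoffDelay == 0 {
		c.HoldoffDelay = 100 * time.Millisecond
	}
	if c.HoldoffMaxDelay == 0 {
		c.HoldoffMaxDelay = 500 * time.Millisecond
	}
	return c
}
```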

This is just my interpretation of Contour's logic. If any of the maintainers could confirm these suggestions, I'm happy to implement them. Thanks in advance.

@tsaarni
Member

tsaarni commented Aug 9, 2023

I'm not familiar with this, so take the following with a grain of salt, but some time ago I saw that client-go and controller-runtime added a new parameter, isInInitialList bool, to OnAdd(). When I added a debug print for it here, it appeared to be true for objects added during the initial synchronization and false for ones added later.

I did not attempt to find out more, to prove this is the case, or to see whether there are other ways. Anyhow, it seemed a bit backward to me, since from that boolean alone you still do not know whether a given OnAdd(isInInitialList=true) call is the last one there will be, other than maybe by having a timeout that gets extended by a short period at every OnAdd(isInInitialList=true) call.
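
For reference, a minimal sketch of observing that flag with client-go's detailed handler funcs; only the client-go types are real, the rest is illustrative:

```go
package sketch

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// newDebugHandler prints the isInInitialList flag for every add notification.
func newDebugHandler() cache.ResourceEventHandler {
	return cache.ResourceEventHandlerDetailedFuncs{
		// AddFunc receives isInInitialList: true for objects delivered as part
		// of the informer's initial list, false for objects added afterwards.
		// As noted above, the flag alone does not announce which initial-list
		// delivery is the last one.
		AddFunc: func(obj interface{}, isInInitialList bool) {
			fmt.Printf("OnAdd(isInInitialList=%v): %T\n", isInInitialList, obj)
		},
	}
}
```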

@therealak12
Contributor Author

Many thanks!
This isInInitialList flag aligns perfectly with the requirement!
I'll investigate the existing implementations that use this boolean and try to implement the same for the DAG's internal cache (if you agree).

@tsaarni
Member

tsaarni commented Aug 9, 2023

I'll investigate the existing implementations that use this boolean and try to implement the same for the DAG's internal cache (if you agree).

It would be absolutely fantastic if you do! 😄 🎉

@skriss skriss added this to the 1.27.0 milestone Aug 21, 2023
@skriss skriss added this to Contour Aug 21, 2023
@skriss skriss moved this to In Progress in Contour Aug 21, 2023
@skriss skriss removed the lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. label Oct 4, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Contour Oct 10, 2023