Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catalogmetadata client adds extreme latency to reconciliation #914

Closed
Tracked by #950
joelanford opened this issue Jun 7, 2024 · 5 comments · Fixed by #965
Closed
Tracked by #950

catalogmetadata client adds extreme latency to reconciliation #914

joelanford opened this issue Jun 7, 2024 · 5 comments · Fixed by #965
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. v1.0 Issues related to the initial stable release of OLMv1

Comments

@joelanford
Copy link
Member

joelanford commented Jun 7, 2024

I've narrowed this latency down to the Bundles() call that is made for every single reconcile.

I added logging to the catalogmetadata.Client.Bundles() call to report start time and end time, and the function consistently takes around 30s, and it pegs a CPU.

This seems like a bug we should treat with a fairly high priority.

@joelanford joelanford added the kind/bug Categorizes issue or PR as related to a bug. label Jun 7, 2024
@joelanford joelanford added this to OLM v1 Jun 7, 2024
@joelanford joelanford added the v1.0 Issues related to the initial stable release of OLMv1 label Jun 7, 2024
@bentito
Copy link
Contributor

bentito commented Jun 10, 2024

I added some timing data to the Bundles func too, documenting times for the various return points (https://gist.github.com/bentito/9fbf0d81354caa52121f5c4e294bd506). The fake catalog is just 1 entry I think? Would ~350µs (the longest test timing in 1 run (not scientific!) * number items in real catalog * # of calls to Bundle get close to 30s?

@joelanford
Copy link
Member Author

I think that's probably an order of magnitude off still. I suspect that our fake catalog entry isn't as complex and varied as what exists in operatorhub in terms of what takes the most time (i.e. icons, olm.bundle.object properties).

@acornett21
Copy link
Contributor

Are we talking about this function? Or something else?

@joelanford
Copy link
Member Author

Yep, that's the one.

@joelanford
Copy link
Member Author

I think there's a broad change that needs to be made to solve this. We should add a controller for Catalog objects that manages a local cache for each catalog that exists. This controller would be able to:

  • fetch catalog contents as soon as a catalog is unpacked (not wait until a cluster extension needs to be reconciled)
  • delete caches for catalogs that no longer exist (rather than leave them around forever)
  • organize the cache in a way that optimizes the interaction from the ClusterExtension reconciler (for one, we could essentially index by package name because a ClusterExtension only needs information for spec.packageName from the catalogs.)

Separate, but related, I think the catalogmetadata wrapper types are adding some unnecessary abstraction, and we will be better off interacting directly with the FBC data model (e.g. use only operator-registry's declcfg package, to the extent possible)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. v1.0 Issues related to the initial stable release of OLMv1
Projects
Archived in project
3 participants