Cache controllers #3589
Conversation
Force-pushed from 9750f74 to 8711a1e
/hold
The PR contains a non-trivial amount of code, yet it doesn't have any description. Could you please provide some information about what it does and why it is needed?
I added a description
/hold cancel. I ran
@bskiba PTAL. I checked that with this change and 200 Tekton deployments in a cluster, VPA seems to work correctly.
Looking at this now.
OK, I had a first look and it looks good. Thanks for splitting it into nicely separated commits; it makes for a much easier review.
No, I don't see any logs about throttling.
I didn't see a significant change in memory usage (which is not surprising, since we're storing only a couple of controller keys per controller). I'll run another test where I create a bunch of controllers to see memory usage increase, then wait to see if it drops over time.
We do. I'll add caching for that too. Is it ok if I do this in a separate PR?
I started a few hundred pods. As a result memory usage increased from
Force-pushed from 8711a1e to 378798b
@kgolab PTAL
Done
defer cc.mux.Unlock()
now := now()
for k, v := range cc.cache {
	if now.After(v.refreshAfter) {
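For context, a rough sketch of the maintenance pass this snippet appears to belong to; everything beyond `cc.mux`, `cc.cache`, and `refreshAfter` is an assumption for illustration, not the PR's actual code:

```go
package controllerfetcher

import (
	"sync"
	"time"
)

// Sketch of the assumed cache entry: one deadline for refreshing the stored
// result and one for dropping entries nobody has read recently.
type cacheEntry struct {
	refreshAfter time.Time // re-fetch the scale subresource after this time
	deleteAfter  time.Time // drop the entry if it hasn't been read by this time
}

type controllerCacheStorage struct {
	mux   sync.Mutex
	cache map[string]cacheEntry
}

// keysToRefresh collects keys whose entries are due for a refresh. The actual
// scale API calls would happen outside the lock, and a similar pass would delete
// entries whose deleteAfter has passed.
func (cc *controllerCacheStorage) keysToRefresh(now func() time.Time) []string {
	cc.mux.Lock()
	defer cc.mux.Unlock()
	t := now()
	var keys []string
	for k, v := range cc.cache {
		if t.After(v.refreshAfter) {
			keys = append(keys, k)
		}
	}
	return keys
}
```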
I'd like to understand the idea behind refreshing & clearing stale entries a little better.
IIUC when an entry becomes idle, we'd refresh it a few times before deleting it, which seems counterintuitive.
If this is true, shouldn't we throttle refreshes only to entries which were read recently? Then Get should probably return false to force a refresh upon reading an idle (& not refreshed) entry.
I chose the simplest approach I thought would work (periodically refresh all entries, remove entries that weren't read in a while).

> If this is true, shouldn't we throttle refreshes only to entries which were read recently? Then Get should probably return false to force a refresh upon reading an idle (& not refreshed) entry.

If Get returns false then the entry is effectively removed. So if I understand your idea correctly, it's effectively the same as setting a shorter lifetime for cache entries (less than 2 refresh durations?).
If it's about simplicity then why not just:
- Insert sets validity time (let's say it's current scaleCacheEntryFreshnessTime),
- Get checks validity time and returns false if the entry is expired; the caller has to call Insert then,
- there is no background refresh,
- there is only background garbage collection?
This reduces the number of calls to scaleNamespacer.Get, as we never call it for entries that are no longer present.
I understand that the drawback of this solution is that all Gets are executed in the main VPA loop instead of being spread evenly over time.
So basically the discussion boils down to a question: "how long do we actively use an entry, on average"; if it's much longer than scaleCacheEntryLifetime, the amortised cost of multiple refreshes of a no-longer-used entry is low and then the gain from spreading scaleNamespacer.Get calls is likely more important.
If you think we'd indeed have long-lived entries (seems plausible) let's leave the solution as it is now.
I'd love to see some metric (or logs with counter) added at some point so we can assess the impact of this change and maybe fine-tune scaleCacheEntryLifetime vs scaleCacheEntryFreshnessTime, likely lowering the former.
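For concreteness, a minimal sketch of the validity-on-read alternative described above; all names are illustrative and this is not the PR's code:

```go
package controllerfetcher

import (
	"sync"
	"time"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
)

// Entry for the validity-on-read variant: Insert stamps validUntil, Get rejects
// expired entries, and only garbage collection runs in the background.
type entry struct {
	scale      *autoscalingv1.Scale
	validUntil time.Time
}

type simpleCache struct {
	mux      sync.Mutex
	validity time.Duration
	entries  map[string]entry
}

// Insert stores a freshly fetched scale object and sets its validity window.
func (c *simpleCache) Insert(key string, scale *autoscalingv1.Scale) {
	c.mux.Lock()
	defer c.mux.Unlock()
	c.entries[key] = entry{scale: scale, validUntil: time.Now().Add(c.validity)}
}

// Get returns ok=false for missing or expired entries, which forces the caller
// to hit the scale API again and Insert the result; there is no background refresh.
func (c *simpleCache) Get(key string) (*autoscalingv1.Scale, bool) {
	c.mux.Lock()
	defer c.mux.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.validUntil) {
		return nil, false
	}
	return e.scale, true
}
```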
Force-pushed from 70799e8 to 474c6e3
@kgolab I added the explanation (above
I've set aside some time tomorrow to take a look 👀
if _, ok := cc.cache[key]; ok {
	return
}
jitter := time.Duration(rand.Float64()*float64(cc.refreshJitter.Nanoseconds())) * time.Nanosecond
The function is not used here
Force-pushed from 474c6e3 to 4440cdf
@bskiba I can't reply directly to the comment about using the jitter function from API machinery, but I applied it.
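For reference, a minimal sketch of what using the apimachinery jitter helper could look like in place of the hand-rolled rand expression, assuming the suggestion referred to wait.Jitter; the function and parameter names are illustrative:

```go
package controllerfetcher

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// refreshDeadline stamps when a freshly inserted entry should next be refreshed.
// wait.Jitter(d, maxFactor) returns a duration between d and d*(1+maxFactor), so
// with maxFactor 0.1 refreshes are spread over a 10% window rather than all
// landing at exactly the same offset.
func refreshDeadline(validity time.Duration) time.Time {
	return time.Now().Add(wait.Jitter(validity, 0.1))
}
```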
Some nits left.
Overall I think we are close to what we want; I'd like to make sure the cache constants we choose work in our favor.
@kgolab Would you be able to take another look today? I would like you to explicitly approve as well once you're happy with this PR.
if ok, scale, err := f.controllerCache.Get(namespace, groupResource, name); ok {
	return scale, err
}
scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})
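Tying that snippet together, a sketch of what the whole cached read path presumably looks like; the interface, the struct, and the Insert call are assumptions made for illustration:

```go
package controllerfetcher

import (
	"context"

	autoscalingv1 "k8s.io/api/autoscaling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	scaleclient "k8s.io/client-go/scale"
)

// scaleCache is an assumed interface matching the Get call in the snippet above.
type scaleCache interface {
	Get(namespace string, gr schema.GroupResource, name string) (bool, *autoscalingv1.Scale, error)
	Insert(namespace string, gr schema.GroupResource, name string, scale *autoscalingv1.Scale, err error)
}

type controllerFetcher struct {
	scaleNamespacer scaleclient.ScalesGetter
	controllerCache scaleCache
}

// getScale serves the lookup from the cache when possible and otherwise falls
// back to the /scale subresource API, storing the result (and any error) so the
// next lookup for the same controller is answered locally.
func (f *controllerFetcher) getScale(namespace, name string, gr schema.GroupResource) (*autoscalingv1.Scale, error) {
	if ok, scale, err := f.controllerCache.Get(namespace, gr, name); ok {
		return scale, err
	}
	scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), gr, name, metav1.GetOptions{})
	f.controllerCache.Insert(namespace, gr, name, scale, err)
	return scale, err
}
```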
That makes sense, I'm not gonna be super stubborn about it :)
Force-pushed from 4440cdf to 567b81c
I'm happy. I'll LGTM once kgolab takes one more look (I think more eyes are better here, as caching is always tricky :) )
And thanks a lot for the work that went into this!
@kgolab please take a look
@@ -49,7 +49,13 @@ import (
	resourceclient "k8s.io/metrics/pkg/client/clientset/versioned/typed/metrics/v1beta1"
)

const defaultResyncPeriod time.Duration = 10 * time.Minute
const (
	scaleCacheLoopPeriod time.Duration = time.Minute
If you really want this to be spread independently from the VPA main loop, maybe choose some other period, preferably relatively prime to the main loop's 60s period?
IIUC it might be a relatively short one, e.g. 13s, as the refresh loop is pretty lightweight except for the scaleNamespacer.Get calls, which won't change in number but instead get spread more evenly.
Reading through older comments, I should have expressed myself more clearly when asking for lowering this value from 10 minutes, sorry.
Done. I went with 7s because it's the shortest period in whole seconds that's relatively prime to 60s.
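A sketch of starting that maintenance loop on its own 7s cadence, assuming a helper such as wait.Until from apimachinery; the callback name is illustrative:

```go
package input

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Relatively prime to the recommender's 60s main loop, so cache maintenance and
// the main loop don't keep lining up on the same ticks.
const scaleCacheLoopPeriod = 7 * time.Second

// startScaleCacheLoop runs refresh/garbage-collection of the controller cache on
// its own schedule, spreading the scale API refresh calls over time instead of
// bunching them into the main VPA loop.
func startScaleCacheLoop(refreshAndGC func(), stopCh <-chan struct{}) {
	go wait.Until(refreshAndGC, scaleCacheLoopPeriod, stopCh)
}
```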
Thanks! I added more comments but this already looks good to me. And I'm really sorry for taking so much time to get back to this. Please ping me earlier next time.
@bskiba, I cannot execute the command, so letting you know here: LGTM
@kgolab replies to comments I can't find here but got email notifications for:
Type that caches results of attempts to get the parents of controllers.
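A plausible shape for the key such a type would be indexed by, matching the Get(namespace, groupResource, name) calls earlier in the thread; the field names are assumptions:

```go
package controllerfetcher

import "k8s.io/apimachinery/pkg/runtime/schema"

// scaleCacheKey identifies a single controller whose scale-subresource lookup
// result is cached; the exact field set is an assumption for illustration.
type scaleCacheKey struct {
	namespace     string
	groupResource schema.GroupResource
	name          string
}
```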
Force-pushed from 567b81c to 9d7898a
Thanks a lot, Joachim
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: bskiba, jbartosik
Cache controllers
Cache results of attempting to determine the parent of a controller.
This is to improve VPA's performance in big clusters, where we need to throttle the queries we're making.
How caching is supposed to work: