
Cache controllers #3589

Merged — 2 commits merged from the cache-controllers branch into kubernetes:master on Nov 24, 2020
Conversation

@jbartosik (Collaborator) commented Oct 8, 2020

Cache the results of attempting to determine the parent of a controller.

This is to improve VPA's performance in big clusters, where the queries we make need to be throttled.

How caching is supposed to work (a sketch follows below):

  • When we first try to determine the parent of a controller, fetch it (and cache the result).
    • This should happen rarely (when new controllers appear and when VPA starts), so it shouldn't affect VPA's performance heavily.
    • It lets us avoid incorrect state.
  • On subsequent attempts to get the parent of a controller, return the cached result.
    • We do this frequently, so this should significantly improve our performance.
  • Refresh the contents of the cache and remove unused entries in parallel.
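
A minimal sketch of the cache described above. Only the mux/cache fields, the refreshAfter timestamp, and the Get/Insert split appear in the diff excerpts quoted later in this conversation; the type names, the key shape, and the constant values are illustrative assumptions, not the PR's actual code.

```go
// Hedged sketch only: names and values other than mux, cache and refreshAfter
// are assumptions, not code from this PR.
package controllercache

import (
	"sync"
	"time"

	autoscalingapi "k8s.io/api/autoscaling/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

const (
	scaleCacheEntryFreshnessTime = 10 * time.Minute // assumed: how long before a background refresh
	scaleCacheEntryLifetime      = time.Hour        // assumed: how long an unread entry is kept
)

// scaleCacheKey identifies a controller whose scale subresource result we cache.
type scaleCacheKey struct {
	namespace     string
	groupResource schema.GroupResource
	name          string
}

// scaleCacheEntry stores one cached result, including errors, so that
// repeated failing lookups are throttled too.
type scaleCacheEntry struct {
	refreshAfter time.Time // background loop re-fetches the entry after this time
	deleteAfter  time.Time // entries not read before this time are garbage collected
	resource     *autoscalingapi.Scale
	err          error
}

type controllerCacheStorage struct {
	mux   sync.Mutex
	cache map[scaleCacheKey]scaleCacheEntry
}

func newControllerCacheStorage() *controllerCacheStorage {
	return &controllerCacheStorage{cache: map[scaleCacheKey]scaleCacheEntry{}}
}

// Get returns the cached result (if any) and extends the entry's lifetime,
// so entries that are still being read are not garbage collected.
func (cc *controllerCacheStorage) Get(namespace string, gr schema.GroupResource, name string) (bool, *autoscalingapi.Scale, error) {
	cc.mux.Lock()
	defer cc.mux.Unlock()
	key := scaleCacheKey{namespace: namespace, groupResource: gr, name: name}
	e, ok := cc.cache[key]
	if ok {
		e.deleteAfter = time.Now().Add(scaleCacheEntryLifetime)
		cc.cache[key] = e
	}
	return ok, e.resource, e.err
}

// Insert stores a freshly fetched result; the background loop re-fetches it
// after refreshAfter and drops it after deleteAfter if it was never read.
func (cc *controllerCacheStorage) Insert(namespace string, gr schema.GroupResource, name string, scale *autoscalingapi.Scale, err error) {
	cc.mux.Lock()
	defer cc.mux.Unlock()
	key := scaleCacheKey{namespace: namespace, groupResource: gr, name: name}
	now := time.Now()
	cc.cache[key] = scaleCacheEntry{
		refreshAfter: now.Add(scaleCacheEntryFreshnessTime),
		deleteAfter:  now.Add(scaleCacheEntryLifetime),
		resource:     scale,
		err:          err,
	}
}
```

In the fetcher, Get would be consulted first and only a cache miss falls through to the scale subresource call, as the excerpt quoted later in the review thread shows.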

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Oct 8, 2020
@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Oct 8, 2020
@jbartosik force-pushed the cache-controllers branch 5 times, most recently from 9750f74 to 8711a1e, October 8, 2020 16:51
@jbartosik (Collaborator, Author):

/hold
I think it's ready for review, but I want to do some more tests before merging.

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Oct 8, 2020
@mwielgus (Contributor) left a comment:

The PR contains a non-trivial amount of code, yet it doesn't have any description. Could you please provide some information about what it does and why it is needed?

@jbartosik (Collaborator, Author):

The PR contains a non-trivial amount of code, yet it doesn't have any description. Could you please provide some information about what it does and why it is needed?

I added a description.

@jbartosik requested a review from mwielgus October 9, 2020 13:01
@jbartosik (Collaborator, Author):

/hold cancel

I ran full-vpa and recommender tests with this PR and they passed.

@k8s-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Oct 13, 2020
@jbartosik (Collaborator, Author):

@bskiba PTAL

I checked this change with 200 Tekton deployments in a cluster, and VPA seems to work correctly.

@bskiba (Member) commented Oct 16, 2020

Looking at this now.

@bskiba (Member) commented Oct 16, 2020

OK, I had a first look and it looks good. Thanks for splitting this into nicely separated commits; that makes for a much easier review.
I'll need to take another look on Monday for a deeper review. So far I'd like to confirm some basic things.

  1. When testing with 200 controllers, did you observe API call throttling?
  2. Did you watch the memory consumption? It would be good to confirm we're not leaking, i.e. once we delete the controllers we should see a drop in memory after the TTL passes.
  3. With this solution, don't we still issue scale subresource calls when checking if the controller is scalable?
    scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})

@jbartosik (Collaborator, Author):

OK, I had a first look and it looks good. Thanks for splitting this into nicely separated commits; that makes for a much easier review.
I'll need to take another look on Monday for a deeper review. So far I'd like to confirm some basic things.

  1. When testing with 200 controllers, did you observe API call throttling?

No, I don't see any logs about throttling.

  2. Did you watch the memory consumption? It would be good to confirm we're not leaking, i.e. once we delete the controllers we should see a drop in memory after the TTL passes.

I didn't see a significant change in memory usage (which is not surprising; we're storing only a couple of controller keys per controller). I'll run another test where I create a bunch of controllers to see memory usage increase, then wait to see if it drops over time.

  3. With this solution, don't we still issue scale subresource calls when checking if the controller is scalable?
    scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})

We do. I'll add caching for that too. Is it ok if I do this in a separate PR?

@jbartosik (Collaborator, Author):

  2. Did you watch the memory consumption? It would be good to confirm we're not leaking, i.e. once we delete the controllers we should see a drop in memory after the TTL passes.

I didn't see a significant change in memory usage (which is not surprising; we're storing only a couple of controller keys per controller). I'll run another test where I create a bunch of controllers to see memory usage increase, then wait to see if it drops over time.

I started a few hundred pods. As a result, memory usage increased from 23Mi to 31Mi. When I deleted the pods, memory usage decreased to 26Mi and later (I'm not sure after how much time) down to 25Mi. I'll repeat the test and see what happens.

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Oct 26, 2020
@jbartosik (Collaborator, Author):

@kgolab PTAL

@jbartosik (Collaborator, Author):

  3. With this solution, don't we still issue scale subresource calls when checking if the controller is scalable?
    scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})

We do. I'll add caching for that too. Is it ok if I do this in a separate PR?

Done

	defer cc.mux.Unlock()
	now := now()
	for k, v := range cc.cache {
		if now.After(v.refreshAfter) {
Review comment (Collaborator):

I'd like to understand the idea behind refreshing & clearing stale entries a little better.

IIUC, when an entry becomes idle we'd refresh it a few times before deleting it, which seems counterintuitive.
If this is true, shouldn't we limit refreshes to entries which were read recently? Then Get should probably return false to force a refresh upon reading an idle (and not refreshed) entry.

Reply (Collaborator, Author):

I chose the simplest approach I thought would work (periodically refresh all entries, remove entries that weren't read in a while).

If this is true, shouldn't we limit refreshes to entries which were read recently? Then Get should probably return false to force a refresh upon reading an idle (and not refreshed) entry.

If Get returns false, then the entry is effectively removed. So if I understand your idea correctly, it's effectively the same as setting a shorter lifetime for cache entries (less than 2 refresh durations?).

Review comment (Collaborator):

If it's about simplicity, then why not just (a sketch of this variant follows below):

  • Insert sets a validity time (let's say it's the current scaleCacheEntryFreshnessTime),
  • Get checks the validity time and returns false if the entry is expired; the caller then has to call Insert,
  • there is no background refresh,
  • there is only background garbage collection?

This reduces the number of calls to scaleNamespacer.Get, as we never call it for entries that are no longer present.
I understand that the drawback of this solution is that all the Gets are executed in the main VPA loop instead of being spread evenly over time.

So basically the discussion boils down to one question: "how long do we actively use an entry, on average?" If it's much longer than scaleCacheEntryLifetime, the amortised cost of multiple refreshes of a no-longer-used entry is low, and the gain from spreading the scaleNamespacer.Get calls is likely more important.

If you think we'd indeed have long-lived entries (which seems plausible), let's leave the solution as it is now.
I'd love to see some metric (or logs with a counter) added at some point so we can assess the impact of this change and maybe fine-tune scaleCacheEntryLifetime vs. scaleCacheEntryFreshnessTime, likely lowering the former.
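
For concreteness, a hedged sketch of the variant described in the list above, reusing the illustrative controllerCacheStorage types from the earlier sketch (none of these names come from the PR itself):

```go
package controllercache // reuses the illustrative types from the earlier sketch

import (
	"time"

	autoscalingapi "k8s.io/api/autoscaling/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// GetIfFresh sketches the no-background-refresh variant: Insert stamps an
// expiry, Get rejects expired entries, and a background loop only
// garbage-collects. This is not the PR's code.
func (cc *controllerCacheStorage) GetIfFresh(namespace string, gr schema.GroupResource, name string) (bool, *autoscalingapi.Scale, error) {
	cc.mux.Lock()
	defer cc.mux.Unlock()
	key := scaleCacheKey{namespace: namespace, groupResource: gr, name: name}
	e, ok := cc.cache[key]
	if !ok || time.Now().After(e.refreshAfter) {
		// Missing or expired: the caller re-fetches the scale object and calls Insert again.
		return false, nil, nil
	}
	return true, e.resource, e.err
}
```

The merged PR keeps the background-refresh design instead: idle entries may be refreshed a few times before they expire, but the scaleNamespacer.Get calls are spread over time rather than all landing inside the main VPA loop.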

@jbartosik (Collaborator, Author):

@kgolab PTAL

Many questions and some nits, but I guess it's worth discussing how exactly the cache is expected to behave with regard to refreshes and removal of stale entries.

I pushed changes applying the comments. I had questions about a few and left them open. Later today I'll add an explanation of why I want the cache to work this way and the TODO we discussed.

@kgolab I added the explanation (above controllerCacheStorage) and the TODO. Please take a look.

@jbartosik (Collaborator, Author):

@kgolab @bskiba Please take a look

@bskiba (Member) commented Nov 3, 2020

I've set aside some time tomorrow to take a look 👀

	if _, ok := cc.cache[key]; ok {
		return
	}
	jitter := time.Duration(rand.Float64()*float64(cc.refreshJitter.Nanoseconds())) * time.Nanosecond
Review comment (Member):

The function is not used here

@jbartosik (Collaborator, Author):

@bskiba @kgolab I pushed changes, please take a look

@jbartosik (Collaborator, Author):

@bskiba I can't reply directly to the comment about using the jitter function from API machinery, but I applied it.
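
For reference, the API machinery helper in question is wait.Jitter from k8s.io/apimachinery/pkg/util/wait. A hedged sketch of how the manual rand-based computation quoted above could be replaced, extending the earlier illustrative sketch (the constant name and factor are assumptions, not the PR's values):

```go
package controllercache // extends the illustrative sketch above

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Assumed jitter factor; the PR's actual constant and value may differ.
const scaleCacheEntryJitterFactor = 1.0

// nextRefreshAfter spreads refresh times so entries inserted together are not
// all re-fetched at once: wait.Jitter(d, maxFactor) returns a duration in
// [d, d + maxFactor*d), replacing the manual rand.Float64() computation
// quoted in the review thread above.
func nextRefreshAfter(now time.Time) time.Time {
	return now.Add(wait.Jitter(scaleCacheEntryFreshnessTime, scaleCacheEntryJitterFactor))
}
```

Usage would be to stamp refreshAfter with nextRefreshAfter(now) whenever an entry is inserted or refreshed.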

@bskiba (Member) left a comment:

Some nits left.

Overall I think we are close to what we want; I would like to make sure the cache constants we choose work in our favor.

@kgolab Would you be able to take another look today? I would like you to explicitly approve as well once you're happy with this PR.

	if ok, scale, err := f.controllerCache.Get(namespace, groupResource, name); ok {
		return scale, err
	}
	scale, err := f.scaleNamespacer.Scales(namespace).Get(context.TODO(), groupResource, name, metav1.GetOptions{})
Review comment (Member):

That makes sense, I'm not gonna be super stubborn about it :)

@jbartosik (Collaborator, Author) commented Nov 10, 2020

@bskiba @kgolab PTAL

@bskiba (Member) left a comment:

I'm happy; I'll LGTM once kgolab takes one more look (I think more eyes are better here, as caching is always tricky :))

@bskiba (Member) commented Nov 10, 2020

And thanks a lot for the work that went into this!

@jbartosik (Collaborator, Author):

@kgolab please take a look

@@ -49,7 +49,13 @@ import (
	resourceclient "k8s.io/metrics/pkg/client/clientset/versioned/typed/metrics/v1beta1"
)

const defaultResyncPeriod time.Duration = 10 * time.Minute
const (
	scaleCacheLoopPeriod time.Duration = time.Minute
@kgolab (Collaborator) commented Nov 20, 2020:

If you really want this to be spread independently from the VPA main loop, maybe choose some other period, preferably relatively prime to the main loop's 60s period?

IIUC it might be relatively short, e.g. 13s, as the refresh loop is pretty lightweight except for the scaleNamespacer.Get calls, which won't change in number but instead get spread more evenly.

Reading through older comments, I should've expressed myself more clearly when asking to lower this value from 10 minutes, sorry.

Reply (Collaborator, Author):

Done. I went with 7s because it's the shortest period that's a whole number of seconds and relatively prime to 60s.
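
A hedged reading of the resulting constant (the exact final diff may differ):

```go
package controllercache // illustrative, as in the sketches above

import "time"

// scaleCacheLoopPeriod drives the background refresh/garbage-collection loop.
// 7s is relatively prime to the 60s main VPA loop (gcd(7, 60) = 1), so the two
// loops drift relative to each other instead of repeatedly firing at the same
// instant; 13s, as suggested above, would have the same property.
const scaleCacheLoopPeriod time.Duration = 7 * time.Second
```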

@kgolab (Collaborator) commented Nov 20, 2020

Thanks!

I added more comments but this already looks good to me.
Please note, though, that some of these comments are in older conversations, so they might be trickier to find; I don't know how to make GitHub show them all together.

And I'm really sorry for taking so much time to get back to this. Please ping me earlier next time.

@kgolab (Collaborator) commented Nov 20, 2020

@bskiba, I cannot execute the command, so letting you know here: LGTM

@jbartosik (Collaborator, Author):

@kgolab, replies to comments I can't find here but got email notifications for:

  • We don't want to refresh entries in the main loop, so it won't get stuck; I'm keeping this as is (but I'm making the refresh loop execute more frequently).
  • I added a TODO to add something that will let us optimize performance. I'll probably do it in two stages: first I'll add something to let us determine whether we need to improve performance. If we think we need performance improvements, I think it's best to use something more detailed than average lifetime (e.g. we might want to check for changes in young collections, and on error responses, more quickly than in collections which have been around unchanged for a long time).
  • I changed the test for inserting over an old value to not care which value it keeps.
  • I renamed controllerCache to scaleSubresourceCacheStorage.

Commit: Type that caches results of attempts to get parent of controllers.
@bskiba (Member) commented Nov 24, 2020

Thanks a lot Joachim
/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) Nov 24, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bskiba, jbartosik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 857477d into kubernetes:master Nov 24, 2020
crgarcia12 added a commit to crgarcia12/autoscaler that referenced this pull request Nov 25, 2020
@jbartosik mentioned this pull request Nov 27, 2020
@jbartosik deleted the cache-controllers branch January 15, 2021 10:20