throttler: don't allocate any resources unless it is actually enabled #8643
Conversation
Signed-off-by: deepthi <[email protected]>
This is fine.
The other thing is that we can increase the garbage-collect interval on most of these caches without affecting logic.
Out-of-band comment from @shlomi-noach:
I will make this change as well before merging the PR.
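To make the garbage-collect interval concrete, here is a minimal sketch, not the actual Vitess code, assuming a go-cache-style API in which the constructor takes a per-entry default expiration and a separate cleanup (garbage-collect) interval; the variable name recentChecks and the durations are illustrative only.

```go
package main

import (
	"time"

	cache "github.com/patrickmn/go-cache"
)

func main() {
	// First argument: how long an entry stays fresh (per-entry expiration).
	// Second argument: how often the background sweep evicts expired entries.
	// Relaxing the sweep from 1s to 10s does not change when entries expire.
	recentChecks := cache.New(500*time.Millisecond, 10*time.Second)

	recentChecks.Set("some-app", true, cache.DefaultExpiration)

	// Reads still honor per-entry expiration regardless of the sweep interval.
	if _, found := recentChecks.Get("some-app"); !found {
		// the entry expired (or was never set); recompute it here
	}
}
```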
We have the following:
What are reasonable values for the cleanup intervals that are currently 1 second or less? Should I change those to 10 seconds? There are two of them:
Signed-off-by: deepthi <[email protected]>
Are we really OK with the increase to the timeouts/intervals in production systems? Vitess knows via its env when it's running in a local deployment, so maybe it would be wise to check that before increasing those time values, and only do so in local environments, with the purpose of reducing CPU usage. The increased accuracy is probably always worth it in production environments, and the increase in CPU usage there will be negligible.
Both are fine to change to
Yes, allow me to explain. It's in how this particular cache implementation works. The cache is a KV map, where each value is a combination of the actual data and an expiration timestamp.
Whenever you read a key, the cache checks that timestamp and treats an expired entry as a miss.
Thus far the logic is sound and complete. There is no need for garbage collection in terms of correctness of data. Of course, caches can blow up with data, and the garbage collection, which is the topic of this discussion, is how the cache is cleaned up: the garbage collector iterates all items, computes which entries have expired, and evicts them. This is useful in caches that have many items, where new keys are being introduced and are short-lived. However, in the lag-throttler caches, the number of items is very limited: lag computation cache size == number of servers in a shard (just a handful). Aggregated metrics == 2 (one for the lag metric, one for the primary metric). Cached check results size == number of app names that inquire for checks, which is also a handful. Therefore, it's the same keys again and again, and there is no fear of bloating the cache size; hence, garbage collection can be very relaxed. It can probably be disabled altogether without impact.
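For illustration only, here is a conceptual sketch of the mechanism described above: per-entry expiration checked lazily on read, plus an independent garbage-collection sweep. The type and method names are made up; this is not the throttler's actual cache implementation.

```go
package main

import (
	"sync"
	"time"
)

// item pairs a cached value with its own expiration timestamp.
type item struct {
	value      interface{}
	expiration time.Time
}

type kvCache struct {
	mu    sync.Mutex
	items map[string]item
}

func newKVCache(gcInterval time.Duration) *kvCache {
	c := &kvCache{items: map[string]item{}}
	go func() {
		// The sweep only bounds memory; it is not needed for correctness,
		// because Get already ignores expired entries.
		for range time.Tick(gcInterval) {
			c.deleteExpired()
		}
	}()
	return c
}

func (c *kvCache) Set(key string, value interface{}, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = item{value: value, expiration: time.Now().Add(ttl)}
}

// Get treats an expired entry as a miss, so results stay correct no matter
// how lazily (or whether) the garbage collector runs.
func (c *kvCache) Get(key string) (interface{}, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	it, ok := c.items[key]
	if !ok || time.Now().After(it.expiration) {
		return nil, false
	}
	return it.value, true
}

func (c *kvCache) deleteExpired() {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	for k, it := range c.items {
		if now.After(it.expiration) {
			delete(c.items, k)
		}
	}
}
```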
I'm convinced! 👍
Description
Even when -enable_lag_throttler is not set, various caches are created and checked at intervals. Specifically, nonLowPriorityAppRequestsThrottled is checked every 100ms. This is wasteful. All of this is now gated by the flag. In addition, I moved various other initializations (like SelfChecks) to also be gated by the flag.
Related Issue(s)
Checklist
Deployment Notes