Plan Cache: Replace the cache implementation #7304
Conversation
Nice stuff! I like your thinking, and you are probably 100% correct.
@systay I've started working on a new cache implementation. The first step has been reducing the API surface for the existing cache and abstracting it behind an interface.
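For the curious, here is a minimal sketch of what such an interface could look like in Go. Only the shape of `Get` is confirmed by the snippets in this review; `Set` and the callback-based `ForEach` are assumptions based on the API changes described further down.

// Cache is a sketch of the abstraction described above; Set and
// ForEach are assumed, only Get's shape appears in this review.
type Cache interface {
	// Get returns the cached value for key and whether it was present.
	Get(key string) (interface{}, bool)
	// Set stores a value and reports whether it was admitted.
	Set(key string, val interface{}) bool
	// ForEach iterates over cached values until the callback returns false.
	ForEach(callback func(value interface{}) bool)
}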
var selectCases []testCase

for tc := range iterateExecFile("dml_cases.txt") {
	if tc.output2ndPlanner != "" {
`output2ndPlanner` is an artifact of this project: #7280. It's the plan generated by the new gen4 planner. For this work, you should just ignore it and add all queries.
Can I assume that the failure in the race detector in CI is a flake? It doesn't look related to any of my code changes.
I think the reason the executor tries twice is to save on parsing, normalization and AST rewriting. I believe last we checked these were non-trivial parts of the full process. Don't trust my memory though - we should measure to see if it is worth double checking or not.
I really like your new cleaner API.
#6067 it's a known flaky test. I've restarted it.
I've just pushed the next two iterative steps that are going to be required for a proper cache implementation. My work for today has consisted of fixing the most glaring shortcoming in the current cache: the fact that the size of the cache is measured in entries as opposed to bytes. Right now, both the `vtgate` and `vttablet` caches are sized by entry count.

Hence, commit d7c4172 introduces two new configuration settings that define the cache size in bytes instead. Lastly, in order to make the new cache limits work in practice, I've had to implement a rough calculation of memory consumption for all cached plans.

My remaining work for today is going to be getting the tests green. 🍏
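To make the byte-based sizing concrete, here is a hypothetical sketch of the bookkeeping it implies (all names, including `CachedSize`, are illustrative; the PR's actual accounting is more involved):

// cachedPlan is a hypothetical interface for entries that can report
// an approximate in-memory footprint in bytes.
type cachedPlan interface {
	CachedSize() int64
}

// sizedCache keeps a running byte total and evicts against a byte
// budget instead of an entry count.
type sizedCache struct {
	maxBytes  int64
	usedBytes int64
	entries   map[string]cachedPlan
}

func (c *sizedCache) set(key string, plan cachedPlan) {
	c.usedBytes += plan.CachedSize()
	for c.usedBytes > c.maxBytes && len(c.entries) > 0 {
		c.evictOne() // shed entries until we're back under budget
	}
	c.entries[key] = plan
}

// evictOne removes an arbitrary entry (a real implementation would
// follow LRU/LFU order) and credits its size back to the budget.
func (c *sizedCache) evictOne() {
	for key, plan := range c.entries {
		c.usedBytes -= plan.CachedSize()
		delete(c.entries, key)
		return
	}
}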
// nullCache is a no-op cache that does not store items
type nullCache struct{}

func (n *nullCache) Get(_ string) (interface{}, bool) {
nit: didn't the linter complain about comments?
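(For context, a complete no-op cache of this shape might look as follows; everything beyond Get is an assumed method set mirroring the interface sketch above.)

// nullCache is a no-op cache that does not store items. Get always
// misses, Set never admits, and ForEach has nothing to iterate.
type nullCache struct{}

func (n *nullCache) Get(_ string) (interface{}, bool)       { return nil, false }
func (n *nullCache) Set(_ string, _ interface{}) bool       { return false }
func (n *nullCache) ForEach(_ func(value interface{}) bool) {}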
go/vt/vtgate/vtgate.go
Outdated
// If the legacy queryPlanCacheSize is set, override the value of the new queryPlanCacheSizeBytes,
// approximating the total size of the cache with the average size of an entry
if *queryPlanCacheSize != 0 {
	*queryPlanCacheSizeBytes = *queryPlanCacheSize * engine.AveragePlanSize
}
should this lead to us logging a warning about a deprecated flag being used?
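Something along these lines would do it; a sketch assuming vitess's go/vt/log package, with illustrative message wording:

if *queryPlanCacheSize != 0 {
	// Warn that the entry-count flag is deprecated before converting
	// it into an approximate byte budget.
	log.Warningf("queryPlanCacheSize is deprecated; set the cache size in bytes via queryPlanCacheSizeBytes instead")
	*queryPlanCacheSizeBytes = *queryPlanCacheSize * engine.AveragePlanSize
}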
go/vt/sqlparser/parsed_query.go
Outdated
if pq == nil {
	return 0
}
return int64(unsafe.Sizeof(ParsedQuery{})) +
historically, we have relegated the use of unsafe to https://github.com/vitessio/vitess/blob/master/go/hack/hack.go. Is it time to let go of that rule, @sougou? (I think it is)
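(For readers unfamiliar with the pattern being discussed, here is a rough illustration of how an unsafe.Sizeof-based estimate composes; the struct is a hypothetical stand-in, not vitess's ParsedQuery.)

import "unsafe"

// example is a hypothetical struct standing in for ParsedQuery.
type example struct {
	Query string
	Args  []int64
}

// approxSize adds the fixed struct header to the heap payloads
// reachable through the string and the slice.
func (e *example) approxSize() int64 {
	if e == nil {
		return 0
	}
	return int64(unsafe.Sizeof(*e)) +
		int64(len(e.Query)) + // string bytes
		int64(cap(e.Args))*int64(unsafe.Sizeof(int64(0))) // slice backing array
}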
Force-pushed from b5b9d54 to c5ee6cf
@systay I keep seeing two test failures that look very unrelated to my changes. Am I wrong here? Are the tests usually this flaky?
@vmg we do suffer from flakiness and are taking flaky tests down as time permits... It's unfortunately not uncommon.
Force-pushed from 8d46c78 to ae52b3c
There is no need to look up a given plan in the cache twice (once with its normalized and once with its non-normalized representation as the key): if the plan is normalized, it'll be stored in the cache in its normalized form. If it's not normalized, it'll be stored in its original form. Either way, the initial lookup with its non-normalized form is redundant. The raw `sql` content of a query only changes if the query has been normalized. In cases where it hasn't, there is no need to look up the same key twice in the cache.

Signed-off-by: Vicent Marti <[email protected]>
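Schematically, the lookup described in this commit reduces to the following (using the Cache interface sketched earlier; normalizeSQL is a hypothetical stand-in for the real normalization step):

// getCachedPlan performs the single lookup: the key is the
// normalized SQL when normalization ran, otherwise the raw SQL.
func getCachedPlan(cache Cache, sql string, normalize bool) (interface{}, bool) {
	key := sql
	if normalize {
		key = normalizeSQL(sql) // hypothetical: rewrites literals into bind variables
	}
	return cache.Get(key)
}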
We have experienced caching issues with batch inserts in Vitess clusters, whose plans were polluting the shared plan cache. Before we can consider the trivial fix for this issue, which would be simply disabling caching for `INSERT` statements, we need to find out the impact of disabling caching for these plans. Unfortunately, it looks like there isn't a significant performance difference between preparing a plan for an INSERT statement vs a SELECT one. Here's the output of two comparisons with a random sample of 32 of each statement:

BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       766   1640575 ns/op   511073 B/op   6363 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    746   1479792 ns/op   274486 B/op   7730 allocs/op
BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       823   1540039 ns/op   496079 B/op   5949 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    798   1526661 ns/op   275016 B/op   7730 allocs/op

There is no noticeable performance difference when preparing the INSERT statements. The only consistent metric is that INSERT statement plans allocate more memory than SELECT plans.

Signed-off-by: Vicent Marti <[email protected]>
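For reference, a comparison of this shape with Go's testing package looks roughly like this; loadSample, buildPlan, and the select-cases filename are hypothetical stand-ins for the real harness:

func BenchmarkSelectVsDML(b *testing.B) {
	dml := loadSample("dml_cases.txt", 32)    // 32 random DML statements
	sel := loadSample("select_cases.txt", 32) // filename assumed

	b.Run("DML (random sample, N=32)", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, q := range dml {
				buildPlan(q) // stand-in for the planner entry point
			}
		}
	})
	b.Run("Select (random sample, N=32)", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, q := range sel {
				buildPlan(q)
			}
		}
	})
}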
The current public API for the cache makes some assumptions that do not hold for more efficient cache implementations with admission policies. The following APIs have been replaced with equivalent ones or removed altogether:

- `LRUCache.SetIfAbsent`: removed, not used
- `LRUCache.Peek`: replaced with `LRUCache.ForEach`, since the original Peek was only used to iterate through the contents of the cache
- `LRUCache.Keys`: likewise replaced with `ForEach`, since the keys were only being accessed for iteration

Signed-off-by: Vicent Marti <[email protected]>
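As a usage sketch, iteration now goes through a callback instead of key enumeration (the element type assertion is an assumption for illustration):

cache.ForEach(func(value interface{}) bool {
	plan := value.(*engine.Plan) // assumed element type
	_ = plan                     // inspect or aggregate the plan here
	return true                  // true means: keep iterating
})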
The `cache.LRUCache` struct has now been abstracted behind a generic Cache interface so that it can be swapped with more complex Cache implementations.

Signed-off-by: Vicent Marti <[email protected]>
The existing pattern for vttablet/vtgate cache configuration is a dangerous practice, because it lets the user configure the number of items that can be stored in the cache, as opposed to the total amount of memory (approximately) that the cache will consume. This makes tuning production systems complicated, and will skew more complex cache implementations that use size-aware eviction policies.

To fix this, we're deprecating the original config settings for cache tuning, and introducing new ones where the total size of the cache is defined in BYTES as opposed to ENTRIES. To maintain backwards compatibility, if the user supplies the legacy config options with number of ENTRIES, we'll calculate an approximate total size for the cache based on the average size of a cache entry for each given cache.

Signed-off-by: Vicent Marti <[email protected]>
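For example, under this scheme a legacy setting of 5,000 entries combined with an assumed average entry size of 1 KiB would translate into a byte budget of roughly 5 MiB (5,000 × 1,024 bytes); the numbers here are illustrative, not vitess defaults.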
Replaced with #7439
Hiii everyone! First PR to Vitess in a while. Let's get into it!
Description
The cache for planned queries, which exists both in the `vtgate`s and the `vttablet`s, has not been behaving well. We have had to disable these caches altogether because they were causing OOM errors. We believe the main cause triggering these errors is bulk insert queries, which come in different sizes and statement counts, and hence occupy too much cache space even after normalization.

After some pairing with @sougou, we have considered the possibility of disabling caching altogether for `INSERT` statements, since these are the ones known to cause particular problems. Before doing this, I promised to gather some numbers on how expensive the query plans for these statements are to prepare. The results are not encouraging: on average, `INSERT` statements are just as expensive as any other statement to plan (as seen in the commit message for fe240ec), so disabling their caching is going to cause a performance regression.

Next Steps
Despite the fact that we've tracked down the OOM issues to batch-`INSERT` queries, I think the underlying issue is clear upon reviewing the implementation of our cache: the main problem we're facing (which batch inserts definitely exacerbate) is that our plan cache is too primitive for our use case. Most notably, it does not have an admission policy. Caches for database systems have historically always had an admission policy, whose goal is preventing extreme corner cases from taking over the cache. In our case, that corner case is batch-insert queries; an equivalent in a traditional relational database system would be a full-table scan, which would page into cache a lot of pages that will only be read once. These kinds of pathological access patterns cause cache pollution by bringing into cache a lot of data that is never going to be read again.

My belief is that we are going to see a significant performance improvement by replacing the `LRUCache` implementation in Vitess with an LFU implementation whose eviction policy cannot be trivially polluted (see the sketch below for the general idea). I think this would be a minimal effort which I'd like to undertake next week, and it should both fix the cache memory growth issues we're seeing and improve cache performance overall, since the current implementation with a map + linked list uses a single global lock for the whole cache, which right now is actively being contended by all the query goroutines in a `vtgate`. Obviously, it would also allow us to enable `INSERT` plan caching again, since batch inserts ought to be a corner case that will no longer pollute our cache. This is something which I intend to verify in an integration test.

@sougou @enisoc I would like your feedback on this before I get started on Monday.
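To make the admission-policy idea concrete, here is a conceptual sketch of frequency-based admission (the general TinyLFU-style idea, not the implementation this PR will necessarily adopt):

// admissionPolicy sketches frequency-based admission: a candidate is
// only admitted if it is at least as "hot" as the entry it would
// evict, so one-off queries (e.g. giant batch INSERTs) never displace
// useful entries.
type admissionPolicy struct {
	// In practice this would be a compact frequency sketch
	// (e.g. count-min), not an unbounded map.
	freq map[string]uint64
}

// touch records an access to key.
func (a *admissionPolicy) touch(key string) {
	a.freq[key]++
}

// admit reports whether candidate should replace victim.
func (a *admissionPolicy) admit(candidate, victim string) bool {
	return a.freq[candidate] >= a.freq[victim]
}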
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: