Plan Cache: Replace the cache implementation #7304
Conversation
Nice stuff! I like your thinking, and you are probably 100% correct.
@systay I've started working on a new cache implementation. The first step has been reducing the API surface for the existing cache and abstracting it behind an interface.
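For the curious, here is a minimal sketch of what such an interface could look like in Go. Only the shape of `Get` is confirmed by the snippets in this review; `Set` and the callback-based `ForEach` are assumptions based on the API changes described further down.

// Cache is a sketch of the abstraction described above; Set and
// ForEach are assumed, only Get's shape appears in this review.
type Cache interface {
	// Get returns the cached value for key and whether it was present.
	Get(key string) (interface{}, bool)
	// Set stores a value and reports whether it was admitted.
	Set(key string, val interface{}) bool
	// ForEach iterates over cached values until the callback returns false.
	ForEach(callback func(value interface{}) bool)
}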
var selectCases []testCase

for tc := range iterateExecFile("dml_cases.txt") {
	if tc.output2ndPlanner != "" {
`output2ndPlanner` is an artifact of this project: #7280. It's the plan generated by the new gen4 planner. For this work, you should just ignore it and add all queries.
Can I assume that the failure in the race detector in CI is a flake? It doesn't look related to any of my code changes.
I think the reason the executor tries twice is to save on parsing, normalization and AST rewriting. I believe last we checked these were non-trivial parts of the full process. Don't trust my memory though - we should measure to see if it is worth double checking or not.
I really like your new cleaner API.
#6067 it's a known flaky test. I've restarted it.
I've just pushed the next two iterative steps that are going to be required for a proper cache implementation. My work for today has consisted of fixing the most glaring shortcoming in the current cache: the fact that the size of the cache is measured in entries as opposed to bytes. Right now, both the `vtgate` and `vttablet` caches are sized by entry count.

Hence, commit d7c4172 introduces two new configuration settings that define the cache size in bytes instead. Lastly, in order to make the new cache limits work in practice, I've had to implement a rough calculation of memory consumption for all cached plans.

My remaining work for today is going to be getting the tests green. 🍏
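To make the byte-based sizing concrete, here is a hypothetical sketch of the bookkeeping it implies (all names, including `CachedSize`, are illustrative; the PR's actual accounting is more involved):

// cachedPlan is a hypothetical interface for entries that can report
// an approximate in-memory footprint in bytes.
type cachedPlan interface {
	CachedSize() int64
}

// sizedCache keeps a running byte total and evicts against a byte
// budget instead of an entry count.
type sizedCache struct {
	maxBytes  int64
	usedBytes int64
	entries   map[string]cachedPlan
}

func (c *sizedCache) set(key string, plan cachedPlan) {
	c.usedBytes += plan.CachedSize()
	for c.usedBytes > c.maxBytes && len(c.entries) > 0 {
		c.evictOne() // shed entries until we're back under budget
	}
	c.entries[key] = plan
}

// evictOne removes an arbitrary entry (a real implementation would
// follow LRU/LFU order) and credits its size back to the budget.
func (c *sizedCache) evictOne() {
	for key, plan := range c.entries {
		c.usedBytes -= plan.CachedSize()
		delete(c.entries, key)
		return
	}
}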
// nullCache is a no-op cache that does not store items
type nullCache struct{}

func (n *nullCache) Get(_ string) (interface{}, bool) {
nit: didn't the linter complain about comments?
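(For context, a complete no-op cache of this shape might look as follows; everything beyond Get is an assumed method set mirroring the interface sketch above.)

// nullCache is a no-op cache that does not store items. Get always
// misses, Set never admits, and ForEach has nothing to iterate.
type nullCache struct{}

func (n *nullCache) Get(_ string) (interface{}, bool)       { return nil, false }
func (n *nullCache) Set(_ string, _ interface{}) bool       { return false }
func (n *nullCache) ForEach(_ func(value interface{}) bool) {}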
go/vt/vtgate/vtgate.go
Outdated
// If the legacy queryPlanCacheSize is set, override the value of the new queryPlanCacheSizeBytes,
// approximating the total size of the cache with the average size of an entry
if *queryPlanCacheSize != 0 {
	*queryPlanCacheSizeBytes = *queryPlanCacheSize * engine.AveragePlanSize
}
should this lead to us logging a warning about a deprecated flag being used?
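Something along these lines would do it; a sketch assuming vitess's go/vt/log package, with illustrative message wording:

if *queryPlanCacheSize != 0 {
	// Warn that the entry-count flag is deprecated before converting
	// it into an approximate byte budget.
	log.Warningf("queryPlanCacheSize is deprecated; set the cache size in bytes via queryPlanCacheSizeBytes instead")
	*queryPlanCacheSizeBytes = *queryPlanCacheSize * engine.AveragePlanSize
}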
go/vt/sqlparser/parsed_query.go
Outdated
if pq == nil {
	return 0
}
return int64(unsafe.Sizeof(ParsedQuery{})) +
historically, we have relegated the use of unsafe to https://github.com/vitessio/vitess/blob/master/go/hack/hack.go. Is it time to let go of that rule, @sougou? (I think it is)
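(For readers unfamiliar with the pattern being discussed, here is a rough illustration of how an unsafe.Sizeof-based estimate composes; the struct is a hypothetical stand-in, not vitess's ParsedQuery.)

import "unsafe"

// example is a hypothetical struct standing in for ParsedQuery.
type example struct {
	Query string
	Args  []int64
}

// approxSize adds the fixed struct header to the heap payloads
// reachable through the string and the slice.
func (e *example) approxSize() int64 {
	if e == nil {
		return 0
	}
	return int64(unsafe.Sizeof(*e)) +
		int64(len(e.Query)) + // string bytes
		int64(cap(e.Args))*int64(unsafe.Sizeof(int64(0))) // slice backing array
}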
Force-pushed from b5b9d54 to c5ee6cf
@systay I keep seeing two test failures that look very unrelated to my changes. Am I wrong here? Are the tests usually this flaky?
@vmg we do suffer from flakiness and are taking flaky tests down as time permits... It's unfortunately not uncommon.
Force-pushed from 8d46c78 to ae52b3c
There is no need to look up a given plan in the cache twice (once with its normalized and once with its non-normalized representation as the key): if the plan is normalized, it'll be stored in the cache in its normalized form. If it's not normalized, it'll be stored in its original form. Either way, the initial lookup with its non-normalized form is redundant. The raw `sql` content of a query only changes if the query has been normalized. In cases where it hasn't, there is no need to look up the same key twice in the cache.

Signed-off-by: Vicent Marti <[email protected]>
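Schematically, the lookup described in this commit reduces to the following (using the Cache interface sketched earlier; normalizeSQL is a hypothetical stand-in for the real normalization step):

// getCachedPlan performs the single lookup: the key is the
// normalized SQL when normalization ran, otherwise the raw SQL.
func getCachedPlan(cache Cache, sql string, normalize bool) (interface{}, bool) {
	key := sql
	if normalize {
		key = normalizeSQL(sql) // hypothetical: rewrites literals into bind variables
	}
	return cache.Get(key)
}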
We have experienced caching issues with batch inserts in Vitess clusters, whose plans were polluting the shared plan cache. Before we can consider the trivial fix for this issue, which would be simply disabling caching for `INSERT` statements, we need to find out the impact of disabling caching for these plans. Unfortunately, it looks like there isn't a significant performance difference between preparing a plan for an INSERT statement vs a SELECT one. Here's the output of two comparisons with a random sample of 32 of each statement:

BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       766   1640575 ns/op   511073 B/op   6363 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    746   1479792 ns/op   274486 B/op   7730 allocs/op
BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       823   1540039 ns/op   496079 B/op   5949 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    798   1526661 ns/op   275016 B/op   7730 allocs/op

There is no noticeable performance difference when preparing the INSERT statements. The only consistent metric is that INSERT statement plans allocate more memory than SELECT plans.

Signed-off-by: Vicent Marti <[email protected]>
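For reference, a comparison of this shape with Go's testing package looks roughly like this; loadSample, buildPlan, and the select-cases filename are hypothetical stand-ins for the real harness:

func BenchmarkSelectVsDML(b *testing.B) {
	dml := loadSample("dml_cases.txt", 32)    // 32 random DML statements
	sel := loadSample("select_cases.txt", 32) // filename assumed

	b.Run("DML (random sample, N=32)", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, q := range dml {
				buildPlan(q) // stand-in for the planner entry point
			}
		}
	})
	b.Run("Select (random sample, N=32)", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, q := range sel {
				buildPlan(q)
			}
		}
	})
}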
The current public API for the cache makes some assumptions that do not hold for more efficient cache implementations with admission policies. The following APIs have been replaced with equivalent ones or removed altogether:

- `LRUCache.SetIfAbsent`: removed, not used
- `LRUCache.Peek`: replaced with `LRUCache.ForEach`, since the original Peek was only used to iterate through the contents of the cache
- `LRUCache.Keys`: likewise replaced with `ForEach`, since the keys were only being accessed for iteration

Signed-off-by: Vicent Marti <[email protected]>
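As a usage sketch, iteration now goes through a callback instead of key enumeration (the element type assertion is an assumption for illustration):

cache.ForEach(func(value interface{}) bool {
	plan := value.(*engine.Plan) // assumed element type
	_ = plan                     // inspect or aggregate the plan here
	return true                  // true means: keep iterating
})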
The `cache.LRUCache` struct has now been abstracted behind a generic Cache interface so that it can be swapped with more complex Cache implementations.

Signed-off-by: Vicent Marti <[email protected]>
The existing pattern for vttablet/vtgate cache configuration is a dangerous practice, because it lets the user configure the number of items that can be stored in the cache, as opposed to the total amount of memory (approximately) that the cache will consume. This makes tuning production systems complicated, and will skew more complex cache implementations that use size-aware eviction policies.

To fix this, we're deprecating the original config settings for cache tuning, and introducing new ones where the total size of the cache is defined in BYTES as opposed to ENTRIES. To maintain backwards compatibility, if the user supplies the legacy config options with number of ENTRIES, we'll calculate an approximate total size for the cache based on the average size of a cache entry for each given cache.

Signed-off-by: Vicent Marti <[email protected]>
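For example, under this scheme a legacy setting of 5,000 entries combined with an assumed average entry size of 1 KiB would translate into a byte budget of roughly 5 MiB (5,000 × 1,024 bytes); the numbers here are illustrative, not vitess defaults.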
Replaced with #7439
Hiii everyone! First PR to Vitess in a while. Let's get into it!
Description
The cache for planned queries, which exists both in the `vtgate`s and the `vttablet`s, has not been behaving well. We have had to disable these caches altogether because they were causing OOM errors. We believe the main cause triggering these errors is bulk insert queries, which come in different sizes and statement counts, and hence occupy too much cache space even after normalization.

After some pairing with @sougou, we have considered the possibility of disabling caching altogether for `INSERT` statements, since these are the ones known to cause particular problems. Before doing this, I promised to gather some numbers on how expensive the query plans for these statements are to prepare. The results are not encouraging: on average, `INSERT` statements are just as expensive as any other statement to plan (as seen in the commit message for fe240ec), so disabling their caching is going to cause a performance regression.

Next Steps
Despite the fact that we've tracked down the OOM issues to batch-`INSERT` queries, I think the underlying issue is clear upon reviewing the implementation of our cache: the main problem we're facing (which batch inserts definitely exacerbate) is that our plan cache is too primitive for our use case. Most notably, it does not have an admission policy. Caches for database systems have historically always had an admission policy, whose goal is preventing extreme corner cases from taking over the cache. In our case, that corner case is batch-insert queries; an equivalent in a traditional relational database system would be a full-table scan, which would page into cache a lot of pages that will only be read once. These kinds of pathological access patterns cause cache pollution by bringing into cache a lot of data that is never going to be read again.

My belief is that we are going to see a significant performance improvement by replacing the `LRUCache` implementation in Vitess with an LFU implementation whose eviction policy cannot be trivially polluted (see the sketch below for the general idea). I think this would be a minimal effort which I'd like to undertake next week, and it should both fix the cache memory growth issues we're seeing and improve cache performance overall, since the current implementation with a map + linked list uses a single global lock for the whole cache, which right now is actively being contended by all the query goroutines in a `vtgate`. Obviously, it would also allow us to enable `INSERT` plan caching again, since batch inserts ought to be a corner case that will no longer pollute our cache. This is something which I intend to verify in an integration test.

@sougou @enisoc I would like your feedback on this before I get started on Monday.
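To make the admission-policy idea concrete, here is a conceptual sketch of frequency-based admission (the general TinyLFU-style idea, not the implementation this PR will necessarily adopt):

// admissionPolicy sketches frequency-based admission: a candidate is
// only admitted if it is at least as "hot" as the entry it would
// evict, so one-off queries (e.g. giant batch INSERTs) never displace
// useful entries.
type admissionPolicy struct {
	// In practice this would be a compact frequency sketch
	// (e.g. count-min), not an unbounded map.
	freq map[string]uint64
}

// touch records an access to key.
func (a *admissionPolicy) touch(key string) {
	a.freq[key]++
}

// admit reports whether candidate should replace victim.
func (a *admissionPolicy) admit(candidate, victim string) bool {
	return a.freq[candidate] >= a.freq[victim]
}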
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: