LFU Cache Implementation #7439
Conversation
There is no need to look up a given plan in the cache twice (once with its normalized and once with its non-normalized representation as the key): if the plan is normalized, it is stored in the cache in its normalized form; if it is not, it is stored in its original form. Either way, the initial lookup with the non-normalized form is redundant. The raw `sql` content of a query only changes when the query has been normalized, so when it hasn't, there is no need to look up the same key twice in the cache. Signed-off-by: Vicent Marti <[email protected]>
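For illustration, here is a minimal sketch of the single-lookup flow described above; the names (`planCache`, `normalize`, `getOrBuildPlan`) are hypothetical stand-ins, not the actual Vitess code.

```go
package plancache

// planCache is a stand-in for the plan cache interface.
type planCache interface {
	Get(key string) (interface{}, bool)
	Set(key string, value interface{})
}

// normalize stands in for the real query normalizer; it reports whether the
// query text was rewritten.
func normalize(sql string) (string, bool) { return sql, false }

func getOrBuildPlan(c planCache, sql string, build func(string) interface{}) interface{} {
	key := sql
	if normalized, changed := normalize(sql); changed {
		key = normalized // a normalized plan is stored under its normalized key
	}
	// A single lookup: the key matches whichever form the plan was stored under.
	if plan, ok := c.Get(key); ok {
		return plan
	}
	plan := build(key)
	c.Set(key, plan)
	return plan
}
```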
We have experienced caching issues with batch inserts in Vitess clusters, whose plans were polluting the shared plan cache. Before we can consider the trivial fix for this issue, which would be simply disabling caching for `INSERT` statements, we need to find out what the impact of disabling caching for these plans would be. Unfortunately, it looks like there isn't a significant performance difference between preparing a plan for an INSERT statement vs a SELECT one. Here's the output of two comparisons with a random sample of 32 statements of each kind:

BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       766   1640575 ns/op   511073 B/op   6363 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    746   1479792 ns/op   274486 B/op   7730 allocs/op
BenchmarkSelectVsDML/DML_(random_sample,_N=32)-16       823   1540039 ns/op   496079 B/op   5949 allocs/op
BenchmarkSelectVsDML/Select_(random_sample,_N=32)-16    798   1526661 ns/op   275016 B/op   7730 allocs/op

There is no noticeable performance difference when preparing the INSERT statements. The only consistent metric is that INSERT statement plans allocate more memory than SELECT plans. Signed-off-by: Vicent Marti <[email protected]>
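For reference, a comparison like the one above can be structured with Go's standard benchmark harness; `buildPlan` and the sample queries below are placeholders, not the real planner entry points.

```go
package plancost

import "testing"

// buildPlan is a placeholder for the real planner entry point.
func buildPlan(sql string) int { return len(sql) }

func benchmarkSample(b *testing.B, queries []string) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buildPlan(queries[i%len(queries)])
	}
}

func BenchmarkSelectVsDML(b *testing.B) {
	dml := []string{"insert into t(a) values (1), (2), (3)"}
	selects := []string{"select a from t where id = 1"}
	b.Run("DML (random sample)", func(b *testing.B) { benchmarkSample(b, dml) })
	b.Run("Select (random sample)", func(b *testing.B) { benchmarkSample(b, selects) })
}
```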
The current public API for the cache makes some assumptions that do not hold for more efficient cache implementations with admission policies. The following APIs have been replaced with equivalent ones or removed altogether:

- `LRUCache.SetIfAbsent`: removed, not used
- `LRUCache.Peek`: replaced with `LRUCache.ForEach`, since the original `Peek` was only used to iterate through the contents of the cache
- `LRUCache.Keys`: likewise replaced with `ForEach`, since the keys were only being accessed for iteration

Signed-off-by: Vicent Marti <[email protected]>
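A rough sketch of what the `ForEach`-style iteration looks like, assuming a simplified internal representation; the fields shown here are illustrative and not the actual `LRUCache` internals.

```go
package cache

type entry struct {
	key   string
	value interface{}
}

// LRUCache here is a simplified stand-in; the real cache keeps a linked
// list plus a map rather than a plain slice.
type LRUCache struct {
	entries []entry
}

// ForEach calls the given function for every cached value; returning false
// from the callback stops the iteration early.
func (c *LRUCache) ForEach(callback func(value interface{}) bool) {
	for _, e := range c.entries {
		if !callback(e.value) {
			return
		}
	}
}
```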
The `cache.LRUCache` struct has now been abstracted behind a generic Cache interface so that it can be swapped with more complex Cache implementations. Signed-off-by: Vicent Marti <[email protected]>
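As a sketch, a unified cache interface along these lines would let the LRU and LFU implementations be swapped freely; the exact method set below is illustrative and may not match the final Vitess interface.

```go
package cache

// Cache is a sketch of a generic cache interface that both the LRU and LFU
// implementations could satisfy.
type Cache interface {
	Get(key string) (interface{}, bool)
	Set(key string, value interface{}) bool
	Delete(key string)
	Clear()
	Len() int
	ForEach(callback func(value interface{}) bool)
}
```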
The existing pattern for vttablet/vtgate cache configuration is a dangerous practice, because it lets the user configure the number of items that can be stored in the cache, as opposed to the total amount of memory (approximately) that the cache will consume. This makes tuning production systems complicated, and it will skew more complex cache implementations that use size-aware eviction policies. To fix this, we're deprecating the original config settings for cache tuning and introducing new ones where the total size of the cache is defined in BYTES as opposed to ENTRIES. To maintain backwards compatibility, if the user supplies the legacy config options with a number of ENTRIES, we'll calculate an approximate total size for the cache based on the average size of a cache entry for each given cache. Signed-off-by: Vicent Marti <[email protected]>
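A minimal sketch of that backwards-compatibility rule, assuming a hypothetical per-cache average entry size; none of these names or values are the actual Vitess configuration code.

```go
package cachecfg

// assumedAveragePlanSize is an assumption for this sketch, not a measured
// Vitess value.
const assumedAveragePlanSize = 1024 // bytes per cached entry

// effectiveCacheBytes returns the memory budget to configure a cache with:
// an explicit byte budget wins; otherwise a legacy entry count is converted
// into an approximate total size.
func effectiveCacheBytes(legacyEntries, explicitBytes int64) int64 {
	if explicitBytes > 0 {
		return explicitBytes // new-style, byte-based configuration
	}
	return legacyEntries * assumedAveragePlanSize // legacy, entry-based configuration
}
```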
go/cache/cache.go
// Regardless of the default value used here, the user can always override Vitess' configuration to
// force a specific cache type (e.g. when passing a value in ENTRIES to vtgate, the service will use
// a LRU cache).
const DefaultCacheSize = SizeInEntries(5000)
This is the global toggle that will allow us to switch the default cache implementation.
This is clever! Correct me if I'm wrong, but it seems to me that if we toggle this, then the code in `vtgate.go` would have to change to allow use of the `Entries` option.
Correct! I should probably flag that. The code in `vtgate.go` is written so that when the LRU cache is the default, setting an LFU cache flag overrides it. The opposite needs to be true when the LFU cache is the default.
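A hedged sketch of that flag-resolution logic (hypothetical names, not the real `vtgate.go` code): whichever cache type is not the default must still be selectable through its explicit flag.

```go
package gateconfig

// cacheFlags models the two sizing options a user can pass on the command line.
type cacheFlags struct {
	entries int64 // legacy, entry-based sizing (implies the LRU cache)
	bytes   int64 // new, byte-based sizing (implies the LFU cache)
}

// selectCacheType picks the cache implementation: an explicit flag always
// wins over the default, regardless of which implementation is the default.
func selectCacheType(defaultIsLFU bool, f cacheFlags) string {
	switch {
	case f.bytes > 0:
		return "lfu" // explicit byte budget opts in to the LFU cache
	case f.entries > 0:
		return "lru" // explicit entry count opts in to the LRU cache
	case defaultIsLFU:
		return "lfu"
	default:
		return "lru"
	}
}
```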
Open question: how do we ensure that the new cache implementation stays green in CI? I've manually verified that switching to the LFU cache by default keeps all the tests green (as it's fully backwards compatible), but ideally there would be a separate test suite runner that uses the new cache by default. The alternative would be to duplicate all cache-related unit tests so they run in a single suite with both cache implementations. I don't quite like that idea, because many side effects of cache behavior are exercised in end-to-end tests, and those wouldn't run constantly if we simply added extra unit tests for the new cache.
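For illustration, the single-suite alternative could look roughly like this table-driven test, with a trivial map-based stub standing in for the real LRU and LFU constructors (everything here is a sketch, not the actual test code):

```go
package cachetest

import "testing"

// Cache is the shared behavior under test; in the real suite this would be
// the go/cache interface.
type Cache interface {
	Set(key string, value interface{}) bool
	Get(key string) (interface{}, bool)
}

// mapCache is a placeholder; the real test would construct the actual LRU and
// LFU caches instead.
type mapCache map[string]interface{}

func (m mapCache) Set(k string, v interface{}) bool { m[k] = v; return true }
func (m mapCache) Get(k string) (interface{}, bool) { v, ok := m[k]; return v, ok }

func TestBothCacheImplementations(t *testing.T) {
	impls := map[string]func() Cache{
		"lru": func() Cache { return mapCache{} }, // placeholder for the LRU constructor
		"lfu": func() Cache { return mapCache{} }, // placeholder for the LFU constructor
	}
	for name, build := range impls {
		t.Run(name, func(t *testing.T) {
			c := build()
			c.Set("k", "v")
			if v, ok := c.Get("k"); !ok || v != "v" {
				t.Fatalf("%s: expected to read back the stored value", name)
			}
		})
	}
}
```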
If we think this is a better implementation, it is perfectly reasonable to default to the new cache while allowing users to choose the old cache through a command-line option.
There is no perfect answer to this. If we agree that we should make the new cache the default, then the compromise solution would be
I sure hope so! Otherwise this project would have been kind of pointless. :)
Merged master, which was conflicting after some stats changes.
Good news everyone! I brought The Good Graphs (TM) to prove this is a better implementation. I've synthesized a test environment to reproduce the common case of cache pollution we've been seeing in production environments. It consists of two query streams: one performs queries sampled from a fixed subset (i.e. it plays the same queries over and over again in a random sequence, similar to a production system, which has a fixed set of SQL queries being performed), and the other performs essentially random SQL queries, which would (in theory) clutter our cache. In an ideal system, the hit rate for the first stream would be near 100%, whilst the hit rate for the second one must necessarily be 0, since random queries will never hit the cache twice. Let's take a look at the behavior of our hit rates with the old and new cache implementations:

Old cache implementation (LRU)

The orange line represents the stream of polluting queries ramping up. Its effect on the hit rate for "good queries" (the purple line) is immediately noticeable. As soon as we start executing polluting queries, the hit rate for the good queries takes a nosedive: from 90%+ to 40%. The cache has been taken over by the random queries. As soon as we're done executing the polluting queries (when the orange line reaches 100%), the good queries start taking over the cache again, but they do so much more slowly than at the beginning. The cache, at that point, is a random mix of "good" production queries and pollution, and as more good queries come in, our LRU algorithm doesn't know which queries to evict to make space for them, so sometimes it evicts pollution and sometimes it evicts good queries, further delaying the recovery of our cache back to 90%+ hit rates.

Also worth noting: the response times for the good queries are through the roof: they are always higher than the response times for the polluting queries, even though the polluting queries have a 0% hit rate. This is because the good queries (composed of SELECTs in this synthetic benchmark) take longer to plan than the polluting queries, which are simple inserts. As soon as all the polluting queries are done executing, the response time for good queries improves dramatically, because they're no longer being planned; they're being served from cache. This is also evidenced by the progress of the two query streams (blue and orange lines): you can see how, as soon as the polluting query stream starts ramping up, the good queries significantly slow down, a direct result of the reduced cache hit rates.

New cache implementation (LFU)

Now this is podracing. Things worth pointing out: the LFU cache handles the onslaught of polluting queries like a champ! Our hit rate for good queries never drops below 85%. It degrades very slowly over time, as the LFU admission algorithm is probabilistic and hence makes mistakes, but the degradation is extremely slow. Very few polluting queries make it into the cache!

Likewise, look at the time it takes to execute the bulk of polluting queries (as measured by the ramp/inclination of the orange line). You can see that the execution time of good queries (blue line) is not affected by the polluting query stream, and that the polluting stream ramps up significantly more slowly than with the LRU cache. If we look at the scatter points for response times, we can see that the performance of good vs. polluting queries has swapped places: the good queries are now consistently faster than the polluting ones, as they should be, because they're always being served from cache.

Overall, I think these graphs prove the new implementation is a big win and meets all the success criteria I had designed beforehand. Really happy with the results here -- definitely good enough to switch the new LFU cache to be the new default in this PR, which I intend to do tomorrow morning. 👌
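For context on how the two streams were shaped, here is a minimal sketch of the idea (illustrative names, not the actual benchmark harness): the good stream replays a fixed query set, while the polluting stream never repeats a statement and therefore can never hit the cache twice.

```go
package pollution

import (
	"fmt"
	"math/rand"
)

// goodQuery replays queries from a fixed set, so repeated executions should
// hit the plan cache.
func goodQuery(fixed []string) string {
	return fixed[rand.Intn(len(fixed))]
}

// pollutingQuery is unique on every call, so its cache hit rate is 0% by
// construction.
func pollutingQuery(i int) string {
	return fmt.Sprintf("INSERT INTO junk (id) VALUES (%d)", i)
}
```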
Really nice stuff. Could only find minor details and nitpicks.
Yes! This empty Go assembly file tricks the Go compiler into thinking that the other Go files in the local package have external linkage dependencies (which they do -- they're trying to link against the local runtime). I've updated the file with a comment that explains why it's there.
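For readers unfamiliar with the trick, here is a hedged sketch of how such a package is typically laid out (file and symbol names are illustrative, not the actual Vitess files):

```go
// A package that links against runtime internals declares a bodiless function
// via go:linkname; an empty .s file in the same package tells the compiler
// that the missing body is provided externally.
package cachehack

import _ "unsafe" // required for //go:linkname

//go:linkname nanotime runtime.nanotime
func nanotime() int64

// Alongside this file the package contains an empty assembly file
// (e.g. empty.s); without it, the compiler rejects nanotime with a
// "missing function body" error.
```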
@systay I've just handled all the feedback and improved the way the cache is configured. I would like to merge this as-is, with the legacy LRU cache as the default, and then open a separate PR to switch to the new cache, if that seems reasonable to you.
This would make for a very cool blog post 😉
Hi everyone!

Here's the first version of the new LFU cache implementation, ready to review.

For context: the cache for planned queries, which exists both in the `vtgates` and the `vttablets`, has not been behaving well. We have had to disable these caches altogether because they were causing OOM errors. We believe the main cause of these errors is bulk insert queries, which come with widely varying statement counts and hence occupy too much cache space, even after being normalized before caching.

In order to fix this, and several other issues I've found while reviewing the code, I'm bringing in a brand new cache implementation. There are 3 key features of this new cache implementation which are strict improvements over the old one:

In order to bring the new cache implementation into the codebase, I've refactored all cache access behind interfaces. This required some minor changes to the `LRUCache` API, although its behavior should be identical to the one it had before (the tests verify just that). Afterwards, I brought in a new cache implementation that implements the same API as our new unified cache interface. The new implementation is a heavily modified version of Ristretto, which is an Apache v2 licensed library. I briefly considered bringing in the cache as an external dependency, but it required so many modifications to match the exact behaviors that Vitess expects that I leaned towards importing it whole (it clocks in at a few thousand lines of code, including extensive tests).

Lastly, I updated the places where caches are created so that it's always safe to swap between the LRU and LFU cache implementations. To ensure backwards compatibility, this PR does not enable the new cache by default, so its behavior should be identical to what's in the trunk right now. A global switch in `cache.go` controls the default value for cache capacity: changing that default from entries to memory bytes will instantly switch the default configuration, command-line arguments, tests, etc. to use the LFU cache. Of course, regardless of the default, users can explicitly specify a memory usage in bytes to opt in to the new cache behavior.

Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: