perf: zero-copy tokenizer #7619
Conversation
Very confused about this error btw: this file is not on my local checkout of Vitess 👻
You probably need to rebase with the latest master.
Thank you for showing me the way senpai 🙇‍♂️
This is awesome!
🤯
Signed-off-by: Vicent Marti <[email protected]>
Green! Ready for review.
go/vt/sqlparser/token.go
func (tkn *Tokenizer) scanStringSlow(buffer *bytes2.Buffer, delim uint16, typ int) (int, []byte) {
Adding a few comments to the methods would be nice. It's not very easy to understand when we need to fall back on the slow methods, for example.
Very nice! I like.
Agree that comments are desired for many functions.
Added comments for all the scanner functions, which should make the whole tokenization process more obvious.
☠️⚠️ WARNING: This is a very spicy PR. Please acquire a glass of milk 🥛 before reading ⚠️ ☠️
Description
Alright, here's the spiciest thing I've managed to do to the `sqlparser` this week. This is a revival of @GuptaManan100's #6898, which didn't quite work out and wasn't merged in the end. The goal of the original PR was to make the `sqlparser.Tokenizer` work on `string`s, as opposed to `[]byte` slices, in order to optimize the input to the parser system. Since the input SQL queries in Vitess are always stored in `string`s, and our tokenizer was operating on `[]byte`, creating a tokenizer means copying the whole string before each parse (for those following at home: this is because `string` in Go is immutable; if we convert a `string` to a `[]byte`, they cannot share the same underlying storage, because the new slice is mutable and writing to it would change the contents of the source `string`).

Changing the whole system to use `string` was an interesting optimization because it removed the copy of the input queries, but when benchmarked, the resulting parser was actually slower in all cases. Why did the optimization fail? Because the original PR added more byte copies than it removed. The parser implementation as it stands today is not zero-copy: the lexer creates a small buffer for every single token that needs to be passed to the parser, and as the bytes of the SQL query are processed, they're appended to this temporary buffer; once a token is complete, the buffer is returned.

This is what caused the massive slowdown in #6898: even though we were no longer performing the initial copy when creating a Tokenizer, every single token yielded by the tokenizer meant creating a temporary `[]byte` buffer to push the token into, and then creating another extra copy of that temporary buffer to return it as a `string`! The original parser didn't have to create the extra copy of each token because it returned the temporary buffer directly, but if we want to return `string`s, we need two copies per token instead of one. Oooh! So that's why we didn't see a performance win!

The original #6898 attempts to work around this double-copy issue by implementing a copy-on-write system in the Tokenizer. The idea is, essentially, to reuse the temporary buffers as much as possible, amortizing the cost of allocating the bytes, even though we still have to copy each buffer to turn it into a `string`. This is a good effort, but exceedingly hard to get right, so the benchmark results were inconclusive: some of the parser benchmarks got slower and some got faster, depending on the patterns of buffer reuse.

Implementation details for this PR
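For context before diving in, here's a minimal standalone Go sketch (not Vitess code) of the copy problem described above: converting a `string` to a `[]byte` must allocate a fresh copy, because the slice is mutable and the string is not, and converting back copies again.

```go
package main

import "fmt"

func main() {
	query := "SELECT 1"

	// string -> []byte must copy: the slice is mutable,
	// the string is not, so they can't share storage.
	buf := []byte(query)
	buf[7] = '2'

	fmt.Println(query)       // prints "SELECT 1": the original string is untouched
	fmt.Println(string(buf)) // prints "SELECT 2": string(buf) copies again, the other way
}
```

This is exactly the per-parse copy (and, in #6898, the per-token copy) that the rewrite below avoids.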
So, what is the right way to implement this optimization? In order for `string` tokenization to be worth it at all, we need a 1-2 punch, which is what this PR attempts. 927487a rewrites the tokenizer so it becomes zero-copy even though it still uses `[]byte` everywhere. Anything that creates temporary buffers to store tokens is going to result in a slowdown when we switch to `string`-based tokenization, so we need to get rid of buffers everywhere. It turns out this is doable, although it requires a slight rewrite of the tokenizer. With some classical compiler theory design (i.e. a byte tokenizer that only peeks and skips), we can tokenize any SQL by returning slices off the original input byte slice. These slices do not allocate because their underlying storage is the same as the input SQL buffer. The only exception to this? Strings and identifiers that contain escape sequences. These escape sequences need to be unescaped before being yielded by the tokenizer, but this is the exceptional case: we can design the tokenizer so that by default SQL "strings" are returned directly from the input buffer, and we only fall back to allocating a temporary buffer the first time we find an escape character in the given string.

It turns out that once the tokenizer is re-implemented to be zero-copy, this optimization is more impactful than actually switching to `string`-based tokenization. In fact, the `string` tokenization is the cherry on top: it becomes trivial to implement once the tokenizer has been made zero-copy, giving another small performance increase for free, and IMO much cleaner APIs.

The only tricky part of the port to `string`-based tokenization is the temporary buffers for escaped strings. The original tokenizer and the zero-copy bytes tokenizer use the efficient `bytes2.Buffer` implementation to allocate these temporary buffers, but when switching to `string`-based tokenization, it's important to switch to the standard library's `strings.Builder`, as it performs a very important optimization for us: since the underlying temporary buffer of the builder will never be accessed directly, it can cast its underlying storage to a `string` using `unsafe` code, as opposed to copying it safely, giving us a significant reduction in allocations in pathological queries where all SQL strings contain escape sequences.

Benchmark tables
Alright alright, and now for the moment you've been waiting for:

yeeeah now that's some good shit. We're looking at 6 benchmarks, which I've improved for this particular PR. The baseline already includes the perfect table lookups PR from yesterday even though I haven't merged it yet. If we compare this branch against `master`, it's another 5% even faster on average.

To note:

- `django_queries.txt` is a sample trace of queries from a Django web application, so these are realistic production queries. This is a 25% performance win in a real-world scenario, and since this is a full AST parse benchmark, we're shaving 25% off the whole parser pipeline just from optimizing the tokenizer.
- `lobsters.sql.gz` is a sample trace of queries from a real Rails application (https://lobste.rs), a Reddit-like site aggregator. I got it running locally and dumped the MySQL logs to get a real-world sample of ActiveRecord-generated SQL -- something we really want to be optimizing for. The result is a 22.24% improvement, almost as good as the Django queries. The reduction in generated garbage is massive, however: 4MB (30%) less heap allocation per benchmark run. This is going to have a significant impact when running Vitess with real-world Rails applications by reducing GC churn and GC times.
- `sql0` and `sql1` are the two classic queries that the `sqlparser` package has always used for benchmarking. Nothing too interesting about these queries; they're synthetic, and yet they get significantly faster.
- `Parse3/normal` and `escaped` are pathological queries with huge SQL strings embedded in them. `normal` has plaintext strings and `escaped` is the worst-case scenario where every single string contains escape characters. The results in `normal` are outrageous because the zero-copy tokenizer is just returning huge chunks of the original input without allocating a single byte: a 50% performance increase and --of course-- a 99% reduction (from 1.81mb to a few bytes) in memory allocations per parse. The `escaped` case, which is truly pathological, manages to parse massive strings with escape sequences by lazy-copying them, so the result is just as fast as the old implementation while reducing memory allocations significantly (this is thanks to the `strings.Builder` trick to skip double copies).

Overall, this is significantly more performance than I thought it would be possible to squeeze out of just the tokenizer. I still have to tackle the actual `yacc` parser next week and can't wait to see what comes out of that.

Thanks to @GuptaManan100 for the original inspiration for this PR! Really happy we can land a version of #6898, particularly with the ergonomic improvements it implies for internal Vitess APIs, which now take strings instead of byte slices.
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: