perf: zero-copy tokenizer #7619
Conversation
Very confused about this error btw: this file is not on my local checkout of Vitess 👻
You probably need to rebase with the latest master.
Thank you for showing me the way senpai 🙇‍♂️
This is awesome!
🤯
Signed-off-by: Vicent Marti <[email protected]>
Green! Ready for review.
go/vt/sqlparser/token.go
func (tkn *Tokenizer) scanStringSlow(buffer *bytes2.Buffer, delim uint16, typ int) (int, []byte) {
Adding a few comments to the methods would be nice. It's not very easy to understand when we need to fall back on the slow methods, for example.
Very nice! I like.
Agree that comments are desired for many functions.
Added comments for all the scanner functions, which should make the whole tokenization process more obvious.
☠️⚠️ WARNING: This is a very spicy PR. Please acquire a glass of milk 🥛 before reading ⚠️ ☠️
Description
Alright, here's the spiciest thing I've managed to do to the `sqlparser` this week. This is a revival of @GuptaManan100's #6898, which didn't quite work out and wasn't merged in the end. The goal of the original PR was to make the `sqlparser.Tokenizer` work on `string`s, as opposed to `[]byte` slices, in order to optimize the input to the parser system. Since the input SQL queries in Vitess are always stored in `string`s, and our tokenizer was operating on `[]byte`, creating a tokenizer means copying the whole string before each parse (for those following at home: this is because `string` in Go is immutable; if we convert a `string` to a `[]byte`, they cannot share the same underlying storage, because the new slice is mutable and writing to it would change the contents of the source `string`).

Changing the whole system to use `string` was an interesting optimization because it removed the copy of the input queries, but when benchmarked, the resulting parser was actually slower in all cases. Why did the optimization fail? Because the original PR added more byte copies than it removed. The parser implementation as it stands today is not zero-copy: the lexer creates a small buffer for every single token that needs to be passed to the parser, and as the bytes of the SQL query are processed, they're appended to this temporary buffer; once a token is complete, the buffer is returned.

This is what caused the massive slowdown in #6898: even though we were no longer performing the initial copy when creating a Tokenizer, every single token yielded by the tokenizer meant creating a temporary `[]byte` buffer to push the token into, and then creating another extra copy of that temporary buffer to return it as a `string`! The original parser didn't have to create the extra copy of each token because it returned the temporary buffer directly, but if we want to return `string`s, we need two copies per token instead of one. Oooh! So that's why we didn't see a performance win!

The original #6898 attempts to work around this double-copy issue by implementing a copy-on-write system in the Tokenizer. The idea is, essentially, to reuse the temporary buffers as much as possible, amortizing the cost of allocating the bytes, even though we still have to copy each buffer to turn it into a `string`. This is a good effort, but exceedingly hard to get right, so the benchmark results were inconclusive: some of the parser benchmarks got slower and some got faster, depending on the patterns of buffer reuse.

Implementation details for this PR
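For context before diving in, here's a minimal standalone Go sketch (not Vitess code) of the copy problem described above: converting a `string` to a `[]byte` must allocate a fresh copy, because the slice is mutable and the string is not, and converting back copies again.

```go
package main

import "fmt"

func main() {
	query := "SELECT 1"

	// string -> []byte must copy: the slice is mutable,
	// the string is not, so they can't share storage.
	buf := []byte(query)
	buf[7] = '2'

	fmt.Println(query)       // prints "SELECT 1": the original string is untouched
	fmt.Println(string(buf)) // prints "SELECT 2": string(buf) copies again, the other way
}
```

This is exactly the per-parse copy (and, in #6898, the per-token copy) that the rewrite below avoids.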
So, what is the right way to implement this optimization? In order for `string` tokenization to be worth it at all, we need a 1-2 punch, which is what this PR attempts. 927487a rewrites the tokenizer so it becomes zero-copy even though it still uses `[]byte` everywhere. Anything that creates temporary buffers to store tokens is going to result in a slowdown when we switch to `string`-based tokenization, so we need to get rid of buffers everywhere. It turns out this is doable, although it requires a slight rewrite of the tokenizer. With some classical compiler theory design (i.e. a byte tokenizer that only peeks and skips), we can tokenize any SQL by returning slices off the original input byte slice. These slices do not allocate because their underlying storage is the same as the input SQL buffer. The only exception to this? Strings and identifiers that contain escape sequences. These escape sequences need to be unescaped before being yielded by the tokenizer, but this is the exceptional case: we can design the tokenizer so that by default SQL "strings" are returned directly from the input buffer, and we only fall back to allocating a temporary buffer the first time we find an escape character in the given string.

It turns out that once the tokenizer is re-implemented to be zero-copy, this optimization is more impactful than actually switching to `string`-based tokenization. In fact, the `string` tokenization is the cherry on top: it becomes trivial to implement once the tokenizer has been made zero-copy, giving another small performance increase for free, and IMO much cleaner APIs.

The only tricky part of the port to `string`-based tokenization is the temporary buffers for escaped strings. The original tokenizer and the zero-copy bytes tokenizer use the efficient `bytes2.Buffer` implementation to allocate these temporary buffers, but when switching to `string`-based tokenization, it's important to switch to the standard library's `strings.Builder`, as it performs a very important optimization for us: since the underlying temporary buffer of the builder will never be accessed directly, it can cast its underlying storage to a `string` using `unsafe` code, as opposed to copying it safely, giving us a significant reduction in allocations in pathological queries where all SQL strings contain escape sequences.

Benchmark tables
Alright alright, and now for the moment you've been waiting for:

yeeeah now that's some good shit. We're looking at 6 benchmarks, which I've improved for this particular PR. The baseline already includes the perfect table lookups PR from yesterday even though I haven't merged it yet. If we compare this branch against `master`, it's another 5% even faster on average.

To note:

- `django_queries.txt` is a sample trace of queries from a Django web application, so these are realistic production queries. This is a 25% performance win in a real-world scenario, and since this is a full AST parse benchmark, we're shaving 25% off the whole parser pipeline just from optimizing the tokenizer.
- `lobsters.sql.gz` is a sample trace of queries from a real Rails application (https://lobste.rs), a Reddit-like site aggregator. I got it running locally and dumped the MySQL logs to get a real-world sample of ActiveRecord-generated SQL -- something we really want to be optimizing for. The result is a 22.24% improvement, almost as good as the Django queries. The reduction in generated garbage is massive, however: 4MB (30%) less heap allocation per benchmark run. This is going to have a significant impact when running Vitess with real-world Rails applications by reducing GC churn and GC times.
- `sql0` and `sql1` are the two classic queries that the `sqlparser` package has always used for benchmarking. Nothing too interesting about these queries; they're synthetic, and yet they get significantly faster.
- `Parse3/normal` and `escaped` are pathological queries with huge SQL strings embedded in them. `normal` has plaintext strings and `escaped` is the worst-case scenario where every single string contains escape characters. The results in `normal` are outrageous because the zero-copy tokenizer is just returning huge chunks of the original input without allocating a single byte: a 50% performance increase and --of course-- a 99% reduction (from 1.81mb to a few bytes) in memory allocations per parse. The `escaped` case, which is truly pathological, manages to parse massive strings with escape sequences by lazy-copying them, so the result is just as fast as the old implementation while reducing memory allocations significantly (this is thanks to the `strings.Builder` trick to skip double copies).

Overall, this is significantly more performance than I thought it would be possible to squeeze out of just the tokenizer. I still have to tackle the actual `yacc` parser next week and can't wait to see what comes out of that.

Thanks to @GuptaManan100 for the original inspiration for this PR! Really happy we can land a version of #6898, particularly with the ergonomic improvements it implies for internal Vitess APIs, which now take strings instead of byte slices.
Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: