Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: zero-copy tokenizer #7619

Merged
merged 6 commits into from
Mar 10, 2021
Merged

perf: zero-copy tokenizer #7619

merged 6 commits into from
Mar 10, 2021

Conversation

vmg
Copy link
Collaborator

@vmg vmg commented Mar 5, 2021

☠️ ⚠️ WARNING: This is a very spicy PR. Please acquire a glass of milk 🥛 before reading ⚠️ ☠️

Description

Alright, here's the spiciest thing I've managed to do to the sqlparser this week. This is a revival of @GuptaManan100's #6898, which didn't quite work out and wasn't merged at the end. The goal of the original PR was to make the sqlparser.Tokenizer work on strings, as opposed to []byte slices, in order to optimize the input to the parser system. Since the input SQL queries in Vitess are always stored in strings, and our tokenizer was operating in []byte, creating a tokenizer means copying the whole string before each parse (for those following at home: this is because string in Go is immutable; if we convert a string to a []byte, they cannot share the same underlying storage because the new slice is actually mutable, and writing to it would change the contents of the source string).

Changing the whole system to use string was an interesting optimization because it removed the copy of the input queries, but when benchmarked, the resulting parser was actually slower in all cases. Why did the optimization fail? The reason is because the original PR was adding more byte-copies than it was removing. The parser implementation as it is right now is not zero-copy. The Lexer creates small buffers for every single token that needs to be passed to the Parser, and as the bytes of the SQL query are processed, they're added to this temporary buffer and once a token is completed in the buffer, the buffer is returned.

This is what caused the massive slowdown in #6898: even though now we were not performing the initial copy when creating a Tokenizer, every single token yielded by the tokenizer meant creating a temporary []byte buffer to push the token and then creating another extra copy of the temporary buffer to return it as a string! The original parser didn't have to create the extra copy of each token because it was returning the temporary buffer directly, but if we want to return strings, we need two copies per token instead of one! Oooh! So that's why we didn't see a performance win!

The original #6898 attempts to work around this double-copy issue by implementing a copy-on-write system on the Tokenizer. The idea is, essentially, reusing the temporary buffers as much as possible, so we're amortizing the cost of allocating the bytes, even though we still have to create copies of the buffer to turn them into string. This is a good effort, but exceedingly hard to get right, so the results in the benchmarks were inconclusive: some of the parser benchmarks were made slower and some others were made faster, depending on the patterns of buffer reuse.

Implementation details for this PR

So, what is the right way to implement this optimization? In order for string tokenization to be worth it at all, we need a 1-2 punch, which is what this PR attempts. 927487a rewrites the tokenizer so it becomes zero-copy even though it still uses []byte everywhere. Anything that is actually creating temporary buffers to store tokens in them is going to result in a slowdown when we switch to string-based tokenization, so we need to get rid of buffers everywhere. It turns out this is doable, although it requires a slight rewrite of the tokenizer. With some classical compiler theory design (i.e. a byte tokenizer that only peeks and skips), we can tokenize any SQL by returning slices off the original input byte slice. These slices do not allocate because their underlying storage is the same as the the input SQL buffer. The only exception to this? Strings and identifiers that contain escape sequences in them. These escape sequences need to be unescaped before being yielded by the tokenizer, but this is the exceptional case: we can design the tokenizer so that by default SQL "strings" are returned directly from the input buffer, and we only fall back to allocating a temporary buffer the first time we find a escape character in the given string.

It turns out that once the tokenizer is re-implemented to be zero-copy, this optimization is more impactful than actually switching to string-based tokenization. In fact, the string tokenization is the cherry on top, and it becomes trivial to implement once the tokenizer has been made zero-copy, giving another small performance increase for free, and IMO much cleaner APIs.

The only tricky thing about the port to using strings for tokenization is the temporary buffers for escaped strings. The original tokenizer, and the zero-copy bytes tokenizer use the efficient bytes2.Buffer implementation to allocate these temporary buffers, but when switching to string-based tokenization, it's important to switch to the standard library's strings.Builder, as it does a very important optimization for us: since the underlying temporary buffer for the builder will never be accessed directly, it can cast its underlying storage to a string using unsafe code, as opposed to copying it safely, giving us a significant reduction in allocations in pathological queries where all SQL strings contain escape sequences.

Benchmark tables

Alright alright and now for the moment you've been waiting for:

name                               old time/op    new time/op    delta
ParseTraces/django_queries.txt-16    3.27ms ± 4%    2.47ms ± 2%  -24.54%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16        137ms ± 3%     107ms ± 2%  -22.24%  (p=0.008 n=5+5)
ParseStress/sql0-16                  16.2µs ± 2%    13.0µs ± 2%  -19.70%  (p=0.008 n=5+5)
ParseStress/sql1-16                  51.1µs ± 1%    40.8µs ± 2%  -20.17%  (p=0.008 n=5+5)
Parse3/normal-16                     3.47ms ± 3%    1.74ms ± 2%  -49.67%  (p=0.008 n=5+5)
Parse3/escaped-16                    5.79ms ± 1%    5.83ms ± 1%     ~     (p=0.222 n=5+5)

name                               old alloc/op   new alloc/op   delta
ParseTraces/django_queries.txt-16     325kB ± 0%     211kB ± 0%  -35.07%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16       13.6MB ± 0%     9.6MB ± 0%  -29.33%  (p=0.008 n=5+5)
ParseStress/sql0-16                  1.65kB ± 0%    1.43kB ± 0%     ~     (p=0.079 n=4+5)
ParseStress/sql1-16                  5.57kB ± 0%    4.78kB ± 0%  -14.22%  (p=0.008 n=5+5)
Parse3/normal-16                     1.81MB ± 0%    0.00MB ± 0%  -99.83%  (p=0.008 n=5+5)
Parse3/escaped-16                    6.36MB ± 0%    5.23MB ± 0%  -17.80%  (p=0.008 n=5+5)

name                               old allocs/op  new allocs/op  delta
ParseTraces/django_queries.txt-16     8.80k ± 0%     3.28k ± 0%  -62.75%  (p=0.008 n=5+5)
ParseTraces/lobsters.sql.gz-16         364k ± 0%      154k ± 0%     ~     (p=0.079 n=4+5)
ParseStress/sql0-16                    41.0 ± 0%      24.0 ± 0%  -41.46%  (p=0.008 n=5+5)
ParseStress/sql1-16                     137 ± 0%        69 ± 0%  -49.64%  (p=0.008 n=5+5)
Parse3/normal-16                        112 ± 0%        52 ± 0%  -53.57%  (p=0.008 n=5+5)
Parse3/escaped-16                       336 ± 0%       294 ± 0%  -12.40%  (p=0.008 n=5+5)

yeeeah now that's some good shit. We're looking at 6 benchmarks, which I've improved for this particular PR. The baseline already includes the perfect table lookups PR from yesterday even though I haven't merged them yet. If we compare this branch against master it's another 5% even faster on average.

To note:

  • django_queries.txt is a sample trace of queries from a Django web application, so these are realistic production queries. This is a 25% performance win in a real-world scenario and this is a full AST parse benchmark, so we're scrapping 25% performance in the whole parser pipeline just from optimizing the tokenizer.
  • lobsters.sql.gz is a sample trace of queries from a real Rails application (https://lobste.rs), a Reddit-like site aggregator. I got it running locally and dumped the MySQL logs to get a real-world sample of ActiveRecord-generated SQL -- something we really want to be optimizing for. The result is a 22.24% improvement, almost as good as the Django queries. The reduction on generated garbage is massive however: 4MB (30%) less of heap allocations per benchmark run. This is going to have a significant impact when running Vitess with real-world Rails applications by reducing GC churn and GC times.
  • sql0 and sql1 are the two classic queries that the sqlparser package has always used for benchmarking. Nothing too interesting about these queries, they're synthetic and yet they get significantly faster.
  • Parse3/normal and escaped are pathological queries with huge SQL strings embedded in them. normal has plaintext strings and escaped is the worst-case-scenario where every single string has escape characters in them. The results in normal are outrageous because the zero-copy tokenizer is just returning huge chunks of the original input without allocating a single byte. 50% performance increase, and --of course-- a 99% reduction (from 1.81mb to a few bytes) in memory allocations per parse. The escaped case, which is truly pathological, manages to parse massive strings with escape sequences by lazy-copying them, so the result is just as fast as with the old implementation while reducing memory allocations significantly (this is thanks to the strings.Builder trick to skip double copies).

Overall, this is significantly more performance than what I thought would be possible to squeeze just from the tokenizer. I still have to tackle the actual yacc parser next week and can't wait to see what comes out of that.

Thanks to @GuptaManan100 for the original inspiration for this PR! Really happy we can land a version of #6898, particularly with the ergonomic improvements it implies for internal Vitess APIs, which now take strings instead of byte slices.

Related Issue(s)

Checklist

  • Should this PR be backported?
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

@vmg
Copy link
Collaborator Author

vmg commented Mar 5, 2021

Very confused about this error btw:

Error: go/vt/vtctl/workflow/vexec/query_planner.go:269:44: cannot use ([]byte)(params.DBName) (type []byte) as type string in argument to sqlparser.NewStrLiteral

this file is not on my local checkout of Vitess 👻

@rohit-nayak-ps
Copy link
Contributor

Error: go/vt/vtctl/workflow/vexec/query_planner.go:269:44: cannot use ([]byte)(params.DBName) (type []byte) as type string in argument to sqlparser.NewStrLiteral

this file is not on my local checkout of Vitess

You probably need to rebase with the latest master.

Copy link
Member

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for showing me the way senpai 🙇‍♂️

@sougou
Copy link
Contributor

sougou commented Mar 5, 2021

This is awesome!

@deepthi
Copy link
Member

deepthi commented Mar 5, 2021

🤯

vmg added 3 commits March 8, 2021 12:15
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
@vmg vmg force-pushed the vmg/zerocopy-token branch from e8adcc8 to e921494 Compare March 8, 2021 11:19
@vmg
Copy link
Collaborator Author

vmg commented Mar 8, 2021

Green! Ready for review.

}
}

func (tkn *Tokenizer) scanStringSlow(buffer *bytes2.Buffer, delim uint16, typ int) (int, []byte) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a few comments to the methods would be nice. it's not very easy to understand when we need to fallback on the slow methods, for example.

Copy link
Collaborator

@systay systay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I like.

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that comments are desired for many functions.

@vmg
Copy link
Collaborator Author

vmg commented Mar 10, 2021

Added comments for all the scanner functions which should make the whole tokenization process more obvious.

Signed-off-by: Vicent Marti <[email protected]>
@systay systay merged commit cf8fc59 into vitessio:master Mar 10, 2021
@vmg vmg mentioned this pull request Mar 12, 2021
8 tasks
@askdba askdba added this to the v10.0 milestone Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants