-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: sqlparser faster formatting #7710
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
vmg
requested review from
GuptaManan100,
harshit-gangal and
systay
as code owners
March 18, 2021 15:35
systay
reviewed
Mar 18, 2021
systay
reviewed
Mar 18, 2021
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
harshit-gangal
approved these changes
Mar 18, 2021
Signed-off-by: Vicent Marti <[email protected]>
systay
approved these changes
Mar 22, 2021
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. This is really, really cool.
The only thing I'm missing is a verify mode that we can use as a commit hook, but we can add that as an issue and someone else can work on that, if that saves you time & focus.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
Happy Friday! (national holiday tomorrow so it's Friday for me right now).
This week I'm bringing another important optimization to the
sqlparser
code. We're tackling the performance ofTrackedBuffer
, the data structure that lets us format SQL ASTs into their textual representation.The existing implementation for formatting AST nodes implements a
Format(buf *TrackedBuffer)
method on every AST nodes. Inside of these methods, the node serializes itself into theTrackedBuffer
by using a very helpfulTrackedBuffer.astPrintf
method. This lets us developers implement the formatting of all nodes in a very convenient way, because we can use printf-like syntax to generate the SQL output, but it has terrible performance implications.astPrintf
allocate; a printf interface likefunc (buf *TrackedBuffer) astPrintf(currentNode SQLNode, format string, values ...interface{})
must necessarily use variable arguments.varargs
are not cheap in Go, because they must be passed asinterface{}
, and we already know (or we learnt last week) that moving objects intointerface{}
allocates in most cases.astPrintf
must perform parsing of the input string, which is not free. This is not ideal because for a given node, it's format string is always the same and doesn't change between calls, andastPrintf
lose type information: when we callastPrintf
from one of ourSQLNode
structs, we obviously know the type of our node, and we know if this kind of node would need special semantic handling (e.g. whether it requires being wrapped with parens) when serializing it. When we pass it through aninterface{}
, this information is lost, and we need to typecast the interface to figure out this semantic information -- we're doing that for every singleastPrintf
call.astPrintf
cannot be inlined. AST serialization is a highly recursive operation, and there's a very significant amount of performance to be gained by inlining the recursive calls when serializing the fields of any given node. The Go compiler cannot inline throughinterface{}
callsites.So, how do we fix all this? These are all problems that could be trivially solved by verbosely and manually removing all calls to
astPrintf
and just writing the code to write directly into theTrackedBuffer
, without any format strings. This results in very fast code, but theFormat
functions for our SQL nodes become essentially a maintenance nightmare; the printf syntax is very convenient to make this code manageable.Because of this, I've come up with an alternative solution: a code rewriter that picks up all the formatting code for SQL nodes, finds all
astPrintf
calls, parses their static printf format strings, and statically replaces them with their decomposed forms. The resulting code is written into a separate method for every SQL node:Before rewrite:
After rewrite:
This is not a naive regexp replacement (I tried implementing that at first; didn't work out), it's a fully syntax and type aware rewriting which handles statically at compile time many of the calculations that
astPrintf
was doing at runtime, particularly when it comes to handling expression performance and grouping. The rewriter knows whether any of the fields in a node can contain expressions, and whether the expressions need special handling based on syntactic precedence:The resulting
formatFast
code is used transparently by default when users create aTrackedBuffer
without a custom formatter; the existingFormat
code is used when users use a custom formatter callback (the fast formatter doesn't support custom callbacks because they prevent inlining), so the system is fully backwards compatible.The results are very exciting. Inlining and removing allocations is a very significant optimization in Go:
More than twice as fast for our dataset of realistic queries. A normal
vtgate
query formats incoming queries into strings at least once (often several times), so I expect this to have a measurable impact in total query latency.Related Issue(s)
Checklist
Deployment Notes
Impacted Areas in Vitess
Components that this PR will affect: