evalengine: virtual machine #12369
Conversation
We already had CEIL() and this is very similar; in fact, it's a tiny bit simpler, since we don't have to do the add-one logic if divmod returns a non-zero remainder.
Since we already have CEIL(), FLOOR(), etc., let's add some more numeric operations to the eval engine.
Approving with the understanding that we'll continue working on this and not enable it until we feel safe doing so
This also fixes how we deal with boolean values, so we don't rewrite internal booleans on the stack and accidentally change what literal booleans mean.
I think the baseline we have here now is good. We need to iterate on this further, but right now it's not wired up to anything that executes it, so it's easier to keep iterating in smaller PRs.
So I think we should merge for now.
Description
In case you missed it, last week I introduced a new evaluation framework (in #12247) -- what could be called evalengine v3. That re-engineering had multiple goals. So, after two weeks of preparation work, I'm ready to show my big plan for evalengine performance.
The goal of this PR is to implement a Virtual Machine that can execute SQL expressions inside of Vitess in a very efficient way.
For those new to programming language design, there are roughly 3 ways to execute a dynamic language at runtime, in increasing order of complexity and performance: walking the AST directly (this is how the evalengine works right now!), compiling to bytecode for a virtual machine, and JIT compilation to native code.
Now you're probably thinking: does this make sense performance-wise? OK, maybe you're not thinking that. It's a rhetorical question with the aim of explaining the following intuition: SQL expressions are incredibly dynamic (when it comes to typing), very high level (when it comes to each primitive operation), and have very little control flow (when it comes to evaluation -- SQL expressions don't really loop, and conditionals are rare; their flow is always linear!). This could lead us to believe that there's no performance to be squeezed out of translating our AST-based evaluation engine into bytecode. The AST is already well suited for high-level operations and type-switching!
This is only superficially true. Lots of programming languages are highly dynamic and they manage to run in bytecode VMs much more efficiently than with an AST interpreter (Ruby's transition from its original AST interpreter in MRI to YARV comes to mind). What's the secret here?
Mostly, the secret is Efficient Interpretation using Quickening (Stefan Brunthaler) and variations of it. The idea is that dynamic code is very hard to execute efficiently, and the way to optimize it in practice is to rewrite the bytecode from generic instructions (e.g. a sum operator that needs to figure out the types of its two operands to know how to sum them) into specific static instructions that are specialized for the types they operate on (e.g. a sum operator that knows that both operands are integers).
To do that, a quickening VM needs to figure out at runtime the types of the expressions being evaluated and incrementally rewrite the bytecode into instructions that operate on them. This is hard! But we can take this idea even further, and make it both more performant and simpler: as of #12247, our evaluation engine knows how to deterministically type-check any SQL expression based on the types of its inputs! See where we're going with this? All we need are the fields of the underlying SQL database, and we get to compile any SQL expression into a highly specialized static form which doesn't need to type-switch on any of its arguments.
So that's what we're doing in this PR: we convert evalengine AST expressions into statically typed bytecode, which performs the same evaluation as the AST but with all the types known ahead of time. Yes, this results in very fast code. Graph time!
Raw benchmark data
Here we have a performance comparison of 5 different queries (ranging from very complex to very simple) between three implementations:
- the evalengine before evalengine: new evaluation framework #12247 was merged
- the evalengine as of right now (yes I did a very good job optimizing the AST evaluator, thanks for the kind feedback!)
- the compiled VM introduced in this PR
The results are stark: the pre-compiled SQL expressions, when run in the VM, are up to 20x faster than the original code, and most interestingly: as a side effect of the static typing, evaluating expressions in the VM does not allocate memory.
An efficient and maintainable Virtual Machine in Go
Implementing a VM usually involves a lot of complexity. You have to write a compiler that processes the input expression AST and generates the corresponding binary instructions (you have to come up with an encoding even!) and afterwards you have to implement the actual VM, which decodes each instruction and performs the corresponding operation. And you have to constantly keep these in sync!
Historically, a bytecode VM has always been implemented the same way: a big-ass switch statement. You decode an instruction, and switch on the type to jump to the operation that needs to be performed. This is how (in theory), bytecode VMs beat AST evaluators: because there are no recursive function calls; the program's execution happens linearly via jumps.
This design, however, has always had shortcomings. If you would like an in-depth explanation of these shortcomings, Mike Pall, the author of LuaJIT, elaborates more on this ML post. But allow me to summarize: Besides the fact that the VM's instructions need to be kept in-sync with the compiler, the actual performance of this main VM loop is not great in practice because compilers usually struggle when compiling massive functions. They spill registers all over the place on each arm of the switch, because it's hard to tell which arms are hot and which ones are cold. With all the pushing and popping, the jump into the switch's arm often looks more like a function call!
...And this applies to C compilers, by the way. It's safe to assume that these problems are the same when we implement the VM in Go, and in my testing, they are actually much worse because the Go compiler sucks. For starters, most of the time the different arms of the switch statement are jumped to via binary search instead of a jump table. Switch jump table optimization was implemented last year (https://go-review.googlesource.com/c/go/+/357330), but it is fiddly, and there's no way to enforce it. You have to carefully tweak the way the VM's instructions are encoded to ensure that the main loop actually jumps through a table.
So, if switch-based VM loops are not the state of the art, what is the state of the art for writing fast interpreters in Go? Well, it turns out that there's nobody doing fast interpreters in Go right now (at least nobody I can find). Most of the dynamic languages I've found implemented in Go have terrible performance. So we must innovate!
The most interesting approach for machines implemented in C or C++ is continuation-style evaluation loops, as seen in this report that implements this technique for parsing Protocol Buffers. This involves implementing all the opcodes for the VM as freestanding functions that operate on the VM, with the return value of each function being the next step of the computation. It does sound like something expensive and, uh, recursive, but the trick is that newer versions of LLVM allow us to mark functions as forcefully tail-called (see: https://en.wikipedia.org/wiki/Tail_call), so the resulting code is not recursively calling the VM loop but instead jumping between the operations, using the free-standing functions as an abstraction to control register placement and spillage.
Of course this is not something we can do in Go because, well, the Go compiler is allergic to optimization. It can sometimes emit tail calls, but it needs to be tickled in just the right way, and this is something that we cannot enforce at all in this implementation. This got me thinking: what if we have free-standing functions for each instruction, but instead of tail-calling, we forcefully return control to the evaluation loop after each one? If our compiled bytecode is not bytecode but instead a slice of function pointers to each instruction, this has many appealing properties.
For instance, the generated instruction that pushes a TEXT SQL object from the input rows onto the stack has both the offset in the input rows array and the collation for the text statically baked into the instruction!
Next Steps
Wrapping up: This is a novel design for a VM that is both extremely efficient ("almost JIT" thanks to the Go compiler) and very easy to extend and maintain, coupled with an architecture that uses incremental static typing of SQL to generate very efficient compiled programs that run natively in Go and evaluate arbitrary SQL expressions without dynamic memory allocations.
The code in this PR is still an early work in progress: I've only implemented support for all arithmetic operations (there are a surprising number of them!) and all comparison operators -- the bare minimum to be able to run meaningful benchmarks with complex SQL expressions and compare them with the old implementations.
My goal is to incrementally update this PR until the expression coverage is good enough that it makes sense to merge it. Since the compiled expressions can be run transparently instead of a normal AST evaluation, I would like to ship the optimizing compiler as an experimental feature in Vitess 17, to stabilize it into full coverage throughout Vitess 18, and then to eventually remove the existing AST evaluator in Vitess 19.
Ambitious? Maybe! I foresee some roadblocks, particularly regarding de-optimization, but I'll talk about those as I get closer to feature completeness.
Addendum: JIT Compilation
Inquiring minds may be wondering: what's next? Are we doing JIT compilation next? The answer is no. Although in theory this design for a compiler and VM looks like an exceptional starting point for implementing a full JIT compiler, in practice the trade-off between optimization and complexity doesn't make sense. JIT compilers are important for programming languages whose bytecode operations can be optimized down to a very low level of abstraction (e.g. where an "add" operator only has to perform a native x64 ADD). In those cases, the overhead of dispatching instructions becomes so dominant that replacing the VM's loop with a block of JITted code makes a significant performance difference. However, for SQL expressions, and even after our quickening pass, most of the operations remain extremely high level (things like "match this JSON object with a path" or "add two fixed-width decimals together"). The overhead of instruction dispatch, as measured in these benchmarks, is less than 20% (and can possibly be optimized further in the VM's loop). 20% is not the number you're targeting before you start fucking around with raw assembly for a JIT. So at this point my intuition is that JIT compilation would be a needlessly complex dead end.
Related Issue(s)
Also implements a number of missing functions from the evalengine listed in #9647