
evalengine: virtual machine #12369

Merged (42 commits, Mar 16, 2023)

Conversation

@vmg vmg commented Feb 14, 2023

Description

In case you missed it, I introduced a new evaluation framework (in #12247) last week -- what could be called evalengine v3. That re-engineering had multiple goals, namely:

  • Improve the correctness of the way we perform evaluations by making them non-lazy
  • Improve the performance of AST-based evaluation
  • Allow us to implement custom data types, enabling our new JSON implementation (done in #12274: evalengine: it's time for JSON!)
  • Finally, enable a static type-checking pass over the full SQL AST that brings us closer to compilation (this PR!)

So, after two weeks of preparation work, I'm ready to show my big plan for evalengine performance.

The goal of this PR is to implement a Virtual Machine that can execute SQL expressions inside Vitess very efficiently.

For those new to programming language design: there are roughly three ways to execute a dynamic language at runtime, in increasing order of complexity and performance:

  1. An AST-based evaluator, where the syntax of the language is parsed into an AST and evaluation is performed by recursively walking each node of the AST and computing the results. (this is the way the evalengine works right now!)
  2. A bytecode VM, where the AST is compiled into binary bytecode that can be evaluated by a virtual machine -- a piece of code that simulates a CPU, but with higher-level instructions. (this is what we're trying to do here!)
  3. A JIT compiler, in which the bytecode is compiled directly into the host platform's native instructions, so it can be executed directly by the CPU without being interpreted by a Virtual Machine. (we'll talk about this later!)

Now you're probably thinking: does this make sense performance-wise? OK, maybe you're not thinking that. It's a rhetorical question whose aim is to explain the following intuition: SQL expressions are incredibly dynamic (when it comes to typing), very high level (when it comes to each primitive operation), and have very little control flow (when it comes to evaluation -- SQL expressions don't really loop, and conditionals are rare; their flow is always linear!). This could lead us to believe that there's no performance to be squeezed out of translating our AST-based evaluation engine into bytecode. The AST is already well suited for high-level operations and type-switching!

This is only superficially true. Lots of programming languages are highly dynamic and they manage to run in bytecode VMs much more efficiently than with an AST interpreter (Ruby's transition from its original AST interpreter in MRI to YARV comes to mind). What's the secret here?

Mostly, the secret is Efficient Interpretation using Quickening (Stefan Brunthaler) and variations of it. The idea is that dynamic code is very hard to execute efficiently, and the way to optimize it in practice is to rewrite the bytecode from generic instructions (e.g. a sum operator that needs to figure out the types of its two operands to know how to sum them) into specific static instructions that are specialized for the types they operate on (e.g. a sum operator that knows both operands are integers).

To do that, a quickening VM needs to figure out at runtime the types of the expressions being evaluated and incrementally rewrite the bytecode into instructions that operate on those types directly. This is hard! But we can take the idea even further and make it both faster and simpler: as of #12247, our evaluation engine knows how to deterministically type-check any SQL expression based on the types of its inputs! See where we're going with this? All we need are the fields of the underlying SQL database, and we get to compile any SQL expression into a highly specialized static form that doesn't need to type-switch on any of its arguments.
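To make that concrete, here is a minimal sketch of the difference (illustrative only, with made-up helper functions -- this is not the evalengine's actual code):

// A generic ADD has to rediscover its operand types on every single evaluation.
func addGeneric(a, b any) (any, bool) {
	switch x := a.(type) {
	case int64:
		if y, ok := b.(int64); ok {
			return x + y, true
		}
	case float64:
		if y, ok := b.(float64); ok {
			return x + y, true
		}
	}
	return nil, false // type mismatch: a real engine would coerce or error here
}

// Its specialized form: the compiler has already proven that both operands
// are int64, so the type switch disappears entirely.
func addInt64(a, b int64) int64 {
	return a + b
}

Quickening rewrites the former into the latter at runtime; our static type-checking pass lets us emit the latter directly at compile time.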

So that's what we're doing in this PR: we convert evalengine AST expressions into statically typed bytecode, which performs the same evaluation as the AST but knows all the types ahead of time. Yes, this results in very fast code. Graph time!

[Graph: benchmark comparison of the old, ast, and vm evaluators across the five test expressions]

Raw benchmark data
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/vt/vtgate/evalengine
cpu: AMD Ryzen 7 2700X Eight-Core Processor         
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1407079               859.3 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1360402               861.7 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1367221               874.0 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7015935               154.5 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7830362               165.2 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7685808               150.4 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16         10571341               123.2 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16          9295422               117.6 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16         10396491               122.7 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          29770506                36.60 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          34390213                35.68 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          34351640                60.08 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16          9401823               126.0 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16         10032682               126.6 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16          9627816               125.8 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          33149958                35.51 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          33283934                34.86 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          34757577                35.18 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3904896               319.5 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3724894               315.4 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3638520               320.2 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4912896               230.3 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4803223               232.7 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4843416               232.4 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            4004112               290.4 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            3932679               297.2 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            4412119               268.7 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            16343871                63.32 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            18287685                61.96 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            19428633                66.09 ns/op            0 B/op          0 allocs/op
PASS
ok      vitess.io/vitess/go/vt/vtgate/evalengine        51.304s

Here we have a performance comparison of 5 different queries (ranging from very complex to very simple) between three implementations:

  1. old, which is the original evalengine before #12247 (evalengine: new evaluation framework) was merged
  2. ast, which is the evalengine as of right now (yes I did a very good job optimizing the AST evaluator, thanks for the kind feedback!)
  3. vm, which is the result of this PR.

The results are stark: the pre-compiled SQL expressions, when run in the VM, are up to 20x faster than the original code, and most interestingly, as a side effect of the static typing, evaluating expressions in the VM does not allocate memory.

An efficient and maintainable Virtual Machine in Go

Implementing a VM usually involves a lot of complexity. You have to write a compiler that processes the input expression AST and generates the corresponding binary instructions (you even have to come up with an encoding!), and then you have to implement the actual VM, which decodes each instruction and performs the corresponding operation. And you have to constantly keep the two in sync!

Historically, a bytecode VM has always been implemented the same way: a big-ass switch statement. You decode an instruction, then switch on its opcode to jump to the operation that needs to be performed. This is how (in theory) bytecode VMs beat AST evaluators: there are no recursive function calls; the program's execution happens linearly via jumps.
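For reference, such a loop looks roughly like this (a toy sketch with made-up opcodes, not code from this PR):

const (
	opPush byte = iota // push the next byte onto the stack as an integer
	opAdd              // pop two values, push their sum
	opHalt             // stop and return the top of the stack
)

func run(code []byte) int64 {
	var stack []int64
	ip := 0
	for {
		switch code[ip] { // decode, then jump to the matching arm
		case opPush:
			stack = append(stack, int64(code[ip+1]))
			ip += 2
		case opAdd:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
			ip++
		case opHalt:
			return stack[len(stack)-1]
		}
	}
}

// run([]byte{opPush, 2, opPush, 3, opAdd, opHalt}) returns 5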

This design, however, has always had shortcomings. If you would like an in-depth explanation, Mike Pall, the author of LuaJIT, elaborates on them in this ML post. But allow me to summarize: besides the fact that the VM's instructions need to be kept in sync with the compiler, the actual performance of the main VM loop is not great in practice, because compilers usually struggle with massive functions. They spill registers all over the place in each arm of the switch, because it's hard to tell which arms are hot and which are cold. With all the pushing and popping, the jump into a switch arm often looks more like a function call!

...And this applies to C compilers, by the way. It's safe to assume that these problems carry over when we implement the VM in Go, and in my testing they are actually much worse, because the Go compiler sucks. For starters, most of the time the different arms of the switch statement are reached via binary search instead of a jump table. Switch jump-table optimization was implemented last year (https://go-review.googlesource.com/c/go/+/357330), but it is fiddly and there's no way to enforce it: you have to carefully tweak the way the VM's instructions are encoded to ensure the main loop's dispatch actually compiles into a jump table.

So, if switch-based VM loops are not the state of the art, what is the state of the art for writing fast interpreters in Go? Well, it turns out that there's nobody doing fast interpreters in Go right now (at least nobody I can find). Most of the dynamic languages I've found implemented in Go have terrible performance. So we must innovate!

The most interesting approach for machines implemented in C or C++ is continuation-style evaluation loops, as seen in this report, which applies the technique to parsing Protocol Buffers. It involves implementing all of the VM's opcodes as freestanding functions that operate on the VM, with each function's return value being the next step of the computation. That sounds expensive and, huh, recursive, but the trick is that newer versions of LLVM allow us to mark functions as forcefully tail-called (see: https://en.wikipedia.org/wiki/Tail_call), so the resulting code doesn't recursively call into the VM loop; it jumps between the operations, using the freestanding functions as an abstraction to control register placement and spillage.

Of course this is not something we can do in Go because, well, the Go compiler is allergic to optimization. It can sometimes emit tail calls, but it needs to be tickled in just the right way, and that's nothing we can enforce in this implementation. Which got me thinking: what if we had freestanding functions for each instruction, but instead of tail-calling, we forcefully returned control to the evaluation loop after each one? If our compiled bytecode is not bytecode at all but a slice of function pointers, one per instruction, the design has many appealing properties:

  1. The VM becomes trivial! It's just 5 lines of code, and it doesn't have to worry about optimizing any large switch statements. It's just repeatedly calling functions one after the other!
func (vm *VirtualMachine) execute(p *Program) (eval, error) {
	code := p.code
	ip := 0

	for ip < len(code) {
		// Each instruction returns the offset to the next instruction.
		ip += code[ip](vm)
		if vm.err != nil {
			return nil, vm.err
		}
	}
	if vm.sp == 0 {
		return nil, nil
	}
	// The program's result is whatever sits on top of the stack.
	return vm.stack[vm.sp-1], nil
}
  2. The compiler becomes trivial too, because there is no bytecode! Instead, the compiler emits the individual instructions directly by pushing "callbacks" onto a slice. There are no instruction opcodes to keep track of, no encoding to perform, and nothing to keep in sync with the VM! Developing the compiler means developing the VM simultaneously!
func (c *compiler) emitPushNull() {
	c.ins = append(c.ins, func(vm *VirtualMachine) int {
		vm.stack[vm.sp] = nil
		vm.sp++
		return 1
	})
}
  3. ...but wait: if there's no instruction encoding, then we cannot have instructions with arguments. This is a bit of a showstopper... except it isn't, because the Go compiler supports closures! We can emit any instruction we want and the Go compiler will automatically capture its arguments inside the callback pointer. We don't have to think about how to encode our arguments, and in fact, our arguments can be as complex as they need to be: the resulting callback will contain a copy of them, created by the Go compiler. It's essentially a poor man's JIT, and it works amazingly well in practice, both performance-wise and ergonomically. Check out this compiler method, which generates an instruction to push a TEXT SQL value from the input row onto the stack:
func (c *compiler) emitPushColumn_text(offset int, col collations.TypedCollation) {
	c.ins = append(c.ins, func(vm *VirtualMachine) int {
		vm.stack[vm.sp] = newEvalText(vm.row[offset].Raw(), col)
		vm.sp++
		return 1
	})
}

Both the offset in the input rows array and the collation for the text are statically baked into the generated instruction!
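Putting the pieces together, usage could look roughly like this (a hypothetical sketch stitched from the snippets above; field names like Program.code, VirtualMachine.stack and VirtualMachine.row are assumptions based on those excerpts, not the PR's actual API):

// Compile: each emit call appends a closure with its arguments baked in.
c := &compiler{}
c.emitPushColumn_text(0, collations.TypedCollation{}) // column offset + collation captured at compile time
p := &Program{code: c.ins}

// Execute: assume the stack is pre-sized (the real compiler can compute
// the maximum stack depth while emitting instructions).
var vm VirtualMachine
vm.stack = make([]eval, 8)
vm.row = row // the input row being evaluated
result, err := vm.execute(p)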

Next Steps

Wrapping up: This is a novel design for a VM that is both extremely efficient ("almost JIT" thanks to the Go compiler) and very easy to extend and maintain, coupled with an architecture that uses incremental static typing of SQL to generate very efficient compiled programs that run natively in Go and evaluate arbitrary SQL expressions without dynamic memory allocations.

The code in this PR is still an early work in progress: I've only implemented support for all arithmetic operations (there are a surprising number of them!) and all comparison operators -- the bare minimum to be able to run meaningful benchmarks with complex SQL expressions and compare them with the old implementations.

My goal is to incrementally update this PR until the expression coverage is good enough that it makes sense to merge it. Since compiled expressions can be run transparently in place of normal AST evaluation, I would like to ship the optimizing compiler as an experimental feature in Vitess 17, stabilize it into full coverage throughout Vitess 18, and then eventually remove the existing AST evaluator in Vitess 19.

Ambitious? Maybe! I foresee some roadblocks, particularly regarding de-optimization, but I'll talk about those as I get closer to feature completeness.


Addendum: JIT Compilation

Inquiring minds may be wondering: what's next? Are we doing JIT compilation? The answer is no. Although this compiler-and-VM design looks, in theory, like an exceptional starting point for a full JIT compiler, in practice the trade-off between optimization and complexity doesn't make sense. JIT compilers matter for programming languages whose bytecode operations can be lowered to a very low level of abstraction (e.g. where an "add" operator only has to perform a native x64 ADD). In those cases, the overhead of dispatching instructions becomes so dominant that replacing the VM's loop with a block of JITted code makes a significant performance difference. For SQL expressions, however, and even after our quickening pass, most of the operations remain extremely high level (things like "match this JSON object against a path" or "add two fixed-width decimals together"). The overhead of instruction dispatch, as measured in these benchmarks, is less than 20% (and can possibly be optimized further in the VM's loop). 20% is not the number you're targeting before you start fucking around with raw assembly for a JIT. So at this point my intuition is that JIT compilation would be a needlessly complex dead-end optimization.

Related Issue(s)

Also implements a number of missing functions from the evalengine listed in #9647

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@vmg vmg added Component: Evalengine changes to the evaluation engine Skip CI Skip CI actions from running labels Feb 14, 2023
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 14, 2023
vitess-bot bot commented Feb 14, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test to explain what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@harshit-gangal harshit-gangal added the Benchmark me Add label to PR to run benchmarks label Feb 14, 2023
@vmg vmg force-pushed the vmg/eval-static branch 2 times, most recently from 4efaff6 to 5580566 on February 28, 2023 09:07
@vmg vmg removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says Skip CI Skip CI actions from running labels Feb 28, 2023
@dbussink dbussink added the Skip CI Skip CI actions from running label Feb 28, 2023
@GuptaManan100 GuptaManan100 removed the Benchmark me Add label to PR to run benchmarks label Feb 28, 2023
@GuptaManan100 (Member) commented:

Removing Benchmark me label because we want to run the benchmarks for v16.0.0. Please add it again when you want to run benchmarks.

@dbussink dbussink force-pushed the vmg/eval-static branch 3 times, most recently from 8d308d7 to 671a999 on March 3, 2023 14:05
@vmg vmg mentioned this pull request Mar 6, 2023
@vmg vmg force-pushed the vmg/eval-static branch 3 times, most recently from c44fea7 to 6a9928b on March 6, 2023 14:20
@vmg vmg force-pushed the vmg/eval-static branch 4 times, most recently from 8e8e8e2 to e36bf61 on March 9, 2023 15:52
@vmg vmg removed the Skip CI Skip CI actions from running label Mar 13, 2023
vmg and others added 10 commits March 16, 2023 14:47
We already had CEIL() and this is very similar; in fact, it's a tiny bit simpler since we don't have to do the add-one logic if divmod returns a non-zero remainder.

Signed-off-by: Dirkjan Bussink <[email protected]>
Since we already have CEIL(), FLOOR() etc., let's add some more numeric operations to the eval engine.

Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
@vmg vmg force-pushed the vmg/eval-static branch from 96877c7 to 5efd9f0 on March 16, 2023 13:47
Signed-off-by: Andres Taylor <[email protected]>
@vmg vmg requested a review from mattlord as a code owner March 16, 2023 15:46
@systay systay (Collaborator) left a comment

Approving with the understanding that we'll continue working on this and not enable it until we feel safe doing so

This also fixes how we deal with boolean values, so that we don't rewrite internal booleans on the stack and accidentally change what literal booleans mean.

Signed-off-by: Dirkjan Bussink <[email protected]>
@dbussink dbussink (Contributor) left a comment

I think the baseline we have here now is good. We need to iterate on this further, but right now it's not wired up to anything that executes it, so it's easier to iterate in smaller PRs.

So I think we should merge for now.

Labels
Component: Evalengine (changes to the evaluation engine), Type: Enhancement (logical improvement, somewhere between a bug and a feature)