
evalengine: virtual machine #12369

Merged (42 commits, Mar 16, 2023)

Conversation

@vmg vmg commented Feb 14, 2023

Description

In case you missed it, I introduced a new evaluation framework (in #12247) last week -- what could be called evalengine v3. That re-engineering had multiple goals, namely:

  • Improve the correctness of the way we perform evaluations by making them non-lazy
  • Improve the performance of AST-based evaluation
  • Allow us to implement custom data types, enabling our new JSON implementation (done in #12274: evalengine: it's time for JSON!)
  • Finally, enable a static type-checking pass over the full SQL AST that brings us closer to compilation (this PR!)

So, after two weeks of preparation work, I'm ready to show my big plan for evalengine performance.

The goal of this PR is to implement a Virtual Machine that can execute SQL expressions inside Vitess very efficiently.

For those new to programming language design: there are roughly three ways to execute a dynamic language at runtime, in increasing order of complexity and performance:

  1. An AST-based evaluator, where the syntax of the language is parsed into an AST and evaluation is performed by recursively walking each node of the AST and computing the results. (this is the way the evalengine works right now!)
  2. A bytecode VM, where the AST is compiled into binary bytecode that can be evaluated by a virtual machine -- a piece of code that simulates a CPU, but with higher-level instructions. (this is what we're trying to do here!)
  3. A JIT compiler, in which the bytecode is compiled directly into the host platform's native instructions, so it can be executed directly by the CPU without being interpreted by a Virtual Machine. (we'll talk about this later!)

Now you're probably thinking: does this make sense performance-wise? OK, maybe you're not thinking that. It's a rhetorical question whose aim is to explain the following intuition: SQL expressions are incredibly dynamic (when it comes to typing), very high level (when it comes to each primitive operation), and have very little control flow (when it comes to evaluation -- SQL expressions don't really loop, and conditionals are rare; their flow is always linear!). This could lead us to believe that there's no performance to be squeezed out of translating our AST-based evaluation engine into bytecode. The AST is already well suited for high-level operations and type-switching!

This is only superficially true. Lots of programming languages are highly dynamic and they manage to run in bytecode VMs much more efficiently than with an AST interpreter (Ruby's transition from its original AST interpreter in MRI to YARV comes to mind). What's the secret here?

Mostly, the secret is Efficient Interpretation using Quickening (Stefan Brunthaler) and variations of it. The idea is that dynamic code is very hard to execute efficiently, and the way to optimize it in practice is to rewrite the bytecode from generic instructions (e.g. a sum operator that needs to figure out the types of its two operands to know how to sum them) into specific static instructions that are specialized for the types they operate on (e.g. a sum operator that knows both operands are integers).

To do that, a quickening VM needs to figure out at runtime the types of the expressions being evaluated and incrementally rewrite the bytecode into instructions that operate on those types directly. This is hard! But we can take the idea even further and make it both faster and simpler: as of #12247, our evaluation engine knows how to deterministically type-check any SQL expression based on the types of its inputs! See where we're going with this? All we need are the fields of the underlying SQL database, and we get to compile any SQL expression into a highly specialized static form that doesn't need to type-switch on any of its arguments.
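To make that concrete, here is a minimal sketch of the difference (illustrative only, with made-up helper functions -- this is not the evalengine's actual code):

// A generic ADD has to rediscover its operand types on every single evaluation.
func addGeneric(a, b any) (any, bool) {
	switch x := a.(type) {
	case int64:
		if y, ok := b.(int64); ok {
			return x + y, true
		}
	case float64:
		if y, ok := b.(float64); ok {
			return x + y, true
		}
	}
	return nil, false // type mismatch: a real engine would coerce or error here
}

// Its specialized form: the compiler has already proven that both operands
// are int64, so the type switch disappears entirely.
func addInt64(a, b int64) int64 {
	return a + b
}

Quickening rewrites the former into the latter at runtime; our static type-checking pass lets us emit the latter directly at compile time.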

So that's what we're doing in this PR: we convert evalengine AST expressions into statically typed bytecode, which performs the same evaluation as the AST but knows all the types ahead of time. Yes, this results in very fast code. Graph time!

[Graph: benchmark comparison of the old, ast, and vm evaluators across the five test expressions]

Raw benchmark data
goos: linux
goarch: amd64
pkg: vitess.io/vitess/go/vt/vtgate/evalengine
cpu: AMD Ryzen 7 2700X Eight-Core Processor         
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1407079               859.3 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1360402               861.7 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=ast-16           1367221               874.0 ns/op            80 B/op          9 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7015935               154.5 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7830362               165.2 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/complex_arith/eval=vm-16            7685808               150.4 ns/op             0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16         10571341               123.2 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16          9295422               117.6 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=ast-16         10396491               122.7 ns/op             8 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          29770506                36.60 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          34390213                35.68 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_i64/eval=vm-16          34351640                60.08 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16          9401823               126.0 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16         10032682               126.6 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=ast-16          9627816               125.8 ns/op            16 B/op          1 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          33149958                35.51 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          33283934                34.86 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_u64/eval=vm-16          34757577                35.18 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3904896               319.5 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3724894               315.4 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=ast-16          3638520               320.2 ns/op            64 B/op          3 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4912896               230.3 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4803223               232.7 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_dec/eval=vm-16           4843416               232.4 ns/op            40 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            4004112               290.4 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            3932679               297.2 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=ast-16            4412119               268.7 ns/op            16 B/op          2 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            16343871                63.32 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            18287685                61.96 ns/op            0 B/op          0 allocs/op
BenchmarkCompilerExpressions/comparison_f/eval=vm-16            19428633                66.09 ns/op            0 B/op          0 allocs/op
PASS
ok      vitess.io/vitess/go/vt/vtgate/evalengine        51.304s

Here we have a performance comparison of 5 different queries (ranging from very complex to very simple) between three implementations:

  1. old, which is the original evalengine before #12247 (evalengine: new evaluation framework) was merged
  2. ast, which is the evalengine as of right now (yes I did a very good job optimizing the AST evaluator, thanks for the kind feedback!)
  3. vm, which is the result of this PR.

The results are stark: the pre-compiled SQL expressions, when run in the VM, are up to 20x faster than the original code, and most interestingly, as a side effect of the static typing, evaluating expressions in the VM does not allocate memory.

An efficient and maintainable Virtual Machine in Go

Implementing a VM usually involves a lot of complexity. You have to write a compiler that processes the input expression AST and generates the corresponding binary instructions (you even have to come up with an encoding!), and then you have to implement the actual VM, which decodes each instruction and performs the corresponding operation. And you have to constantly keep the two in sync!

Historically, a bytecode VM has always been implemented the same way: a big-ass switch statement. You decode an instruction, then switch on its opcode to jump to the operation that needs to be performed. This is how (in theory) bytecode VMs beat AST evaluators: there are no recursive function calls; the program's execution happens linearly via jumps.
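For reference, such a loop looks roughly like this (a toy sketch with made-up opcodes, not code from this PR):

const (
	opPush byte = iota // push the next byte onto the stack as an integer
	opAdd              // pop two values, push their sum
	opHalt             // stop and return the top of the stack
)

func run(code []byte) int64 {
	var stack []int64
	ip := 0
	for {
		switch code[ip] { // decode, then jump to the matching arm
		case opPush:
			stack = append(stack, int64(code[ip+1]))
			ip += 2
		case opAdd:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
			ip++
		case opHalt:
			return stack[len(stack)-1]
		}
	}
}

// run([]byte{opPush, 2, opPush, 3, opAdd, opHalt}) returns 5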

This design, however, has always had shortcomings. If you would like an in-depth explanation, Mike Pall, the author of LuaJIT, elaborates on them in this ML post. But allow me to summarize: besides the fact that the VM's instructions need to be kept in sync with the compiler, the actual performance of the main VM loop is not great in practice, because compilers usually struggle with massive functions. They spill registers all over the place in each arm of the switch, because it's hard to tell which arms are hot and which are cold. With all the pushing and popping, the jump into a switch arm often looks more like a function call!

...And this applies to C compilers, by the way. It's safe to assume that these problems carry over when we implement the VM in Go, and in my testing they are actually much worse, because the Go compiler sucks. For starters, most of the time the different arms of the switch statement are reached via binary search instead of a jump table. Switch jump-table optimization was implemented last year (https://go-review.googlesource.com/c/go/+/357330), but it is fiddly and there's no way to enforce it: you have to carefully tweak the way the VM's instructions are encoded to ensure the main loop's dispatch actually compiles into a jump table.

So, if switch-based VM loops are not the state of the art, what is the state of the art for writing fast interpreters in Go? Well, it turns out that there's nobody doing fast interpreters in Go right now (at least nobody I can find). Most of the dynamic languages I've found implemented in Go have terrible performance. So we must innovate!

The most interesting approach for machines implemented in C or C++ is continuation-style evaluation loops, as seen in this report, which applies the technique to parsing Protocol Buffers. It involves implementing all of the VM's opcodes as freestanding functions that operate on the VM, with each function's return value being the next step of the computation. That sounds expensive and, huh, recursive, but the trick is that newer versions of LLVM allow us to mark functions as forcefully tail-called (see: https://en.wikipedia.org/wiki/Tail_call), so the resulting code doesn't recursively call into the VM loop; it jumps between the operations, using the freestanding functions as an abstraction to control register placement and spillage.

Of course this is not something we can do in Go because, well, the Go compiler is allergic to optimization. It can sometimes emit tail calls, but it needs to be tickled in just the right way, and that's nothing we can enforce in this implementation. Which got me thinking: what if we had freestanding functions for each instruction, but instead of tail-calling, we forcefully returned control to the evaluation loop after each one? If our compiled bytecode is not bytecode at all but a slice of function pointers, one per instruction, the design has many appealing properties:

  1. The VM becomes trivial! It's just 5 lines of code, and it doesn't have to worry about optimizing any large switch statements. It's just repeatedly calling functions one after the other!
func (vm *VirtualMachine) execute(p *Program) (eval, error) {
	code := p.code
	ip := 0

	for ip < len(code) {
		// Each instruction returns the offset to the next instruction.
		ip += code[ip](vm)
		if vm.err != nil {
			return nil, vm.err
		}
	}
	if vm.sp == 0 {
		return nil, nil
	}
	// The program's result is whatever sits on top of the stack.
	return vm.stack[vm.sp-1], nil
}
  2. The compiler becomes trivial too, because there is no bytecode! Instead, the compiler emits the individual instructions directly by pushing "callbacks" onto a slice. There are no instruction opcodes to keep track of, no encoding to perform, and nothing to keep in sync with the VM! Developing the compiler means developing the VM simultaneously!
func (c *compiler) emitPushNull() {
	c.ins = append(c.ins, func(vm *VirtualMachine) int {
		vm.stack[vm.sp] = nil
		vm.sp++
		return 1
	})
}
  3. ...but wait: if there's no instruction encoding, then we cannot have instructions with arguments. This is a bit of a showstopper... except it isn't, because the Go compiler supports closures! We can emit any instruction we want and the Go compiler will automatically capture its arguments inside the callback pointer. We don't have to think about how to encode our arguments, and in fact, our arguments can be as complex as they need to be: the resulting callback will contain a copy of them, created by the Go compiler. It's essentially a poor man's JIT, and it works amazingly well in practice, both performance-wise and ergonomically. Check out this compiler method, which generates an instruction to push a TEXT SQL value from the input row onto the stack:
func (c *compiler) emitPushColumn_text(offset int, col collations.TypedCollation) {
	c.ins = append(c.ins, func(vm *VirtualMachine) int {
		vm.stack[vm.sp] = newEvalText(vm.row[offset].Raw(), col)
		vm.sp++
		return 1
	})
}

Both the offset in the input rows array and the collation for the text are statically baked into the generated instruction!
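Putting the pieces together, usage could look roughly like this (a hypothetical sketch stitched from the snippets above; field names like Program.code, VirtualMachine.stack and VirtualMachine.row are assumptions based on those excerpts, not the PR's actual API):

// Compile: each emit call appends a closure with its arguments baked in.
c := &compiler{}
c.emitPushColumn_text(0, collations.TypedCollation{}) // column offset + collation captured at compile time
p := &Program{code: c.ins}

// Execute: assume the stack is pre-sized (the real compiler can compute
// the maximum stack depth while emitting instructions).
var vm VirtualMachine
vm.stack = make([]eval, 8)
vm.row = row // the input row being evaluated
result, err := vm.execute(p)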

Next Steps

Wrapping up: This is a novel design for a VM that is both extremely efficient ("almost JIT" thanks to the Go compiler) and very easy to extend and maintain, coupled with an architecture that uses incremental static typing of SQL to generate very efficient compiled programs that run natively in Go and evaluate arbitrary SQL expressions without dynamic memory allocations.

The code in this PR is still an early work in progress: I've only implemented support for all arithmetic operations (there are a surprising number of them!) and all comparison operators -- the bare minimum to be able to run meaningful benchmarks with complex SQL expressions and compare them with the old implementations.

My goal is to incrementally update this PR until the expression coverage is good enough that it makes sense to merge it. Since compiled expressions can be run transparently in place of normal AST evaluation, I would like to ship the optimizing compiler as an experimental feature in Vitess 17, stabilize it into full coverage throughout Vitess 18, and then eventually remove the existing AST evaluator in Vitess 19.

Ambitious? Maybe! I foresee some roadblocks, particularly regarding de-optimization, but I'll talk about those as I get closer to feature completeness.


Addendum: JIT Compilation

Inquiring minds may be wondering: what's next? Are we doing JIT compilation? The answer is no. Although this compiler-and-VM design looks, in theory, like an exceptional starting point for a full JIT compiler, in practice the trade-off between optimization and complexity doesn't make sense. JIT compilers matter for programming languages whose bytecode operations can be lowered to a very low level of abstraction (e.g. where an "add" operator only has to perform a native x64 ADD). In those cases, the overhead of dispatching instructions becomes so dominant that replacing the VM's loop with a block of JITted code makes a significant performance difference. For SQL expressions, however, and even after our quickening pass, most of the operations remain extremely high level (things like "match this JSON object against a path" or "add two fixed-width decimals together"). The overhead of instruction dispatch, as measured in these benchmarks, is less than 20% (and can possibly be optimized further in the VM's loop). 20% is not the number you're targeting before you start fucking around with raw assembly for a JIT. So at this point my intuition is that JIT compilation would be a needlessly complex dead-end optimization.

Related Issue(s)

Also implements a number of missing functions from the evalengine listed in #9647

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@vmg vmg added Component: Evalengine changes to the evaluation engine Skip CI Skip CI actions from running labels Feb 14, 2023
@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says labels Feb 14, 2023
vitess-bot bot commented Feb 14, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a test is added or modified, there should be documentation on top of the test to explain what the expected behavior is and what the test does.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@harshit-gangal harshit-gangal added the Benchmark me Add label to PR to run benchmarks label Feb 14, 2023
@vmg vmg force-pushed the vmg/eval-static branch 2 times, most recently from 4efaff6 to 5580566 on February 28, 2023 09:07
@vmg vmg removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says Skip CI Skip CI actions from running labels Feb 28, 2023
@dbussink dbussink added the Skip CI Skip CI actions from running label Feb 28, 2023
@GuptaManan100 GuptaManan100 removed the Benchmark me Add label to PR to run benchmarks label Feb 28, 2023
@GuptaManan100 (Member) commented:

Removing Benchmark me label because we want to run the benchmarks for v16.0.0. Please add it again when you want to run benchmarks.

@dbussink dbussink force-pushed the vmg/eval-static branch 3 times, most recently from 8d308d7 to 671a999 on March 3, 2023 14:05
@vmg vmg mentioned this pull request Mar 6, 2023
@vmg vmg force-pushed the vmg/eval-static branch 3 times, most recently from c44fea7 to 6a9928b on March 6, 2023 14:20
@vmg vmg force-pushed the vmg/eval-static branch 4 times, most recently from 8e8e8e2 to e36bf61 on March 9, 2023 15:52
@vmg vmg removed the Skip CI Skip CI actions from running label Mar 13, 2023
vmg and others added 10 commits March 16, 2023 14:47
We already had CEIL() and this is very similar; in fact, it's a tiny bit simpler since we don't have to do the add-one logic if divmod returns a non-zero remainder.

Signed-off-by: Dirkjan Bussink <[email protected]>
Since we already have CEIL(), FLOOR() etc., let's add some more numeric operations to the eval engine.

Signed-off-by: Dirkjan Bussink <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
@vmg vmg force-pushed the vmg/eval-static branch from 96877c7 to 5efd9f0 on March 16, 2023 13:47
Signed-off-by: Andres Taylor <[email protected]>
@vmg vmg requested a review from mattlord as a code owner March 16, 2023 15:46
@systay systay (Collaborator) left a comment

Approving with the understanding that we'll continue working on this and not enable it until we feel safe doing so

This also fixes how we deal with boolean values, so that we don't rewrite internal booleans on the stack and accidentally change what literal booleans mean.

Signed-off-by: Dirkjan Bussink <[email protected]>
@dbussink dbussink (Contributor) left a comment

I think the baseline we have here now is good. We need to iterate on this further, but right now it's not wired up to anything that executes it, so it's easier to iterate in smaller PRs.

So I think we should merge for now.

Labels
Component: Evalengine (changes to the evaluation engine), Type: Enhancement (logical improvement, somewhere between a bug and a feature)