Parallel Intermediate Node Fetching (for a single trie) #28266

Open
aaronbuchwald opened this issue Oct 6, 2023 · 8 comments

@aaronbuchwald
Contributor

Rationale

Currently, when go-ethereum builds/executes a block it processes transactions in the EVM (while using the O(1) snapshot to perform DB reads as needed) and then needs to apply all of the state changes to the trie and compute a new state root.

Typically, the state shuffling is significantly more expensive than the actual EVM processing time (although for the new path-based scheme with enough RAM this is close to no longer being the case: https://twitter.com/peter_szilagyi/status/1708013385558671662).

When executing transactions in the EVM, the stateDB uses prefetching logic to attempt to pre-warm the trie's cache with all of the intermediate nodes that will need to be shuffled around when it's time to commit the statedb and compute a new state root. If the intermediate nodes are not already in memory, then this can result in a large number of DB reads which can take up a large amount of the time spent on state shuffling.

For large storage tries, prefetching performance can be much lower, since the trie implementation is not safe for concurrent use (https://github.com/ethereum/go-ethereum/blob/v1.13.1/trie/trie.go#L37). Although each storage trie gets its own prefetching goroutine (https://github.com/ethereum/go-ethereum/blob/master/core/state/trie_prefetcher.go#L153), if the majority of the workload comes from a single storage trie, then prefetching can be extremely inefficient.

Ideally, we can parallelize the DB reads to cache all of the intermediate nodes in memory prior to committing the statedb.

Implementation

There are two different approaches that I've played around with and I'm happy to complete either implementation. I wanted to put this issue up first to get feedback on whether this change makes sense to the geth team and if so, which implementation would be preferred.

The difficult part is that the trie implementation is not safe for concurrent use, and refactoring it to support concurrent operations from an external caller would be a very non-trivial rewrite. It is much better to work around that problem entirely.

Fetch in Parallel Using Independent Tries

Each individual trie is not safe for concurrent use, but we can instead create multiple instances of the same trie in order to fetch all of the requested keys in parallel and pull all of the necessary intermediate trie nodes from disk into the trie database's cache.

Although the instance of the trie held by the statedb may not end up with all of its nodes fully expanded, this pass would pull all of the intermediate nodes into the trie database's cache, so that when it comes time to commit the trie, each trie node is already in memory.

I put up a rough implementation of what it would look like to add this into updateTrie within each state object here: master...aaronbuchwald:go-ethereum:parallel-fetch-update-trie. It may make more sense to move this to the prefetcher code, but this was much simpler as a proof of concept.
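
For illustration, here is a minimal sketch of the copies approach. This is not the linked PoC: the `Trie` interface, the `GetStorage` signature, and the round-robin key partitioning below are simplified assumptions.

```go
package prefetch

import "sync"

// Trie is a simplified stand-in for the real trie type: Copy returns an
// independent instance backed by the same trie database, and GetStorage
// resolves a storage key, caching intermediate nodes as it goes.
type Trie interface {
	Copy() Trie
	GetStorage(addr [20]byte, key []byte) ([]byte, error)
}

// PrefetchParallel fans the requested keys out over `workers` independent
// copies of the trie. Each copy walks its share of the keys, pulling the
// intermediate nodes it touches into the shared trie database cache.
func PrefetchParallel(t Trie, addr [20]byte, keys [][]byte, workers int) {
	if workers < 1 {
		workers = 1
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(offset int) {
			defer wg.Done()
			tr := t.Copy() // cheap: the root node is shared, not deep-copied
			for j := offset; j < len(keys); j += workers {
				_, _ = tr.GetStorage(addr, keys[j]) // value ignored; only the caching side effect matters
			}
		}(i)
	}
	wg.Wait()
}
```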

Support BatchCacheKeys / BatchPut Internally to the Trie

Much easier than parallelizing the trie for external callers is to support parallel operations internal to the trie!

Instead of re-writing the trie to support concurrency, we can construct a trie out of the get/put operations that we want to apply.

For BatchCacheKeys, we can construct a trie from the requested keys and arbitrary non-empty values (as required by the current trie code since an empty value is treated as a delete).

Then we can traverse the actual trie and our batchTrie. By traversing both tries, we can serve all of the requests encoded in the batch trie and parallelize different sub tries that do not depend on each other.

For example:

t:

```mermaid
graph TD;
r-->0;
r-->1;
0-->A;
1-->B;
```

t':

```mermaid
graph TD;
r'-->0;
r'-->1;
0-->A';
1-->B';
```

In this case, we traverse r and r' and see they both have children at nibbles 0 and 1. Therefore, we can apply A' to A and B' to B in parallel since they are completely independent. Once both sub tries have completed, we can apply the resulting update to get r''.

The same logic can be applied to either BatchGet or BatchPut operations to either warm the cache or directly parallelize the put operation.
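
As a rough sketch of that traversal, heavily simplified: only full/branch nodes are modeled below, whereas the real trie also has short, hash, and value node types, and the goroutine fan-out would need a limit in practice.

```go
package prefetch

import "sync"

// branch is a simplified trie node with 16 children; in the batch trie a
// leaf marks a requested key, in the real trie that position holds the value.
type branch struct {
	children [16]*branch
	leaf     bool
}

// batchGet walks the real trie t and the batch trie b in lockstep. Wherever
// both have a child at the same nibble, those two sub-tries are independent
// of every other nibble and can be resolved concurrently.
func batchGet(t, b *branch, resolve func(*branch)) {
	if t == nil || b == nil {
		return
	}
	if b.leaf {
		resolve(t) // serve the request encoded at this position of the batch trie
		return
	}
	var wg sync.WaitGroup
	for i := range b.children {
		if b.children[i] == nil || t.children[i] == nil {
			continue
		}
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			batchGet(t.children[i], b.children[i], resolve)
		}(i)
	}
	wg.Wait()
}
```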

I have a WIP for this implementation here: https://github.com/aaronbuchwald/go-ethereum/blob/trie-batch/trie/trie_parallel.go#L43. Since there are 4 different node types to deal with, my current implementation is much more complex than I'd like, and I am still running into a bug in my fuzz test when dealing with values of length 0 (which should be treated as deletes, but still pass the test unless I'm missing something).

At least as a first change, I think the first solution is much better: it's significantly simpler, and since DB reads will be the dominating factor, it should perform very similarly. Also, huge thanks to @dboehm-avalabs for coming up with this much simpler approach here: ava-labs/avalanchego#2128.

@karalabe
Member

karalabe commented Oct 6, 2023

I'm wondering about a 3rd option :trollface:

My main concern with the first option is that it's kind of racey. By that I mean that if I have, say, 2 tries and want to expand sibling leaves, then the entire path will be shared, apart from the last branching. If these paths are expanded - timewise - apart from each other, then one will warm up the cache and the second might pull it in, true. However, since execution can throw these at the prefetcher fairly quickly, you might end up with both actually working in parallel, both reading from the DB. It would depend on the db internals as to how well it can short-circuit reads for the same keys, but I'm a bit uncomfortable leaving it at that low a level.

Now, with the other case of doing batch expansions, my concern is that we don't know how many accesses we need to wait for or what the timing will be between any two of them. If I wait until the EVM execution is done, we might have wasted precious time in which we could have been busy pulling stuff from disk. It gets a bit hard to fine-tune and reason about.

My preferred solution would be if we could have thread-safe prefetching within the trie, so that I can throw as many concurrent loads at it as I want and it would synchronise internally. Still, there are a few catches: I'm unsure what the synchronisation cost would be at a node level, possibly big. And the other thing we want to avoid is starving the main EVM execution because we're hitting the prefetching on gazillions of threads. Limiting the thread count per trie would get ugly fast, and limiting it across all prefetching tries would get ugly even faster :)

IMO, the solution here is probably to try and implement a variety of options and benchmark them one against the other. There is no clear winner either complexity or runtime wise, so I can't really say "go with X".... X might suck. My proposal would be to implement - at least up to a benchmarkable state - all variations and see how they compare against the current state.

@aaronbuchwald
Contributor Author

aaronbuchwald commented Oct 6, 2023

Thanks for the input!

I'd think that the DB would handle concurrent reads for the same node around the same time reasonably well, but I could definitely be wrong, and it's hard to say without a benchmark.

Really appreciate the input, I'll work on taking this further and getting each to a benchmarkable state.

Do you typically use the benchmarks in https://github.com/ethereum/go-ethereum/blob/master/core/blockchain_test.go or do you have a strong preference for bootstrapping to compare performance?

@ameya-deshmukh

@karalabe is this open to take up? Would love to take a stab at this.

@aaronbuchwald
Contributor Author

WIP for this is here: https://github.com/ethereum/go-ethereum/compare/master...aaronbuchwald:go-ethereum:trie-batch-strategies?expand=1.

I'm working on setting up an EC2 instance to test this out against a large leveldb instance (instead of mocking iowait with sleeps or using a very small leveldb instance).

After benchmarking the trie implementation in isolation, I'm going to:

  1. refactor the trie prefetcher to use any of the batched prefetch functions
  2. benchmark each batch prefetch function used in the trie prefetcher
  3. benchmark using the parallel prefetcher prior to large storage trie writes

@ameya-deshmukh lmk if you're interested in collaborating. A review of the existing work would be much appreciated (I think there are probably still bugs in BatchGet, but I reached a point where the fuzz test struggles to find a new issue). You could also take a stab at implementing a BatchPut operation that follows the same style as BatchGet to perform a batch of put operations in parallel internal to the trie.

If we have BatchPut, then we can use the prefetcher prior to committing the trie and perform BatchPut when it's time to commit so that any remaining work is performed in parallel.
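
To make that concrete, here is a sketch of what the commit path could look like with such an operation. BatchPut does not exist upstream; the BatchTrie interface and names below are hypothetical.

```go
package prefetch

// BatchTrie is a hypothetical interface: BatchPut is the proposed parallel
// write operation, Hash stands in for the existing root-hash computation.
type BatchTrie interface {
	BatchPut(keys, values [][]byte) error
	Hash() [32]byte
}

// commitStorage applies all pending storage updates as one parallel batch
// instead of key-by-key puts, then recomputes the root hash.
func commitStorage(tr BatchTrie, updates map[string][]byte) ([32]byte, error) {
	keys := make([][]byte, 0, len(updates))
	vals := make([][]byte, 0, len(updates))
	for k, v := range updates {
		keys = append(keys, []byte(k))
		vals = append(vals, v)
	}
	if err := tr.BatchPut(keys, vals); err != nil {
		return [32]byte{}, err
	}
	return tr.Hash(), nil
}
```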

@aaronbuchwald
Contributor Author

Updated the PR with some benchmarks that are isolated to the trie code.

The trie DB does not support writing trie nodes to disk except for the normal access pattern, so I implemented some mocked node readers.
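
For context, a mocked node reader of this kind can be little more than a thin wrapper around a key-value backend that counts reads; this is an illustrative sketch only, not the actual benchmark harness.

```go
package prefetch

import "sync/atomic"

// kvBackend is any key-value store the benchmark reads trie nodes from
// (for example, a database opened on the test disk).
type kvBackend interface {
	Get(key []byte) ([]byte, error)
}

// mockNodeReader resolves trie nodes directly from the backend and counts
// reads so the benchmark can report reads/op. Hash-based vs. path-based
// schemes only differ in how the database key is derived.
type mockNodeReader struct {
	db    kvBackend
	reads atomic.Int64
}

func (m *mockNodeReader) node(dbKey []byte) ([]byte, error) {
	m.reads.Add(1)
	return m.db.Get(dbKey)
}
```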

For the hash-based scheme with leveldb on a 50GB disk, I'm seeing:

BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=10000-NumOps=10000-Parallelism=100-8         	       1	  15510834 ns/op	      1004 reads/op	 200520869 setup/op	17771864 B/op	  148579 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=10000-NumOps=10000-Parallelism=100-8   	       1	  20603739 ns/op	      1019 reads/op	 149537223 setup/op	32820912 B/op	  397762 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=10000-NumOps=10000-8               	       1	   7892993 ns/op	       973.0 reads/op	1242217518 setup/op	 3039272 B/op	   46459 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=1000000-NumOps=10000-Parallelism=100-8       	       1	  78410055 ns/op	     13437 reads/op	3166969154 setup/op	72700384 B/op	  789398 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=1000000-NumOps=10000-Parallelism=100-8 	       1	 106937484 ns/op	     13460 reads/op	 550216983 setup/op	120255304 B/op	 1250465 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=1000000-NumOps=10000-8             	       1	 342994046 ns/op	     13363 reads/op	3332154309 setup/op	77223752 B/op	  778211 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=10000000-NumOps=10000-Parallelism=100-8      	       1	 301627975 ns/op	     21315 reads/op	14254487217 setup/op	151359424 B/op	 1814877 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=10000000-NumOps=10000-Parallelism=100-8          1	 357851357 ns/op	     21402 reads/op	18521055157 setup/op	199365208 B/op	 2571689 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=10000000-NumOps=10000-8                      1	5212844952 ns/op	     21268 reads/op	17684188933 setup/op	441997576 B/op	 4154303 allocs/op

For a mock path-based scheme (trie nodes are looked up by path, not actually using the pathdb implementation):

BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=10000-NumOps=10000-Parallelism=100-8         	       1	  17083787 ns/op	       990.0 reads/op	1294497008 setup/op	18927160 B/op	  149373 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=10000-NumOps=10000-Parallelism=100-8   	       1	  14954369 ns/op	       996.0 reads/op	1280698162 setup/op	29083600 B/op	  330451 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=10000-NumOps=10000-8               	       1	  10130656 ns/op	       954.0 reads/op	1231005866 setup/op	 5074472 B/op	   46426 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=1000000-NumOps=10000-Parallelism=100-8       	       1	  52354034 ns/op	     13319 reads/op	3329864425 setup/op	59983688 B/op	  663393 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=1000000-NumOps=10000-Parallelism=100-8 	       1	  88611409 ns/op	     13585 reads/op	 601209565 setup/op	104368504 B/op	 1114752 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=1000000-NumOps=10000-8             	       1	 292722501 ns/op	     13433 reads/op	 599777794 setup/op	87862248 B/op	  883739 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetch-TrieSize=10000000-NumOps=10000-Parallelism=100-8      	       1	 111618086 ns/op	     21315 reads/op	8705843886 setup/op	117762184 B/op	 1367677 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchCopies-TrieSize=10000000-NumOps=10000-Parallelism=100-8        1	 214599153 ns/op	     21430 reads/op	7060706413 setup/op	191390792 B/op	 2300811 allocs/op
BenchmarkBatchPrefetchStrategies/BatchPrefetchSequential-TrieSize=10000000-NumOps=10000-8                    1	 531480799 ns/op	     21255 reads/op	8568405414 setup/op	154002872 B/op	 1605778 allocs/op

For the smaller trie sizes (unfortunately, the trie size is not known to the state prefetcher or during commit), parallelism ends up slowing down the benchmark, while for larger tries it can be a 10x performance improvement.

I also added a benchmark for constructing the internal trie used to parallelize the BatchPrefetch operation; for 10k operations this takes 10ms, which is larger than the performance difference between BatchPrefetch and BatchPrefetchSequential. In other words, the setup cost is high and results in an overall decrease in performance for a small trie with a large number of operations, but the parallel execution is still slightly faster than the sequential execution.

I'm also noticing that the BatchPrefetch implementation is slightly faster than BatchPrefetchWithCopies, but not by an order of magnitude. Since BatchPrefetch is significantly more complex, BatchPrefetchWithCopies is probably the better first pass imo.

@ilikesymmetry

Noob question here, but for the "Fetch in Parallel Using Independent Tries" first option mentioned, how fast is copying the entire trie in this line? A full copy feels more expensive than a single storage traversal, so I'm curious how this copying works underneath to stay fast.

@aaronbuchwald
Copy link
Contributor Author

> Noob question here, but for the "Fetch in Parallel Using Independent Tries" first option mentioned, how fast is copying the entire trie in this line? A full copy feels more expensive than a single storage traversal, so I'm curious how this copying works underneath to stay fast.

Great question. This approach trades off simplicity for some wasted work/resources.

The copies approach is much simpler than changing the internals of the trie to support parallel operations. We can just take copies of the trie, make no changes to the internals, and run worker threads to fetch the required keys and ensure that they pull the required trie nodes into the cache of the trie database.

If you look at BatchPrefetch, it's unfortunately much more difficult to be confident in its correctness; it requires a decent amount of modification to the trie code and would be more difficult to maintain.

BatchPrefetchWithCopies creates duplicate tries, but the copy operation is very cheap when performed on a fresh trie:

https://github.com/aaronbuchwald/go-ethereum/blob/parallel-fetch-update-trie/core/state/database.go#L190-L197

https://github.com/aaronbuchwald/go-ethereum/blob/parallel-fetch-update-trie/trie/trie.go#L65

The tracer copy operation copies all of the previously read values, so we need to be sure to use a fresh trie so that the tracer is empty.

This copies the immutable root node (changes to the trie replace the root node instead of modifying it), so the copy itself is fairly cheap.

We do pay some additional memory allocation cost here, since prefetching across multiple tries may result in fetching identical intermediate nodes more than once. In other words, if keys A and B are divided onto two different trie copies and their paths through the trie share intermediate nodes, then both trie copies will fetch the same intermediate nodes. If we wanted to mitigate this waste, we could try to sort the keys and divide them lexicographically, so that keys likely to share the same intermediate nodes are assigned to the same trie copy. Unfortunately, hashing occurs inside the trie, not in the statedb, so it's hard to handle that from here.
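
As an illustration of that mitigation, the partitioning could look roughly like this (a hypothetical helper; as noted, the secure-key hashing happens inside the trie, so the statedb does not actually see trie-path keys):

```go
package prefetch

import (
	"bytes"
	"sort"
)

// partitionKeys sorts keys lexicographically and splits them into contiguous
// chunks, so keys that share a trie path prefix tend to land on the same
// trie copy and their shared intermediate nodes are only fetched once.
func partitionKeys(keys [][]byte, workers int) [][][]byte {
	sorted := make([][]byte, len(keys))
	copy(sorted, keys)
	sort.Slice(sorted, func(i, j int) bool { return bytes.Compare(sorted[i], sorted[j]) < 0 })

	chunks := make([][][]byte, 0, workers)
	size := (len(sorted) + workers - 1) / workers
	for start := 0; start < len(sorted); start += size {
		end := start + size
		if end > len(sorted) {
			end = len(sorted)
		}
		chunks = append(chunks, sorted[start:end])
	}
	return chunks
}
```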

The benchmarks show the memory allocations, so you can see the concrete numbers there. Overall, the performance is highly dependent on the workload and the workload is highly variable 😞 . We went with the trie copies approach in Coreth (ava-labs/coreth#372), but I've unfortunately dropped the ball on migrating the change upstream 😅

@ilikesymmetry

Thanks for the in-depth response! The links you provided are really handy for seeing that the copied object is really small and simple. In my head I was imagining that the actual node contents of the trie were also being copied over, but this makes a lot more sense.
