Parallel Intermediate Node Fetching (for a single trie) #28266
I'm wondering about a 3rd option.

My main concern with the first option is that it's kind of racey. By that I mean that if I have, say, 2 tries and want to expand sibling leaves, then the entire path will be shared apart from the last branching. If these paths are expanded far apart from each other timewise, then one will warm up the cache and the second might pull it in, true. However, since execution can throw these at the prefetcher fairly quickly, you might end up with both actually working in parallel, both reading from the DB. It would depend on the DB internals as to how well it can short-circuit reads for the same keys, but I'm a bit uncomfortable leaving that to such a low level.

Now, with the other case of doing batch expansions, my concern is that we don't know how many accesses we need to wait for, or what the timings will be between two. If I wait until the EVM execution is done, we might have wasted precious time in which we could have been busy pulling stuff from disk. It gets a bit hard to fine-tune and reason about.

My preferred solution would be if we could have thread-safe prefetching within the trie, so that I can throw as many concurrent loads at it as I want and it would synchronise internally. Still, there are a few catches: I'm unsure what the synchronisation cost would be at a node level, possibly big. And the other thing we want to avoid is starving the main EVM execution because we're hitting the prefetching on gazillions of threads. Limiting the thread count per trie would get ugly fast, and limiting it across all prefetching tries would get ugly even faster :)

IMO, the solution here is probably to try and implement a variety of options and benchmark them against each other. There is no clear winner either complexity- or runtime-wise, so I can't really say "go with X"... X might suck. My proposal would be to implement all variations, at least up to a benchmarkable state, and see how they compare against the current state.
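To make the third option a bit more concrete, here is a minimal sketch of what a deduplicating, concurrency-capped node loader inside the trie could look like. It is not existing geth code; nodeLoader, readNode, and keying by node path/hash are all illustrative assumptions, just one way the per-node synchronisation could be structured:

```go
package prefetch

import "sync"

// result tracks a single in-flight or completed node read.
type result struct {
	blob []byte
	done chan struct{}
}

// nodeLoader deduplicates concurrent reads of the same trie node and caps how
// many DB reads run at once, so many prefetch goroutines can share one trie.
type nodeLoader struct {
	mu       sync.Mutex
	inflight map[string]*result
	sem      chan struct{}
	readNode func(key string) []byte // the underlying (slow) DB read
}

func newNodeLoader(maxReads int, read func(string) []byte) *nodeLoader {
	return &nodeLoader{
		inflight: make(map[string]*result),
		sem:      make(chan struct{}, maxReads),
		readNode: read,
	}
}

// load returns the node stored under key, issuing at most one DB read per key
// no matter how many goroutines ask for it concurrently.
func (l *nodeLoader) load(key string) []byte {
	l.mu.Lock()
	if r, ok := l.inflight[key]; ok {
		l.mu.Unlock()
		<-r.done // another goroutine is (or was) reading this node; wait for it
		return r.blob
	}
	r := &result{done: make(chan struct{})}
	l.inflight[key] = r
	l.mu.Unlock()

	l.sem <- struct{}{} // acquire one of the bounded read slots
	r.blob = l.readNode(key)
	<-l.sem

	close(r.done)
	return r.blob
}
```

The buffered sem channel is where the thread-count tuning concern above would surface: it is a single cap shared by everything hitting this loader, rather than a per-trie limit.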
Thanks for the input! I'd think that the DB would handle concurrent reads for the same node around the same time well, but I could definitely be wrong, and it's hard to say how well without a benchmark. Really appreciate the input; I'll work on taking this further and getting each approach to a benchmarkable state. Do you typically use the benchmarks in https://github.com/ethereum/go-ethereum/blob/master/core/blockchain_test.go, or do you have a strong preference for bootstrapping to compare performance?
@karalabe is this open to take up? I'd love to take a stab at this.
WIP for this is here: https://github.com/ethereum/go-ethereum/compare/master...aaronbuchwald:go-ethereum:trie-batch-strategies?expand=1. I'm working on setting up an EC2 to test this out when running on a large leveldb instance (instead of mocked iowait using sleep or a very small leveldb instance). After benchmarking the trie implementation in isolation, I'm going to:
@ameya-deshmukh lmk if you're interested in collaborating. A review of the existing work would be much appreciated (I think there are probably still bugs in ...). If we have ...
Updated the PR with some benchmarks that are isolated to the trie code. The trie DB does not support writing trie nodes to disk except for the normal access pattern, so I implemented some mocked node readers. For the hash based leveldb on a 50GB disk, I'm seeing:
For a mock path-based scheme (looking up trie nodes by path, not actually using the pathdb implementation):
For the lower trie sizes (which are unfortunately unknown to the state prefetcher and during commit), it ends up slowing down the benchmark, while for larger tries it can be a 10x performance improvement. I also added a benchmark for the construction of the internal trie node used to parallelize the ... Also noticing that the ...
Noob question here, but for the first option mentioned, "Fetch in Parallel Using Independent Tries", how fast is copying the entire trie in this line? A full copy feels more expensive than a single storage traversal, so I'm curious how this copying works underneath to stay fast.
Great question. This approach trades some wasted work/resources for simplicity. The copies approach is much simpler than changing the internals of the trie to support parallel operations: we can just take copies of the trie, make no changes to the internals, and run worker threads to fetch the required keys, ensuring that they pull the required trie nodes into the cache of the trie database. If you look at ...
The tracer copy operation copies all of the previously read values, so we need to be sure to use a fresh trie so that the tracer is empty. The copy itself duplicates the immutable root node (changes to the trie replace the root node instead of modifying it), so it is fairly cheap.

We do pay some additional memory allocations here, since prefetching across multiple tries may result in prefetching identical intermediate nodes. In other words, if keys A and B are divided onto two different trie copies and their paths through the trie share intermediate nodes, then both trie copies will fetch the same intermediate nodes. If we wanted to mitigate this waste, we could try to sort the keys and divide them lexicographically, so that keys likely to share the same intermediate nodes are assigned to the same trie copy. Unfortunately, hashing occurs inside the trie, not in the statedb, so it's hard to handle that from here. The benchmarks show the memory allocations, so you can see the concrete numbers there.

Overall, the performance is highly dependent on the workload, and the workload is highly variable 😞. We went with the trie copies approach in Coreth (ava-labs/coreth#372), but I've unfortunately dropped the ball on migrating the change upstream 😅
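For illustration, a minimal sketch of the lexicographic partitioning idea mentioned above, assuming the secure-hashed keys were somehow available at this layer (which, as noted, they currently are not); partitionKeys is hypothetical, not geth code:

```go
package prefetch

import "sort"

// partitionKeys sorts the hashed keys and returns one contiguous chunk per
// worker, so keys that share a path prefix mostly end up on the same trie
// copy and their shared intermediate nodes are only fetched once.
func partitionKeys(hashedKeys [][]byte, workers int) [][][]byte {
	if workers < 1 {
		workers = 1
	}
	sort.Slice(hashedKeys, func(i, j int) bool {
		return string(hashedKeys[i]) < string(hashedKeys[j])
	})
	size := (len(hashedKeys) + workers - 1) / workers
	if size == 0 {
		return nil // nothing to prefetch
	}
	chunks := make([][][]byte, 0, workers)
	for start := 0; start < len(hashedKeys); start += size {
		end := start + size
		if end > len(hashedKeys) {
			end = len(hashedKeys)
		}
		chunks = append(chunks, hashedKeys[start:end])
	}
	return chunks
}
```

Each chunk would then be handed to its own trie copy, exactly as in the unsorted case; only the assignment of keys to copies changes.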
Thanks for the in-depth response! The links you provided are really handy for seeing that the copied object is really small and simple. In my head I was imagining that the actual node contents of the tree were also being copied over, but this makes a lot more sense.
Rationale
Currently, when go-ethereum builds/executes a block, it processes transactions in the EVM (using the O(1) snapshot to perform DB reads as needed) and then needs to apply all of the state changes to the trie and compute a new state root.
Typically, the state shuffling is significantly more expensive than the actual EVM processing time (although for the new path-based scheme with plenty of RAM this is close to no longer being the case: https://twitter.com/peter_szilagyi/status/1708013385558671662).
When executing transactions in the EVM, the stateDB uses prefetching logic to attempt to pre-warm the trie's cache with all of the intermediate nodes that will need to be shuffled around when it's time to commit the statedb and compute a new state root. If the intermediate nodes are not already in memory, this can result in a large number of DB reads, which can account for a large share of the time spent on state shuffling.
For large storage tries, the performance of state fetching can be much lower, since the trie implementation is not safe for concurrent use (https://github.com/ethereum/go-ethereum/blob/v1.13.1/trie/trie.go#L37). Although each storage trie gets its own prefetching goroutine (https://github.com/ethereum/go-ethereum/blob/master/core/state/trie_prefetcher.go#L153), if the majority of the workload comes from a single storage trie, then prefetching can be extremely inefficient.
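Schematically, the access pattern described above looks roughly like the following (not geth's actual trie_prefetcher code; resolveKey is a stand-in for a trie lookup that warms the path to a key):

```go
package prefetch

import "sync"

// resolveKey stands in for a trie lookup that warms the path to a key.
type resolveKey func(key []byte)

// prefetchPerTrie mirrors the pattern described above: one goroutine per
// trie, resolving its keys sequentially. A block dominated by a single large
// storage trie therefore degenerates into one serial chain of DB reads.
func prefetchPerTrie(tries map[string]resolveKey, keys map[string][][]byte) {
	var wg sync.WaitGroup
	for owner, resolve := range tries {
		wg.Add(1)
		go func(resolve resolveKey, ks [][]byte) {
			defer wg.Done()
			for _, k := range ks {
				resolve(k) // sequential within a single trie
			}
		}(resolve, keys[owner])
	}
	wg.Wait()
}
```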
Ideally, we can parallelize the DB reads to cache all of the intermediate nodes in memory prior to committing the statedb.
Implementation
There are two different approaches that I've played around with and I'm happy to complete either implementation. I wanted to put this issue up first to get feedback on whether this change makes sense to the geth team and if so, which implementation would be preferred.
The difficult part is finding a way to work around the fact that the trie implementation is not safe for concurrent use and refactoring it to support concurrent operations from an external caller would be a very non-trivial re-write.
Much better to work completely around that problem.
Fetch in Parallel Using Independent Tries
Each individual trie is not safe for concurrent use, but we can instead create multiple instances of the same trie in order to fetch all of the requested keys in parallel and pull all of the necessary intermediate trie nodes from disk into the trie database's cache.
Although the instance of the trie held by the statedb may not have all of its nodes fully expanded, this approach would fetch all of the intermediate nodes into the trie database's cache, so that when it comes time to commit the trie, each trie node is already in memory.
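A rough sketch of the idea (not the linked branch itself); prefetchTrie is a stand-in interface for the two operations relied on, and the round-robin key split is arbitrary:

```go
package prefetch

import "sync"

// prefetchTrie is a stand-in for the two operations relied on here;
// it is not a real geth type.
type prefetchTrie interface {
	Copy() prefetchTrie             // cheap: copies the root pointer, not every node
	Get(key []byte) ([]byte, error) // resolving a key warms its path of intermediate nodes
}

// warmCache fans the keys out over independent copies of the trie; each copy
// is only ever touched by one goroutine, so the trie itself needs no locking.
func warmCache(t prefetchTrie, keys [][]byte, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		cp := t.Copy() // private, non-thread-safe copy for this worker
		wg.Add(1)
		go func(cp prefetchTrie, offset int) {
			defer wg.Done()
			for j := offset; j < len(keys); j += workers { // arbitrary round-robin split
				cp.Get(keys[j]) // value ignored; the node reads are the point
			}
		}(cp, i)
	}
	wg.Wait()
}
```

All copies share the same underlying trie database, so whichever copy resolves a node first leaves it cached for the statedb's own trie at commit time.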
I put up a rough implementation of what it would look like to add this into updateTrie within each state object here: master...aaronbuchwald:go-ethereum:parallel-fetch-update-trie. It may make more sense to move this to the prefetcher code, but this was much simpler as a proof of concept.
Support BatchCacheKeys/BatchPut Internally to the Trie
Much easier than parallelizing the trie for external callers is to support parallel operations internal to the trie!
Instead of re-writing the trie to support concurrency, we can construct a trie out of the get/put operations that we want to apply.
For BatchCacheKeys, we can construct a trie from the requested keys and arbitrary non-empty values (as required by the current trie code, since an empty value is treated as a delete). Then we can traverse the actual trie and our batchTrie. By traversing both tries, we can serve all of the requests encoded in the batch trie and parallelize different sub tries that do not depend on each other.
For example:

t:  root r, with subtrie A under nibble 0 and subtrie B under nibble 1
t': root r', with subtrie A' under nibble 0 and subtrie B' under nibble 1
In this case, we traverse r and r' and see they both have children at nibbles 0 and 1. Therefore, we can apply A' to A and B' to B in parallel since they are completely independent. Once both sub tries have completed, we can apply the resulting update to get r''.

The same logic can be applied to either BatchGet or BatchPut operations to either warm the cache or directly parallelize the put operation.

I have a WIP for this implementation here: https://github.com/aaronbuchwald/go-ethereum/blob/trie-batch/trie/trie_parallel.go#L43, but since there are 4 different node types to deal with, my current implementation is much more complex than I'd like, and I am still running into a bug in my fuzz test when dealing with values of length 0 (which should be treated as deletes, but still pass the test unless I'm missing something).
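To show the shape of the traversal, here is a heavily simplified sketch: a single node shape (16 children, one per nibble) instead of the four node types in the real trie, which is where most of the complexity in the linked WIP lives; resolve is a stand-in for loading a node from the trie database (and thereby warming the cache):

```go
package prefetch

import "sync"

// node is a heavily simplified stand-in for a trie node: 16 children, one per
// nibble, with nil marking an absent child.
type node struct {
	children [16]*node
}

// parallelWalk descends the live trie along every path present in the batch
// trie; subtries under different nibbles are independent, so they are
// expanded concurrently.
func parallelWalk(live, batch *node, resolve func(*node) *node) {
	if live == nil || batch == nil {
		return
	}
	live = resolve(live) // pull this node into memory if it isn't already
	var wg sync.WaitGroup
	for nib := 0; nib < 16; nib++ {
		l, b := live.children[nib], batch.children[nib]
		if l == nil || b == nil {
			continue // nothing requested (or nothing stored) under this nibble
		}
		wg.Add(1)
		go func(l, b *node) {
			defer wg.Done()
			parallelWalk(l, b, resolve)
		}(l, b)
	}
	wg.Wait()
}
```

A real version would also cap the number of goroutines and handle the short/leaf/hash node cases, which is exactly the extra complexity mentioned above.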
At least as a first change, I think the first solution is much better: it's significantly simpler, and since DB reads will be the dominating factor, it should be very similar in performance. Also, huge thanks to @dboehm-avalabs for coming up with this much simpler approach here: ava-labs/avalanchego#2128.