all: implement path based storage #23733
Conversation
This comment tries to explain the design of the path-based storage scheme at a high level.

Path based storage scheme

The new state scheme is based on the node path: all nodes are saved to disk with their path as the key. This means there is only a single version of the trie stored on disk, and it is updated in place whenever the chain makes progress. The database key of a trie node is encoded from the node path. To be precise, the key format can be represented by this diagram (the account hash part is only present for storage trie nodes):
+------------+-----------------------+-------------------------------------+
| Key Prefix | 32 bytes account hash | Suffix compact encoding(node path)  |
+------------+-----------------------+-------------------------------------+

Key prefix

The key prefix is a single byte which indicates the database entry type. Since the node key length is not fixed, the prefix is needed to tell trie node entries apart from other database entries. In the implementation, the key prefix is chosen as TrieNodePrefix (the byte "w", see the review discussion below).

Storage namespace

In Ethereum there are two layers of tries: the account trie and a bunch of storage tries. Thus, for a node in a storage trie, the real path consists of two parts: the path in the account trie and the path in the storage trie. The account hash (the path of the node in the account trie) is therefore encoded into the database key.

Internal path

The last part is the internal path of the node within its trie, stored in the suffix-compact encoding shown in the diagram. Note that keys sharing a prefix are compressed well by LevelDB, which is the main reason for choosing this encoding scheme.
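To make the layout concrete, here is a rough Go sketch of how such a key could be assembled. The helper names are invented, the path is assumed to already be suffix-compact encoded, and the prefix byte is taken from the review discussion below; treat this as an illustration, not the PR's actual code.

// Hypothetical sketch of the key layout described above.
package main

import "fmt"

// trieNodePrefix stands in for the single-byte entry-type prefix
// (TrieNodePrefix, the byte "w", per the review discussion below).
var trieNodePrefix = []byte("w")

// accountTrieNodeKey builds the key of an account-trie node: prefix + path.
func accountTrieNodeKey(compactPath []byte) []byte {
	return append(append([]byte{}, trieNodePrefix...), compactPath...)
}

// storageTrieNodeKey builds the key of a storage-trie node:
// prefix + 32-byte account hash (the owner) + path.
func storageTrieNodeKey(accountHash [32]byte, compactPath []byte) []byte {
	key := append(append([]byte{}, trieNodePrefix...), accountHash[:]...)
	return append(key, compactPath...)
}

func main() {
	var owner [32]byte
	fmt.Printf("account node key: %x\n", accountTrieNodeKey([]byte{0x11, 0x23}))
	fmt.Printf("storage node key: %x\n", storageTrieNodeKey(owner, []byte{0x11, 0x23}))
}

Since sibling nodes share the owner hash and a path prefix, their keys end up adjacent in LevelDB, which is what makes the prefix compression mentioned above effective.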
With the new storage scheme, only one version of the trie node belonging to a specific trie path is saved on disk. The entire storage key encoding is implemented in core/rawdb.

Alternatives

There are a few other encoding schemes; check the proposal for more details. Suggestions are appreciated!

In-memory trie nodes

In order to serve shallow reorgs (shorter than the depth of the persistent singleton trie), a bunch of trie nodes need to be maintained in memory for a while and flushed to disk only once they are mature enough (at least 128 block confirmations). The structure managing all in-memory nodes lives in trie.Database, which offers a few functionalities on top of the layers described below.
The structure can be represented by this diagram. Just like the snapshot tree, trie nodes belonging to different states are organized into different layers, and layers are linked to each other by a parent-child relationship. At most 128 diff layers are maintained; the bottom-most diff layer is flushed into disk once too many nodes have accumulated.

+-----------------+-----------------+
| State Y | State Y' |
+-----------------+-----------------+
| State X | State X' |
+-----------------------------------+
| ....... |
+-----------------------------------+
| Bottom-most diff layer |
+-----------------------------------+
| Disk layer(singleton trie) |
+-----------------------------------+

Trie changes

Stateless commit

Previously, the dirty nodes were pushed into the database's dirty set directly during Commit. However, in the new scheme the nodes committed for a block have to form a standalone state diff, so the commit procedure is changed a bit. Instead of pushing committed nodes blindly into the dirty set, the committed nodes are returned by the trie, and callers need to aggregate the results from multiple tries (account trie, storage tries) into a complete state diff for building a new diff layer. Check the linked code for more implementation details.

Self-contained dirty set

After the commit operation, a result object is generated which contains all dirty nodes. Besides returning the result object to the outside, the trie itself also needs to maintain a dirty node set in order to access the latest state. So an additional set is added inside the trie, and all the dirty nodes created since the trie's creation are recorded there. Note that this dirty set is never released unless the trie object is deallocated.

State sync changes

Retire fast sync

In the new scheme the trie nodes are stored by path, so accessing trie nodes via hash is impossible, which breaks fast sync. Retiring fast sync is not done by this PR, but I believe that by the time we try to merge this one, it will already be retired.

Twist snap sync

In the snap sync healing stage, the standard procedure for downloading missing nodes
is slightly changed in the new scheme,
so that stale state with the same path is silently overwritten by the state syncer, and the synced state eventually strictly matches the latest version. Implementation-wise, a new key scheme, the internal key, is introduced. It is widely used in the project (e.g. the state syncer, trie.Database, the sync bloom) and can distinguish different versions of nodes (same path, different hash).

Upgrade path

Upgrading from the legacy scheme to the path-based scheme requires a re-sync in theory. However, a trick can be applied to avoid it (but it's still recommended to do a re-sync in order to slim down your node).
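To make the internal key idea concrete, here is a hedged Go sketch. The exact byte layout is an assumption; the only property taken from this thread is that both the path-based storage key and the node hash can be recovered from it, as DecodeInternalKey does in the diffs further down.

// Rough sketch only; the PR's real internal-key helpers (DecodeInternalKey
// appears in the diffs below) may pack the fields differently.
package main

import "fmt"

const hashLength = 32 // size of a node hash in bytes

// encodeInternalKey appends the node hash to the path-based storage key.
func encodeInternalKey(storageKey []byte, hash [hashLength]byte) []byte {
	return append(append([]byte{}, storageKey...), hash[:]...)
}

// decodeInternalKey splits an internal key back into storage key and hash.
func decodeInternalKey(key []byte) (storageKey []byte, hash [hashLength]byte) {
	n := len(key) - hashLength
	copy(hash[:], key[n:])
	return key[:n], hash
}

func main() {
	var h [hashLength]byte
	h[0] = 0xde
	key := encodeInternalKey([]byte("w\x11\x23"), h)
	path, hash := decodeInternalKey(key)
	fmt.Printf("path=%x hash=%x\n", path, hash)
}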
Impressive piece of work!
It's going to take a while to go through though :)
core/rawdb/database.go
Outdated
case bytes.HasPrefix(key, TrieNodePrefix):
	tries.Add(size)
case len(key) == common.HashLength:
	archiveTries.Add(size)
Won't 1/256th of all archive tries be mis-tagged as tries?
What do you mean? :P
Legacy nodes and archive nodes are all stored with the node hash. I think it's fine to put the legacy nodes into the archive node category.
I mean that out of all un-prefixed 32-byte keys that are raw trie nodes (archive), 1/256 will have their first byte be w, so they will be tagged as prefixed tries instead of raw tries.
It's not highly important here, but in general those types of errors can be messy to debug.
Ah, yes, you are right. It's really annoying to handle the legacy-format trie nodes. I think we have had this issue all along, even in the current master. The archive nodes in the current code should only be a small part (genesis) in most cases, but yes, let's try to make it better.
It would be interesting to test this approach in this context: #22497 (comment)
Actually, in core.StateDB we won't explicitly clean out the storage of a deleted state object. So IIUC, nothing different will happen. It's an interesting question for me: how should we handle the destructed account's storage? It makes no sense to leak it out on disk, but a proper solution needs to be found to avoid unexpected OOM.
Hm, so in the current implementation, the storage trie will be left untouched after the deletion? If so, what will happen if the contract (say Contract A) is destructed and later re-created?
How do you avoid the accidental resurrection of the storage slots (which should now be cleared) if you haven't deleted them by-path?
Actually, it's OK to leave the deleted state data on disk. In the case you described above, Contract A is re-created with an empty storage root hash, so the leaked storage data won't be accessed since the node hash won't match. More specifically: in trie.Database's diffLayer, nodes are retrieved with the internal key, which includes the node hash. https://github.com/sscodereth/go-ethereum/blob/path-based-storage/trie/difflayer.go#L116 In trie.Database's diskLayer, nodes are retrieved with the storage key (path), but the data hash is still compared against the requested one. https://github.com/sscodereth/go-ethereum/blob/path-based-storage/trie/disklayer.go#L81 Going back to the scenario: the re-created Contract A won't load any data while resolving a non-existent root node, so the leftover storage data won't be accessed at all.
But we definitely don't want to leak out any deleted data; I will improve this part for sure. I just want to clarify that even if deleted data is leaked, correctness won't be affected.
But actually I don't fully understand this. If we want to delete the storage slots of a huge contract, the disk write will be chunked into several small writes. The total disk write can still be huge, but I'm not sure it will lead to OOM. Relevant code:

// Ensure we don't delete too much data blindly (contract can be
// huge). It's ok to flush, the root will go missing in case of a
// crash and we'll detect and regenerate the snapshot.
if batch.ValueSize() > ethdb.IdealBatchSize {
if err := batch.Write(); err != nil {
log.Crit("Failed to write storage deletions", "err", err)
}
batch.Reset()
}
I read the whole PR, but I feel I need some more time for things to sink in. From what I can see, it will not conflict too much with my verkle tree work, but adapting it to the new multi-layer format will definitely require a lot of work. In spite of this, I think this is a welcome change and a lot of great work.
I see a lot of the snapshot code has been duplicated. With your model, does the "old" snapshot still make sense, if so how, and if not, should it not be deleted?
@gballet thanks for your feedback!
Yes, the multi-layer structure is duplicated, since I found it's a perfect data structure for handling mini reorgs in Ethereum. The snapshot is still needed since it offers direct state access. I have actually been thinking about abstracting the multi-layer structure out to remove the duplication, but I don't think it's a good time to do that right now, for two reasons: (1) it's already a huge change, and refactoring would make it even harder to review; (2) I still have to finish the reverse-diff part, and it would be nice to give priority to that first. But you are totally right, the duplication is ugly :) I will try to abstract it out a bit later.
Yes, it's changed a bit. For example in trie, originally the dirty nodes were pushed into disk directly in Commit, but they are now wrapped in the CommitResult and returned to the caller. It might affect your verkle trie.
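For illustration only, here is a minimal sketch of the caller-side aggregation this implies; the commitResult shape and the merge helper are invented for the sketch and are not the PR's actual CommitResult API.

// Illustrative only: the PR's real CommitResult may carry different fields.
package main

import "fmt"

// commitResult stands in for the object returned by a trie Commit, carrying
// the dirty nodes keyed by their path-based storage key (nil blob = deleted).
type commitResult struct {
	nodes map[string][]byte
}

// mergeCommitResults aggregates the account-trie result and all touched
// storage-trie results into one state diff for building a new diff layer.
func mergeCommitResults(results ...*commitResult) map[string][]byte {
	diff := make(map[string][]byte)
	for _, res := range results {
		for key, blob := range res.nodes {
			diff[key] = blob
		}
	}
	return diff
}

func main() {
	accountTrie := &commitResult{nodes: map[string][]byte{"w\x11": {0xc0}}}
	storageTrie := &commitResult{nodes: map[string][]byte{"w<owner>\x22": {0xc1}}}
	diff := mergeCommitResults(accountTrie, storageTrie)
	fmt.Println(len(diff), "dirty nodes collected for the new diff layer")
}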
- Fix non-existent node resolving in Node API
- Add benchmarks for NodeBlob API
- Use the node hash as the bloom filter id
- Change parameters for Trie.Database
This brings down

diff --git a/trie/difflayer.go b/trie/difflayer.go
index 518ed76915..1debf377c1 100644
--- a/trie/difflayer.go
+++ b/trie/difflayer.go
@@ -215,27 +215,20 @@ func (dl *diffLayer) NodeBlob(key []byte) ([]byte, error) {
// all the maps in all the layers below
_, hash := DecodeInternalKey(key)
dl.lock.RLock()
- hit := dl.diffed.ContainsHash(binary.BigEndian.Uint64(hash.Bytes()))
- var origin *diskLayer
- if !hit {
- origin = dl.origin // extract origin while holding the lock
+ if dl.diffed.ContainsHash(binary.BigEndian.Uint64(hash.Bytes())) {
+ d, err := dl.nodeBlob(key, 0)
+ dl.lock.RUnlock()
+ return d, err
}
dl.lock.RUnlock()
-
// If the bloom filter misses, don't even bother with traversing the memory
// diff layers, reach straight into the bottom persistent disk layer
- if origin != nil {
- triedbBloomMissMeter.Mark(1)
- return origin.NodeBlob(key)
- }
- return dl.nodeBlob(key, 0)
+ triedbBloomMissMeter.Mark(1)
+ return dl.origin.NodeBlob(key)
}
// nodeBlob is the inner version of NodeBlob which counts the accessed layer depth.
func (dl *diffLayer) nodeBlob(key []byte, depth int) ([]byte, error) {
- dl.lock.RLock()
- defer dl.lock.RUnlock()
-
// If the layer was flattened into, consider it invalid (any live reference to
// the original should be marked as unusable).
	if dl.Stale() {
I pushed an error-check into the benchmark.
Hm, funky... It builds fine if I remove Node from the snapshot interface:

diff --git a/trie/database.go b/trie/database.go
index aa6bc80b68..52cba6f2fc 100644
--- a/trie/database.go
+++ b/trie/database.go
@@ -99,7 +99,7 @@ type snapshot interface {
// Node retrieves the trie node associated with a particular key. The
// passed key should be encoded in internal format with hash encoded.
// No error will be returned if the node is not found.
- Node(internalKey []byte) (node, error)
+ //Node(internalKey []byte) (node, error)
// Parent returns the subsequent layer of a snapshot, or nil if the base was
// reached.
diff --git a/trie/difflayer.go b/trie/difflayer.go
index 518ed76915..c1f3c92cbf 100644
--- a/trie/difflayer.go
+++ b/trie/difflayer.go
@@ -155,59 +155,6 @@ func (dl *diffLayer) Stale() bool {
return atomic.LoadUint32(&dl.stale) != 0
}
-// Node retrieves the trie node associated with a particular key.
-// The given key must be the internal format node key.
-func (dl *diffLayer) Node(key []byte) (node, error) {
- // Check the bloom filter first whether there's even a point in reaching into
- // all the maps in all the layers below
- _, hash := DecodeInternalKey(key)
- dl.lock.RLock()
- hit := dl.diffed.ContainsHash(binary.BigEndian.Uint64(hash.Bytes()))
- var origin *diskLayer
- if !hit {
- origin = dl.origin // extract origin while holding the lock
- }
- dl.lock.RUnlock()
-
- // If the bloom filter misses, don't even bother with traversing the memory
- // diff layers, reach straight into the bottom persistent disk layer
- if origin != nil {
- triedbBloomMissMeter.Mark(1)
- return origin.Node(key)
- }
- return dl.node(key, 0)
-}
-
-// node is the inner version of Node which counts the accessed layer depth.
-func (dl *diffLayer) node(key []byte, depth int) (node, error) {
- // If the layer was flattened into, consider it invalid (any live reference to
- // the original should be marked as unusable).
- if dl.Stale() {
- return nil, ErrSnapshotStale
- }
- // If the trie node is known locally, return it
- if n, ok := dl.nodes[string(key)]; ok {
- triedbDirtyHitMeter.Mark(1)
- triedbDirtyNodeHitDepthHist.Update(int64(depth))
- triedbBloomTrueHitMeter.Mark(1)
-
- // The trie node is marked as deleted, don't bother parent anymore.
- if n == nil {
- return nil, nil
- }
- triedbDirtyReadMeter.Mark(int64(n.size))
- _, hash := DecodeInternalKey(key)
- return n.obj(hash), nil
- }
- // Trie node unknown to this diff, resolve from parent
- if diff, ok := dl.parent.(*diffLayer); ok {
- return diff.node(key, depth+1)
- }
- // Failed to resolve through diff layers, mark a bloom error and use the disk
- triedbBloomFalseHitMeter.Mark(1)
- return dl.parent.Node(key)
-}
-
// NodeBlob retrieves the trie node blob associated with a particular key.
// The given key must be the internal format node key.
func (dl *diffLayer) NodeBlob(key []byte) ([]byte, error) {
So -- we could choose to just store it as blobs, instead of storing as references to node objects.
I tested implementing that, seems pretty straight-forward: holiman@81af9ab
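For readers following along, a minimal sketch of what such a blob-only layer could look like; the type and field names are invented here, and the linked commit is the actual experiment.

// Sketch of the blob-only idea: a diff layer keeping raw node blobs instead
// of decoded node objects.
package main

import (
	"errors"
	"fmt"
)

var errNotCovered = errors.New("not covered by the diff layers")

// blobDiffLayer maps internal keys to raw RLP blobs (nil = deleted here).
type blobDiffLayer struct {
	parent *blobDiffLayer
	blobs  map[string][]byte
}

// nodeBlob walks up the layer chain and returns the first blob found,
// never decoding it into a node object along the way.
func (dl *blobDiffLayer) nodeBlob(key []byte) ([]byte, error) {
	if blob, ok := dl.blobs[string(key)]; ok {
		return blob, nil // may be nil if the node was deleted in this layer
	}
	if dl.parent != nil {
		return dl.parent.nodeBlob(key)
	}
	// A real implementation would fall through to the disk layer here.
	return nil, errNotCovered
}

func main() {
	bottom := &blobDiffLayer{blobs: map[string][]byte{"a": {0xc0}}}
	top := &blobDiffLayer{parent: bottom, blobs: map[string][]byte{"b": {0xc1}}}
	blob, err := top.nodeBlob([]byte("a"))
	fmt.Printf("blob=%x err=%v\n", blob, err)
}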
blob, nodeHash := rawdb.ReadTrieNode(dl.diskdb, path)
if len(blob) == 0 || nodeHash != hash {
	blob = rawdb.ReadArchiveTrieNode(dl.diskdb, hash)
	if len(blob) != 0 {
		triedbFallbackHitMeter.Mark(1)
		triedbFallbackReadMeter.Mark(int64(len(blob)))
	}
}
I'm trying to understand this. In the 'normal' case, will this happen e.g. if someone does an ether-transfer to a non-existing account?
If so, it seems pretty expensive if every non-existing trie access triggers an additional ReadArchiveTrieNode?
But I'm not sure I fully understand what situations would trigger this to happen.
Yes, it's true. All the non-existing entries will trigger an additional ReadArchiveTrieNode. It acts as the fallback, mainly for the nodes upgraded from the legacy version.
force-pushed from d0fe0d4 to 9ea7c24
But it's not used though, is it?
Happened during shutdown on one of the benchmarkers:
Closing in favour of #24597
This PR implements the path-based storage scheme. Note, it's a WIP PR; it is opened just to make sure it's on the right track.