-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(grandfather): Snapshotted trees #3207
Comments
Here is how we could do this: 1. Snapshot approach E.g. every 5 blocks we would copy the DB and then we would replay the changes if we need to. So if I requested a sibling path of a leaf at block 8, I would get the db at block 5 and apply changes to it from block 6, 7 and 8. These changes are stored in the archival node so it shouldn't be that hard. Advantage of this approach is that it's quite simple but a big disadvantage is that it results in a lot of storage. 2. Storing-changes approach What we could do is expand this key with block number so (tree_name, level, index) would become (tree_name, block_number, level, index). But this begs a question of how would we figure out at would blocks each node was changed upon sibling path retrieval. That is, if I want to fetch a leaf at index 6 and its sibling path at block 10 how would I determine which keys to use? 2.1 Append-only tree 2.2 Sparse tree Bellow is an example flow of how fetching a leaf and a sibling path would work when trying to get a leaf at index 9 at block 259.
It's quite likely that the search back could be performed very efficiently using bitwise operations. Overall I think it doesn't make sense to go with approach 1. because it's too inefficient, replaying can be tricky, and approach 2. is not that hard to implement. I think for the append-only tree the approach is quite elegant but I have doubts about the sparse tree. If someone is aware of a better way how to do this LMK. I looked at what Erigon is doing and their optimization seem to be too low-level and hence irrelevant for sandbox. Here are their docs for anyone curious. @iAmMichaelConnor we have a nice problem here which is blocking quite a lot of stuff 😆 |
Not sure we actually need a lot of input from @iAmMichaelConnor since this is more a implementation "detail" than something that changes the protocol. Essentially how do we store it efficiently for our case, but it don't change much in the protocol itself. I think going with solution 1 might be fine for the sandbox, where the lifetime of the chain is somewhat limited, so the data that we need to store for each snapshot is not huge. When we are talking kb/mb of storage, it seems like a no-brainer to me as it should be possible to do quite quickly (I say naivly). Since our nullifier updates are also quite funky from the tree view, it seems like spending a lot of engineering power here now is over-engineering 🤷. For a production note, it would be horrible to do this, but I think it is workable here. For solution 2, it sounds like you essentially end up with something that is like a dynamic predecessor problem? (Predecessor problem, finding the largest element |
Fun problem! I love implementing tree stuff! An interesting future enhancement would be to enable a pxe to subscribe to hash path changes (each block) for the notes in the user's db. E.g. if a user owned a note, they could subscribe to the hash paths of the "low nullifier(s)" which 'jump over' that note's nullifier. This would allow a user to prove "this note wasn't nullified yet, in historic block b", without having to run an archive node. |
I think an alternative would be to store the tree nodes in the database as content-addressed entries (e.g. like pointers in memory) where the key is the actual value (hash) of that node and the value of the entry is a tuple of the two children that make up the node. This would allow us to structurally share subtrees between different blocks (e.g. if something changes only the right subtree then the new root can point to the old left-subtree and the new right-subtree). Getting the value of a leaf at a particular index at a particular block number comes down to traversing the tree down from the root (conveniently this would also give us the merkle path). To get the root of a subtree we could hold known references, e.g. Example with an append tree Initially the tree and database are both empty (don't store null hashes) Appending Appending This becomes much more efficient with batched inserts because in the database we only need to keep the end state of the tree for each block. Sparse tree This would work the same way for sparse trees. I'll have a think about how much storage this approach would take in the database. WDYT? |
I quite like it. Think it have similar performance to the ideas from @benesjan with append-only, but for sparse and nullifier trees it might go crazy with nodes to be updated. I think we can could use it as the base idea of the trees, but would probably still need something like the store only every |
I think for an append-only tree the only thing that needs to be stored per snapshot is up to which index we've written leaves to. Since the tree only ever grows by adding leaves and historic values never change, we just have to check if a leaf's index was set at a particular block number (ie leaf index < max leaf index at block) and maybe the tree root for good measure. Am I missing something? This sounds really cheap and efficient. Merkle proof would be a bit more involved though since we can't compute it on the most recent tree. |
The reason we want the snapshots is to get the sibling paths for merle inclusion proofs such that we can prove that they were part of historic state. |
This PR adds two different algorithms to create snapshots of merkle trees. ## Notation N - number of non-zero nodes H - height of tree M - number of snapshots ## ~Incremental~ Full snapshots This algorithm stores the trees in a database instance in the same way they would be stored using pointers in memory. Each node has two key-value entries in the database: one for its left child and one for its right child. Taking a snapshot means just walking the tree (BF order), storing any new nodes. If we hit a node that already exists in the database, that means the subtree rooted at that node has not changed and so we can skip checking it. Building a sibling path is just a traversal of the tree from historic root to historic leaf. Pros: 1. its generic enough that it works with any type of merkle tree (one caveat: we'll need an extension to store historic leaf values for indexed-trees) 2. it shares structure with other versions of the tree: if some subtree hasn't changed between two snapshots then it will be reused for both snapshots (would work well for append-only trees) 3. getting a historic sibling path is extremely cheap only requiring O(H) database reads Cons: 1. it takes up a space to store every snapshot. Worst case scenario it would take up an extra O(N * M) space (ie. the tree changes entirely after every snapshot). For append-only trees it would be more space-efficient (e.g. once we start filling the right subtree, the left subtree will be shared by every future snapshot). More details in this comment #3207 (comment). ## Append-only snapshots For append-only trees we can have an infinite number of snapshots with just O(N + M) extra space (N - number of nodes in the tree, M - number of snapshots) at the cost of sibling paths possibly requiring O(H) hashes. This way of storing snapshots only works for `AppendOnlyTree` and exploits two properties of this type of tree: 1. once a leaf is set, it can never be changed (i.e. the tree is append-only). This property can be generalised to: once a subtree is completely filled then none of its nodes can ever change. 2. there are no gaps in the leaves: at any moment in time T we can say that the first K leaves of the tree are filled and the last 2^H-K leaves are zeroes. The algorithm stores a copy of the tree in the database: - the last snapshot of that node (ie v1, v2, etc) - the node value at that snapshot Taking the snapshot is also a BF traversal of the tree, comparing the current value of nodes with its previously stored value. If the value is different we update both entries. If the values are the same then the node hasn't changed and by the first property, none of the nodes in its subtree have changed so it returns early. For each snapshot we also store how many leaves have been filled (e.g. at v1 we had 10 leaves, at v2 33 leaves, etc) Building a sibling path is more involved and could potentially require O(H) hashes: 1. walk the historic tree from leaf to root 2. check the sibling's last snapshot - if it's before the target snapshot version then we can safely use the node (e.g. node A was last changed at time T3 and we're building a proof for the tree at time T10, then by property 1 we can conclude that neither node A nor the subtree rooted at A will ever change at any moment in time Tn where n > 3) - if the node has changed then we have to rebuild its historic value at the snapshot we're interested in - to do this we check how "wide" the subtree rooted at that node is (e.g. what's the first leaf index of the subtree) and if it intersects with the filled leaf set at that snapshot. If it doesn't intersect at all (e.g. my subtree's first leaf is 11 but only 5 leaves were filled at the time) then we can safely conclude that the whole subtree has some hash of zeros - if it does intersect go down one level and level apply step 2 again. - the invariant is: we will either reach a leaf and that leaf was changed in the version we're interested in, or we reach a node that was changed before the version we're interested in (and we return early) or we reach a node that was historically a hash of zero Two (big) SVGs showing the sibling path algorithm step-by-step [Average case](https://github.com/AztecProtocol/aztec-packages/assets/3816165/b87fa6eb-bcf4-42ca-879d-173a76d802bb) [Drawing of sibling path algorithm in the worst case](https://github.com/AztecProtocol/aztec-packages/assets/3816165/dd3788ec-6357-4fab-bf78-3496a2948040) Pros: - low space requirements only needing O(N+M) extra space. Cons: - generating a sibling path involves mixed workload: partly reading from the database and partly hashing. Worst case scenario O(H) db reads + O(H) hashing - doesn't work for `IndexedTree`s because even though it only supports appending new leaves, internally it updates its leaf values. Fix #3207 --------- Co-authored-by: PhilWindle <[email protected]>
The power of the grandfather lies in reading from many different blocks in the same transaction. However, this is a true pain if the trees are not snapshotted, since you effectively need what behave as an Ethereum archive node where you can look up paths of a specific version of the note-hash and other trees.
Without snapshotting you will practically just be using the last block always. Need to figure out how state snapshotting can happen, and make it simple devex, similar to including a param with the block you want to query at.
The text was updated successfully, but these errors were encountered: