Waku Sync #80
Comments
I would focus for now on a simplifying assumption that Sync is between two nodes that already "know" they want to reconcile all cached message hashes - in other words, sync everything. Not only is this how Sync will first function (with the exception of perhaps a sync list per shard), but it focuses our initial efforts on a building block that can be adapted for all these use cases.
Well, it does assume a key-value store with unique addressable keys, message hashes in our case.
But when a diff is found the fingerprint for the "dynamic" subranges would have to be recomputed each time, or am I missing something? This remains my main concern for RBSR scalability if frequently syncing with multiple peers for an ever-growing data set. Not a showstopper, but scalability may be something to model or explicitly verify in experiments.
In our case, assuming regular syncing, most reconciliation would happen on the "right side" for the most recent messages. Presumably this has approximately equal impact on RBSR and Prolly Trees (?), but worth keeping this in mind when considering options.
If this is a simple(ish) integration, this would already be good enough reason to build a POC using RBSR over Prolly Trees IMO.
💯
RBSR can function well without a DB. I'll rephrase: I meant it in the sense that Sync only deals in hashes, not the messages themselves.
Messages are already hashed. At no point does hashing re-occur. The "hash" of a range is the XOR (plus hardening) of all the message hashes.
Yes, good point! RBSR can limit the sync to the most recent messages as well as Prolly trees.
I estimate that ~4-6 months of work would be required to make a Prolly tree Sync. If we use Negentropy, 3-4 weeks might be enough.
In the case of Waku Sync, deletion would not be a use case, right? Only stored messages are synced across, and a node will only ever be missing messages or have additional ones.
Correct. No deletion (only removing the oldest entries based on dimensioning) and no modification of entries.
I've been thinking of adding a new protocol that would be used for this. The idea would be to send our desired max frame size and sync range, then wait for a response. We then compare the response with ours, pick the smaller of the values, and initiate sync.
Not sure I follow - isn't this exactly what negentropy sync does? return a diff list of message hashes?
Perhaps sync range can be a network specification, but indeed - we probably need two nodes to agree on the exact time range to sync for this to work. However, this sounds to me like an extension of the Sync protocol itself rather than a new protocol (i.e. a wrapper and extension on negentropy)? What do you mean by "frame size"?
Yes, but there are 2 lists: hashes I want, and hashes the other node needs.
Maybe just an extension would work.
I too think this should be the approach, wherein we just create a wrapper protocol inside which the negentropy payload is a field. Other fields can be the time range of the sync and max-frame-size for now. If this is the case, I am wondering what the protocol sequence would look like.
@SionoiS In this case, we may want the server side to also create a new negentropy instance for each sync request it handles, because the frameSize is set during creation of the negentropy instance (similar to how we are doing it for each sync method call on the client side). The storage would remain the same across all instances.
But this doesn't need to be communicated separately, right? During the negentropy sync procedure itself, both client and server can know what hashes they need and the other has. This can be achieved by simply using the reconcile method that takes in have and need hashes on the server side as well. We don't need to communicate this separately from client to server.
I think the StoreSync protocol should define something like below:
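For illustration only, a rough sketch of the shape such a wrapper could take; the message and field names below are assumptions, not the actual proposal:

```python
from dataclasses import dataclass

# Hypothetical StoreSync wrapper around the raw negentropy payload.
# Field names and types are illustrative only.
@dataclass
class StoreSyncRequest:
    time_range_start: int      # nanosecond timestamp, inclusive
    time_range_end: int        # nanosecond timestamp, exclusive
    max_frame_size: int        # largest negentropy payload the sender accepts
    negentropy_payload: bytes  # opaque negentropy round payload

@dataclass
class StoreSyncResponse:
    negentropy_payload: bytes  # empty once reconciliation has converged
```

Both sides would presumably take the minimum of the two advertised max frame sizes and the intersection of the two time ranges before starting the negentropy rounds.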
Are you sure? From what I can see of the code, only the initiator can use reconcile with lists as output. This restriction seems arbitrary but there must be a good reason for it... Maybe we could try making some modification to the original code?
What I see is that if we want to verify the messages we receive, we can't just use ... Maybe RLN-Relay could expose a proc that can be used to verify any message? We can't verify them without RLN anyway.
I too agree we need a StoreSync protocol a layer above the sync protocol. I tried to list down its functions here. But I am wondering if message verification can be part of the Store protocol, as any client requesting messages from a Store node can verify if needed. But then again, if a client (other than a store node) is requesting, it can validate using application logic as well. So, maybe we can keep it restricted to StoreSync to start with.
Ok, I see it now in the code. But I'm wondering what the reason could be to differentiate initiator and non-initiator in this case. Looks like, as per negentropy, the initiator is the one that informs the other side of elements that the other side doesn't have.
Maybe it is better to not modify this and rather take the approach you have suggested, i.e. the sync protocol can then inform the peer that it has haveIds which the peer doesn't have.
Yes, but the x in this case is also a factor of the chosen archive policy. If the archive keeps the last 30m then Sync can't be 1h.
Yes, when a Store node starts it should check how long it's been offline. Maybe Store should be responsible for that?
Yes, but this seems lower priority.
Sync should talk to Store and keep the set of messages the same. On the technical side we should use 1 DB if possible. Having an in-memory storage for Sync seems unnecessary. Let's store the negentropy tree in the DB instead.
If gossip were reliable we would not have to create Sync. Work duplication seems inevitable?
Yes, but for now let's just sync with a random peer at a certain interval.
To rate-limit Sync requests, preventing the same peer from touching the same time range would be a good thing. I feel this and the one above should be the responsibility of Sync. @chaitanyaprem WDYT?
For time range negotiation, I feel the simplest would be to always pick the smallest range. As for the closing Sync payload, it should include the server "need" hashes.
@chair28980 This should fall under the Sync epic, not Store v3 message hashes.
Agreed, it should definitely consider the archive policy and should not overshoot it.
I think this should be the responsibility of StoreSync. I am looking at StoreSync as a protocol that helps store nodes to be in sync with each other. So, it would indicate how Store nodes can be in sync considering various scenarios such as this one.
Agreed, this doesn't need to be part of the first iteration of StoreSync. But I see that this would be required eventually.
To start with, maybe we can just let Sync run in parallel to gossip and see. But I see that this would just add redundant traffic in the network, which is why I think we should try to optimize/improve this so that Sync doesn't kick in for messages that are in flight due to gossipsub's own IWANT/IHAVE interaction. But then again, this is more of an optimization and can be taken up in a second iteration.
To start with this is fine. Maybe as part of the next iteration we can include such capabilities.
I agree that rate-limiting can be part of sync protocol implementation itself.
Did not get this. Do you mean when Store purges/deletes the messages, they should be deleted from sync storage as well? Wrt storing sync data in memory, I agree that we should switch to using LMDB since it is already supported.
Agree with this flow, I too kind of envisioned something like this but with some differences. Will analyze the pros and cons of the approaches of using a subrange vs a separate storage per time period for syncData.
I have given some thought to using SubRange vs using separate storages, and I think it is best we follow the approach of using separate storages. Approach:
Pros of this approach:
One change we would have to make to the flow of the Sync protocol with this approach is that the range will always be based on 1 storage duration. This is under the assumption that 2 nodes will not have to sync using a sub-range at all. This should be good to start with. Later we can enhance with the below if required:
I had thought briefly of the alternate approach of using a subrange and I see the following challenges:
Hence let us proceed with the simplified approach of having a mechanism to keep creating new storages and deleting old ones (once they fall out of the sync duration). We can also restrict sync to a single storage size per session.
On this note, maybe we should also change the archive partitioning logic to be in sync with this, wherein partitions are created based on absolute hours of time such as 1 PM, 2 PM etc. rather than relative to the node start time. Then, when a partition is being deleted, we can get an event from the archive and its associated sync storage can also be deleted.
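As a rough sketch of the hour-aligned rotation idea, assuming a manager object of our own (the class and method names are hypothetical) and using a plain set of hashes to stand in for a negentropy storage:

```python
import time

HOUR = 3600  # seconds; one sync storage per absolute hour

class SyncStorageManager:
    """Sketch: rotate per-hour sync storages instead of erasing entries 1 by 1.
    A set of message hashes stands in for a real negentropy storage here."""

    def __init__(self, retention_hours=3):
        self.retention_hours = retention_hours
        self.storages = {}  # absolute hour boundary -> set of message hashes

    def _bucket(self, timestamp):
        return int(timestamp // HOUR) * HOUR  # e.g. 13:00, 14:00, ...

    def insert(self, msg_hash, timestamp):
        self.storages.setdefault(self._bucket(timestamp), set()).add(msg_hash)

    def prune(self, now=None):
        # Drop whole storages that fell out of the sync window; no per-entry
        # deletion and no tree rebalancing needed.
        now = time.time() if now is None else now
        cutoff = self._bucket(now) - self.retention_hours * HOUR
        for bucket in [b for b in self.storages if b < cutoff]:
            del self.storages[bucket]
```

An archive partition-deletion event for a given hour could then simply trigger the corresponding storage removal.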
Good point! I'll read more on RLN; I don't know what it verifies exactly wrt timing. Edit: https://github.com/vacp2p/rfc-index/blob/main/vac/32/rln-v1.md#external-nullifier
I've been thinking and talked with @jm-clius about the sub-range vs separate Negentropy storage question, and I'm not sure it's the best course of action anymore. If we keep many separate storages we will always have a "roll-over" period where at least 2 syncs must be done. The cost of local computation for removing elements from the storage should be small. I think we should use the sub-range functionality of Negentropy and remove old message hashes. A store query could be used to get the hashes and timestamps of messages older than 1h. We could also benefit from adding new functionality to Negentropy.
Do we see a need to merge these storages? This is only used for syncing, and if the sync duration is going to be 1 hour, then why would these need merging?
Considering the below explanation, would the cost of removing elements still be small? As per the negentropy spec, for 1 million entries the max round trips to perform a sync is 3, right? I am wondering if that would add so much latency in comparison to removing all old entries.
It is not just the cost of removing elements from storage; we need to identify which elements are to be removed, which means we need to know the list of hashes for all elements to be removed, which would be an overhead. Hence, having a separate storage would address this problem, as the storage can just be removed.
No matter the time range of each storage, IMO it would be simpler to, at an interval, do a store query for message hashes older than 1h and remove those from the storage. It's all speculation at this point. I guess we could try both methods.
Am I right in my understanding that the main issue here is actually the difficulty of removing old items from a negentropy storage? My understanding is you basically need a list of timestamps and hashes to remove after which you can proceed to erase each item individually. I guess it will be interesting to understand the relative costs/performance impact of:
To me it conceptually makes sense to have a longer-ranged Negentropy storage with occasional maintenance (to get rid of old items) and then to consider using a new SubRange storage for every regular sync. Unless we do some further secondary caching, I think we already know that we're likely to use SubRange, as we never want to consider the latest messages (up to 20s in the past) during a sync. We expect some "real time" jitter for ~20s, while everything older than 20s can be considered settled. We can even ensure that Store Sync can only happen on pre-agreed 20s boundaries 3 times per minute so Sync nodes know exactly what subranges to be prepared for (i.e. at hh:mm:00, hh:mm:20, hh:mm:40).

Now, what size SubRange is feasible to create three times per minute from a larger Negentropy storage? If it's less than 1 hour, should our regular sync be over an even shorter time duration, say 10min? Is there a way to do a partial sync over a Negentropy storage other than SubRange? If it's not feasible to create dynamic SubRanges, we may need our own in-memory list of hashes and timestamps, a queue if you will, which we use as an index to both add and erase items from a single, managed Negentropy storage to ensure it only contains exactly the items we want to sync on.

I'd also be interested in what the cost is to create ad-hoc larger Negentropy storages for e.g. multi-hour sync requests by querying the archive DB. This won't be a regular sync use case, but could be useful for Waku Store providers doing internal sync over longer time periods for a limited number of Store nodes.
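A small sketch of the pre-agreed 20-second boundary idea described above; the helper name and the 10-minute window are assumptions for illustration:

```python
SYNC_STEP = 20  # seconds; syncs start only at hh:mm:00, hh:mm:20, hh:mm:40

def latest_sync_window(now, window=600):
    """Return (start, end) of the sub-range to sync: here the last 10 minutes,
    ending at the most recent 20 s boundary so in-flight messages (< 20 s old)
    are excluded."""
    end = (int(now) // SYNC_STEP) * SYNC_STEP
    return end - window, end

# Any two peers calling this within the same 20 s slot derive the exact same
# sub-range, so a server knows which SubRange it may be asked to prepare.
```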
Yes, or we could change the API Negentropy provides. The tree already stores elements in order; it would be trivial to impl. "erase elements older than X" IMO.
The cost is multiple syncs and the logic to handle ...
The cost is either doing a store query for the list of timestamps and hashes to remove, or implementing a new prune function in Negentropy.
Even if we don't do that, syncing a couple of extra messages is not a problem.
No, sub-range is the intended solution.
No way it would be that inefficient (famous last words).
It would take longer to query the DB than to build the tree, I would bet.
I too had this thought initially, but considering it is a B-tree, I don't think it would be that trivial, as removal of entries requires rebalancing of the tree and I am not sure if pruning is going to be as simple. Also, another thing to note is that once we move to DB-backed storage such as LMDB, each node will have to be removed separately, which would be its own operation, and I am not sure if there would be a simple way to prune the tree. But then again, there could be an algorithm to prune a B-tree without removing elements 1 by 1. Could not find anything in a quick search. Will spend some more time on this to see if there is any other alternative.
We could build our own storage impl. if deleting KVs is too costly. Some kind of growable ring buffer maybe?
Each delete operation on a B-tree is supposed to be O(log n). But if we delete 1 by 1 in sequence, it will just result in a lot of tree rebalances. That, coupled with the additions happening to the tree in parallel due to messages being received by the node, would definitely require synchronization of these operations: i.e. during a prune, which will be a somewhat longer operation, additions either have to be blocked or deletion will have to happen in batches, with additions queued during the deletion of a batch. This looks like it will lead to more complexity. Also, considering our final solution will be to use a database (either LMDB or something else), having to delete entries would definitely be more costly than maintaining multiple storages. How about an alternative approach that mixes multiple storages and SubRanges? What if we maintain a separate storage for a longer duration (maybe a few hours) and use SubRanges for syncing using the latest storage? This would have the following benefits:
Also, I was taking another look at the code of StorageManager (which rotates and manages storage) and it doesn't look too complex to maintain.
The problem is that the storage impl. is not adapted to our use case. The good news is that RBSR is agnostic to how elements are stored. Instead of mitigating the problem at a higher layer, let's actually fix the problem. The first and easiest thing to impl. would be to delete elements 1 by 1 so that the POC is somewhat usable in a long-running node (inefficient as it may be) soon. We then follow with storage rotation to mitigate the problem, but we also work on an actual fix in the form of a new storage impl. At that point, we could maybe benefit from swapping the C++ code for Rust or Nim since we have to impl. a new storage anyway, and it would make interop easier too.
It does indeed seem like the storage implementation is too limited. However, if we can get the simplest POC solution implemented (even if it does a fairly expensive erase operation) we can evaluate the actual impact of the three avenues available:
Not sure if that is completely true. I mean, from a protocol design point of view it is. IIRC, it also considers a form of hashing that is tree-friendly, so that adding a single element to the tree doesn't result in recomputing hashes all over again for the whole tree. The underlying persistence of the tree is flexible in the sense that we can easily store the tree in a Postgres DB. I had noticed that this is going to be a very easy job after going through the tree implementation. But changing the data structure itself for storage is not going to be an easy task, I think.
Agreed, and this would be easier done from the top level so that any synchronization can be done at the application level (i.e. in this case the StoreSync level).
In the context of the recent focus on reliability, I've been thinking of more use cases for RBSR. We could replace light push and reliability checks with syncing. Since we need to verify that the messages have been sent correctly, a sync session could be used to do the "push" and the verification at the same time. Node A would put a new message in its own local cache then initiate a sync with node B. This would verify that the previous messages were sent correctly and also send the new message at the same time. Node B would have to keep a sub-set for each of its light push clients.

The same logic and flow could also be used to replace the Filter protocol. A sync could be initiated from node B when it has received one or more messages that are interesting for one of its light clients (i.e. part of their sub-set). If we assume that nodes, light and "full", continuously sync with each other at opportune moments and each keep a set of the messages they are interested in, the same code can be used everywhere.

With the current impl. of Waku Sync it would be possible but awkward, as it is not possible to efficiently create sub-sets; each cache would have to be managed on its own and duplicate data. A better impl. would allow the creation of lightweight sub-sets or syncing with sets sorted by more than just time (probably content topic). This would preserve the status quo in terms of privacy and increase bandwidth slightly while reducing overall complexity.
Waku Sync
In the Waku family of protocols, Relay is responsible for the dissemination of messages but cannot guarantee delivery. The push-based propagation (PubSub) of Relay can be augmented with a pull-based method, in our case Sync. Its purpose is to help with network-wide consistency of the set of messages stored. The Store protocol is also related, as it is used to query peers for messages.
Protocol
The Sync protocol will be used to merge the sets of messages two Store nodes possess. As not all nodes are interested in all messages, it is important for Sync to take different attributes into consideration: content topics, timestamps, shards and message hashes. Message transmission can be left to Store; synchronizing hashes will suffice. Sync is also not concerned with how messages are stored or their validity; syncing with trusted peers is assumed.
Solutions
The basic requirement is to find the differences between 2 sets of message hashes. In addition to this, limiting the operation to messages with certain attributes would fit common use cases where only a sub-set of message hashes needs to be synced. The bandwidth, the number of requests and responses, and the efficiency of the computation are all factors to consider, but we can expect network latency to be the biggest performance factor.
Prolly Trees
At the highest level, a Prolly tree is an immutable, ordered collection of either keys and values or keys only (a set). The tree is constructed in a similar fashion to Merkle trees but does not care about the order of operations. The same elements will always result in the same tree. There is currently no known general-purpose Prolly tree implementation we could use.
Tree Structure
Starting at the leaves, all the elements are first ordered by their keys and then fed into a chunking algorithm. This algorithm decides the size of each leaf node by selecting elements as boundaries. All the boundary elements are then ordered again and fed to the algorithm. This process is repeated for each level of the tree until only one node remains, the root. To allow traversal of the tree, non-leaf node elements store the hash of the lower-level node they are the boundary of.
The number of elements a node stores is probabilistic and varies based on the choice of chunking algorithm and its parameters.
E.g.
lvl 2: AQ -> each element stores the hash of a node below
lvl 1: AEHM, QUX -> store hashes of the nodes below
lvl 0: ABCD, EFG, HIJKL, MNOP, QRST, UVW, XYZ -> each element stores values
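To make the chunking step concrete, here is a minimal sketch of content-defined boundary selection for one level; the hash-threshold scheme and target size are assumptions for illustration, not a specific implementation:

```python
import hashlib

def chunk_level(keys, target_size=4):
    """Group sorted keys into nodes; a key is a boundary when the hash of the
    key falls below a threshold, giving nodes of `target_size` on average."""
    threshold = 2**32 // target_size
    nodes, current = [], []
    for key in sorted(keys):
        current.append(key)
        digest = hashlib.sha256(key.encode()).digest()
        if int.from_bytes(digest[:4], "big") < threshold:
            nodes.append(current)  # `key` is a boundary element: close the node
            current = []
    if current:
        nodes.append(current)      # trailing node without a boundary element
    return nodes

# Building the tree: feed the boundary keys of each level back into
# chunk_level() until a single node (the root) remains. Because boundaries
# depend only on the keys themselves, the same elements always produce the
# same tree, regardless of insertion order.
```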
Set Operations
To efficiently find the differences between 2 trees, the key order and node hash equivalence are used to determine which paths to follow. Starting at the root, if 2 elements have the same hash they can be ignored. If the hashes differ, the elements of the nodes at the lower level have to be checked recursively. The elements are checked in order and are added to the differences if they are not shared between the 2 trees.
References
Future Work
Prolly trees can be used to search vast amounts of data efficiently. With this technology we could build a database that is provable, replicable and decentralized. See this experiment.
Range Based Set Reconciliation
A good way to think about RBSR is to imagine an iterative binary search between 2 peers. Keys and values are first ordered, and then ranges of elements are hashed and exchanged between the peers. Identical ranges are ignored and differing ones are sub-divided into smaller ranges recursively. Current implementations include C++, JS and Rust. JS and Rust only support in-memory vectors as storage, with C++ adding B-trees and LMDB.
Divide and Conquer
Actual implementations can split ranges into more than 2 sub-ranges. There is a trade-off: sending fewer sub-ranges per round keeps payloads smaller but increases the number of round trips between the 2 peers, while sending more ranges at once results in bigger payloads with fewer round trips. Ranges don't have to be of identical size either; if payloads are limited to a certain size, ranges can be chosen to stay under the imposed limit.
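A conceptual sketch of one reconciliation round under these assumptions (plain XOR fingerprints, a fixed branching factor, no wire format); this is not the actual Negentropy implementation:

```python
from functools import reduce

BRANCHING = 4  # how many sub-ranges a differing range is split into

def fingerprint(items):
    """XOR the element hashes of a range together (items are (timestamp, hash))."""
    return reduce(lambda acc, pair: acc ^ pair[1], items, 0)

def split(items):
    """Split a sorted range into roughly BRANCHING equal sub-ranges."""
    step = max(1, (len(items) + BRANCHING - 1) // BRANCHING)
    return [items[i:i + step] for i in range(0, len(items), step)]

def process_range(local_items, remote_fp):
    """One round: skip matching ranges, sub-divide differing ones, and fall
    back to exchanging raw hashes once a range is small enough."""
    if fingerprint(local_items) == remote_fp:
        return []                # identical range: nothing to reconcile
    if len(local_items) <= BRANCHING:
        return [local_items]     # small range: exchange the item hashes directly
    return split(local_items)    # differing range: recurse on the sub-ranges
```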
Hashing
Incremental hashing is used: instead of recomputing the hash of a range each time, the hashes of the elements are XORed together. This method makes inserting or removing elements more efficient when using a tree-based storage: hash the element to insert, then XOR it with the other hashes. Deletions are done by hashing the element, reversing the bits, then XORing with the other hashes.
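Taking the XOR description above at face value, here is a minimal sketch of an incremental fingerprint. Note that with plain XOR, removing an element is the same operation as adding it (XOR is its own inverse); schemes that add hashes modulo 2^n instead negate the hash (invert the bits and add one) before mixing it in, which may be what the bit reversal above refers to.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class RangeFingerprint:
    """Incremental range fingerprint built by XOR-ing 32-byte element hashes."""

    def __init__(self, size=32):
        self.value = bytes(size)  # fingerprint of the empty range

    def add(self, element_hash: bytes) -> None:
        self.value = xor_bytes(self.value, element_hash)

    def remove(self, element_hash: bytes) -> None:
        # XOR-ing the same hash again cancels it out; no re-hashing of the
        # remaining elements is needed.
        self.value = xor_bytes(self.value, element_hash)
```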
References
Future work
Multiple ranges can be combined to sync specific sub-sets of elements. See this experiment. Set reconciliation could be done with more than 3 "dimensions".
Prollies VS RBSR
State
To sync 2 Prolly trees, a snapshot of both sides needs to be maintained while the sync is ongoing. This is made easier by the fact that Prolly trees are immutable, but it also means that more state needs to be stored. Some kind of garbage collector would be needed to clean up old versions of the local tree that are no longer in use. Updating the local tree would also require a message buffer so that, for efficiency, updates are done in batches of new messages instead of on every new message.
This delay between new messages being added and synced is not a problem for RBSR. While the set reconciliation is taking place, new messages can be added to the underlying storage. If any new messages fall in a range that is yet to be sent, they will be synced the same way as if they had been there before the reconciliation started.
Also, Prolly trees are only probabilistically balanced, while with RBSR the tree storage can be rebalanced and packed because it is not part of the protocol.
Multi-party Sync
Both methods could sync with multiple peers at the same time.
For RBSR, a client would send the initial payloads to multiple "servers" and then respond to each server payload independently. Care must be taken if one wants to differentiate which "server" has which hashes; if not, a single list of missing hashes is populated.
For Prolly trees, every "server" would walk its own tree and send the differences to the "client". In turn, the client would have to remember each "server's" differences and walk each path separately. Because the trees are immutable, each sync would be independent of the others.
Overall, RBSR multi-party sync is simpler because less state is required to track each sync "session".
Ease of development
RBSR is already implemented in the form of the Negentropy protocol. Some work would be needed to adapt it to our needs, but much less than for Prolly trees, for which no implementation is available.