
Waku Sync #80

Open
Tracked by #104
SionoiS opened this issue Jan 31, 2024 · 40 comments

Comments

@SionoiS

SionoiS commented Jan 31, 2024

Waku Sync

In the Waku family of protocols, Relay is responsible for the dissemination of messages but cannot guarantee delivery. The push-based propagation (PubSub) of Relay can be augmented with a pull-based method, in our case Sync. Its purpose is to help with network-wide consistency of the set of stored messages. The Store protocol is also related, as it is used to query peers for messages.

Protocol

The Sync protocol will be used to merge the sets of messages that two Store nodes possess. As not all nodes are interested in all messages, it is important for Sync to take different attributes into consideration: content topics, timestamps, shards and message hashes. Message transmission can be left to Store; synchronizing hashes will suffice. Sync is also not concerned with how messages are stored or with their validity; syncing with trusted peers is assumed.

Solutions

The basic requirement is to find the differences between 2 sets of message hashes. In addition, limiting this operation to messages with certain attributes would fit common use cases where only a subset of message hashes needs to be synced. The bandwidth, the number of requests and responses, and the efficiency of the computation are all factors to consider, but we can expect network latency to be the biggest performance factor.

Prolly Trees

At the highest level, a Prolly tree is an immutable, ordered collection of either keys and values or keys only (a set). The tree is constructed in a similar fashion to Merkle trees but does not depend on the order of operations: the same elements will always result in the same tree. There is currently no known general-purpose Prolly tree implementation we could use.

Tree Structure

Starting at the leaves, all the elements are first ordered by their keys and then fed into a chunking algorithm. This algorithm decides the size of each leaf node by selecting certain elements as boundaries. All the boundary elements are then ordered again and fed to the algorithm. This process is repeated for each level of the tree until only one node remains, the root. To traverse the tree, each non-leaf node element stores the hash of the lower-level node it is a boundary of.
The number of elements a node stores is probabilistic and varies based on the choice of chunking algorithm and its parameters.

e.g.
lvl 2: AQ -> each element stores the hash of a node below
lvl 1: AEHM, QUX -> store hashes of the nodes below
lvl 0: ABCD, EFG, HIJKL, MNOP, QRST, UVW, XYZ -> each element stores values
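A minimal Python sketch of the boundary-based chunking described above (illustrative only; the hash-threshold boundary rule and the target node size of 4 are assumptions, not taken from any existing Prolly tree implementation):

```python
import hashlib

def is_boundary(key: bytes, level: int, target_size: int = 4) -> bool:
    # An element closes a node if the hash of (level, key) falls below a
    # threshold; on average one element in `target_size` qualifies, so node
    # sizes are probabilistic with mean `target_size`.
    digest = hashlib.sha256(level.to_bytes(4, "big") + key).digest()
    return int.from_bytes(digest[:4], "big") < 2**32 // target_size

def build_level(keys: list[bytes], level: int) -> list[list[bytes]]:
    # Group the ordered keys into nodes, closing a node at every boundary key.
    nodes, current = [], []
    for key in sorted(keys):
        current.append(key)
        if is_boundary(key, level):
            nodes.append(current)
            current = []
    if current:
        nodes.append(current)
    return nodes

def build_tree(keys: list[bytes]) -> list[list[list[bytes]]]:
    # Re-chunk the boundary keys of each level until a single root node remains.
    levels = [build_level(keys, 0)]
    while len(levels[-1]) > 1:
        boundaries = [node[-1] for node in levels[-1]]
        levels.append(build_level(boundaries, len(levels)))
    return levels
```

Because the boundaries depend only on the keys themselves, the same input set always produces the same node layout, which is what makes two independently built trees comparable.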

Set Operations

To efficiently find the differences between 2 trees, the key ordering and node hash equivalence are used to determine which paths to follow. Starting at the root, if 2 elements have the same hash they can be ignored. If the hashes differ, the elements of the nodes at the lower level have to be checked recursively. The elements are checked in order and are added to the differences if they are not shared between the 2 trees.
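A rough sketch of that traversal, assuming a simplified node type where an internal node's digest covers its children; the pairwise child alignment is a simplification (a real implementation aligns children by key range):

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list[bytes]                                      # element keys (leaves)
    children: list["Node"] = field(default_factory=list)   # empty for leaf nodes

    def digest(self) -> bytes:
        if not self.children:
            return hashlib.sha256(b"".join(self.keys)).digest()
        return hashlib.sha256(b"".join(c.digest() for c in self.children)).digest()

def diff(a: "Node | None", b: "Node | None", out: set[bytes]) -> None:
    # Collect keys present in `a` but missing from `b` (one direction only).
    if a is None:
        return
    if b is not None and a.digest() == b.digest():
        return  # identical sub-trees: nothing to compare below this point
    if not a.children:
        b_keys = set(b.keys) if b is not None else set()
        out.update(k for k in a.keys if k not in b_keys)
        return
    b_children = b.children if b is not None else []
    for i, child in enumerate(a.children):
        diff(child, b_children[i] if i < len(b_children) else None, out)
```

Running it in both directions yields the full symmetric difference.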

References

Future Work

Prolly trees can be used to search vast amounts of data efficiently. With this technology we could build a database that is provable, replicable and decentralized. See this experiment.

Range Based Set Reconciliation

A good way to think about RBSR is to imagine an iterative binary search between 2 peers. Keys and values are first ordered, and then ranges of elements are hashed and exchanged between the peers. Identical ranges are ignored and differing ones are sub-divided into smaller ranges recursively. Current implementations include C++, JS and Rust. JS and Rust only support in-memory vectors as storage, with C++ adding B-trees and LMDB.

Divide and Conquer

Actual implementations can split ranges into more than 2 sub-ranges. This allows a trade-off: sending fewer sub-ranges keeps payloads small but increases the number of round trips between the 2 peers, while sending more sub-ranges results in bigger payloads with fewer round trips. Ranges don't have to be identical in size either; if payloads are limited to a certain size, ranges can be chosen to stay under the imposed limit.
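A sketch of that trade-off with hypothetical names; the 32-byte XOR-then-hash fingerprint is an assumption for illustration, not the exact scheme used by any implementation:

```python
from hashlib import sha256

def fingerprint(hashes: list[bytes]) -> bytes:
    # Combine the element hashes with XOR, then harden the combined value.
    acc = bytes(32)
    for h in hashes:
        acc = bytes(x ^ y for x, y in zip(acc, h))
    return sha256(acc).digest()

def split_range(hashes: list[bytes], branching: int, max_payload: int) -> list[tuple[int, int, bytes]]:
    # Split an ordered range into at most `branching` sub-ranges, but never
    # send more fingerprints than fit in `max_payload` bytes: more sub-ranges
    # mean fewer round trips at the cost of a bigger payload.
    branching = min(branching, max(1, max_payload // 32))
    step = max(1, -(-len(hashes) // branching))  # ceiling division
    return [
        (start, min(start + step, len(hashes)), fingerprint(hashes[start:start + step]))
        for start in range(0, len(hashes), step)
    ]
```

Tuning `branching` and `max_payload` is how an implementation trades payload size against round trips without changing the protocol itself.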

Hashing

Incremental hashing is used: instead of recomputing the hash of a range each time, the hashes of the elements are XORed together. This method makes inserting or removing elements more efficient when using a tree-based storage. To insert, hash the element and XOR it with the other hashes. Deletions are done by hashing the element, reversing its bits, then XORing with the other hashes.
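A minimal illustration of the incremental idea. Note that with the plain XOR accumulator sketched here, removal is literally the same operation as insertion, since XOR is its own inverse; Negentropy's actual fingerprint adds hardening on top and is not reproduced here:

```python
class IncrementalFingerprint:
    # Running fingerprint over a set of fixed-size message hashes.
    def __init__(self, size: int = 32) -> None:
        self.acc = bytearray(size)

    def insert(self, msg_hash: bytes) -> None:
        # Fold the new hash into the accumulator; O(hash size), independent of
        # how many elements the range already contains.
        for i, byte in enumerate(msg_hash):
            self.acc[i] ^= byte

    # XOR is self-inverse, so removing an element re-applies the same fold.
    remove = insert

    def value(self) -> bytes:
        return bytes(self.acc)
```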

References

Future work

Multiple ranges can be combined to sync specific subsets of elements. See this experiment. Set reconciliation could also be done with more than 3 "dimensions".

Prolly Trees vs RBSR

State

To sync 2 Prolly trees, a snapshot of both sides needs to be maintained while the sync is ongoing. This is made easier by the fact that Prolly trees are immutable, but it also means that more state needs to be stored. Some kind of garbage collector would be needed to clean up old versions of the local tree that are no longer in use. Updating the local tree would also require a message buffer so that, for efficiency, updates are done in batches of new messages instead of on every new message.

This delay between new messages being added and synced is not a problem for RBSR. While the set reconciliation is taking place, new messages can be added to the underlying storage. If any new message falls in a range that has yet to be sent, it will be synced the same way as if it had been there before the reconciliation started.

Also, Prolly trees are only probabilistically balanced, whereas for RBSR the tree storage can be rebalanced and packed because it is not part of the protocol.

Multi-party Sync

Both methods could sync with multiple peers at the same time.

For RBSR, a client would send the initial payloads to multiple "servers" and then respond to each server's payload independently. Care must be taken if one wants to differentiate which "server" has which hashes; otherwise a single list of missing hashes is populated.

For Prolly trees, every "server" would walk its own tree and send the differences to the "client". In turn, the client would have to remember each server's differences and walk each path separately. Because the trees are immutable, each sync would be independent of the others.

Overall, RBSR multi-party sync is simpler because less state is required to track each sync "session".

Ease of development

RBSR is implemented in the form of the Negentropy protocol. Some work would be needed to adapt it to our needs, but much less than for Prolly trees, for which no implementation is available.

@SionoiS SionoiS self-assigned this Jan 31, 2024
@jm-clius
Contributor

jm-clius commented Feb 8, 2024

As not all nodes are interested in all messages it is important for Sync to take into consideration different attributes; content topics, timestamps, shards and message hashes.

I would focus for now on a simplifying assumption that Sync is between two nodes that already "know" they want to reconcile all cached message hashes - in other words, sync everything. Not only is this how Sync will first function (with the exception of perhaps a sync list per shard), but it focuses our initial efforts into a building block that can be adapted for all these use cases.

Sync is also not concerned with how data is stored

Well, it does assume a key-value store with unique addressable keys, message hashes in our case.

instead of recomputing all hashes each time, the hash of each elements are XOR together.

But when a diff is found the fingerprint for the "dynamic" subranges would have to be recomputed each time, or am I missing something? This remains my main concern for RBSR scalability if frequently syncing with multiple peers for an ever-growing data set. Not a showstopper, but scalability may be something to model or explicitly verify in experiments.

Overall, RBSR multi-party sync is simpler

In our case, assuming regular syncing, most reconciliation would happen on the "right side" for the most recent messages. Presumably this has approximately equal impact on RBSR and Prolly Trees (?), but worth keeping this in mind when considering options.

RBSR are implemented in the form of the Negentropy protocol

If this is a simple(ish) integration, this would already be good enough reason to build POC using RBSR above Prolly Trees IMO.

@SionoiS
Author

SionoiS commented Feb 8, 2024

I would focus for now on a simplifying assumption that Sync is between two nodes that already "know" they want to reconcile all cached message hashes - in other words, sync everything. Not only is this how Sync will first function (with the exception of perhaps a sync list per shard), but it focuses our initial efforts into a building block that can be adapted for all these use cases.

💯

Sync is also not concerned with how data is stored

Well, it does assume a key-value store with unique addressable keys, message hashes in our case.

RBSR can function well without a DB. I'll rephrase; I meant it in the sense that Sync only deals in hashes, not the messages themselves.

But when a diff is found the fingerprint for the "dynamic" subranges would have to be recomputed each time, or am I missing something? This remains my main concern for RBSR scalability if frequently syncing with multiple peers for an ever-growing data set. Not a showstopper, but scalability may be something to model or explicitly verify in experiments.

Messages are already hashed; at no point is hashing repeated. The "hash" of a range is the XOR (plus hardening) of all the message hashes.

In our case, assuming regular syncing, most reconciliation would happen on the "right side" for the most recent messages. Presumably this has approximately equal impact on RBSR and Prolly Trees (?), but worth keeping this in mind when considering options.

Yes, good point! RBSR can limit the sync to the most recent messages as well as Prolly trees.

RBSR are implemented in the form of the Negentropy protocol

If this is a simple(ish) integration, this would already be good enough reason to build POC using RBSR above Prolly Trees IMO.

I estimate that ~4-6 months of work would be required to make a Prolly tree Sync. If we use Negentropy 3-4 weeks might be enough.

@chaitanyaprem

chaitanyaprem commented Feb 14, 2024

This method makes inserting or removing elements more efficient when using a tree-based storage. Hash the element to insert then XOR with the other hashes. Deletions are done by hashing the elements, reversing the bits then XOR with other hashes.

In the case of Waku Sync, deletion would not be a use case, right? Only stored messages are being synced across, and a node will only ever be missing messages or have additional ones.
If it is missing some, it needs to sync those from the peer; if it has extra, the peer needs those messages.
Or do you see some scenario where something already existing has to be deleted during sync?

@jm-clius
Contributor

deletion would not be a use-case right

Correct. No deletion (only removing oldest entries based on dimensioning) and no modification of entries.

@SionoiS
Author

SionoiS commented Mar 20, 2024

I've been thinking of adding a new protocol. Let's call it StoreSync. It would be useful in the case where Sync returns "have" hashes. ATM, we can't send or tell the other node about those. Also, we can't negotiate the sync range or frame size between nodes.

The new protocol would be used for this. The idea would be to send our desired max frame size and sync range, then wait for a response. We then compare the response with ours, pick the smallest of the values and initiate the sync. Once Sync is done, we send the "have" hashes to the other node, if any, then we request our "need" hashes from the other node via Store.
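A sketch of how that negotiation could look; the field names and the choice to intersect the two requested time ranges (one reading of "pick the smallest of the values") are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SyncParams:
    range_start: int      # unix timestamp, inclusive
    range_end: int        # unix timestamp, exclusive
    max_frame_size: int   # bytes

def negotiate(ours: SyncParams, theirs: SyncParams) -> Optional[SyncParams]:
    # Both peers derive the same result: the overlap of the requested time
    # ranges and the smaller of the two frame-size limits.
    start = max(ours.range_start, theirs.range_start)
    end = min(ours.range_end, theirs.range_end)
    if start >= end:
        return None  # no overlapping range, nothing to sync
    return SyncParams(start, end, min(ours.max_frame_size, theirs.max_frame_size))
```

Both sides can run the same function on the exchanged payloads and arrive at identical parameters without an extra round trip.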

@jm-clius
Contributor

it would be useful in the case when Sync returns have hashes. ATM, we can't send or tell the other node about those.

Not sure I follow - isn't this exactly what negentropy sync does? return a diff list of message hashes?

we can't negotiate sync range or frame size between nodes.

Perhaps sync range can be a network specification, but indeed - we probably need two nodes to agree on the exact time range to sync for this to work. However, this sounds to me like an extension of the Sync protocol itself rather than a new protocol? (I.e. a wrapper and extension on negentropy)? What do you mean by "frame size"?

@SionoiS
Author

SionoiS commented Mar 20, 2024

Not sure I follow - isn't this exactly what negentropy sync does? return a diff list of message hashes?

Yes, but there are 2 lists: the hashes I want, and the hashes the other node needs.

Perhaps sync range can be a network specification, but indeed - we probably need two nodes to agree on the exact time range to sync for this to work. However, this sounds to me like an extension of the Sync protocol itself rather than a new protocol? (I.e. a wrapper and extension on negentropy)? What do you mean by "frame size"?

Maybe just an extension would work, and maxFrameSize is the param that can limit the size of payloads.

@chaitanyaprem

chaitanyaprem commented Mar 21, 2024

Perhaps sync range can be a network specification, but indeed - we probably need two nodes to agree on the exact time range to sync for this to work. However, this sounds to me like an extension of the Sync protocol itself rather than a new protocol? (I.e. a wrapper and extension on negentropy)? What do you mean by "frame size"?

I too think this should be the approach, wherein we just create a wrapper protocol inside which the negentropy payload is a field. Other fields can be the time range of the sync and max-frame-size for now.
In future we can include parameters like shards and/or content topics as well.

If this is the case i am wondering if the protocol sequence would be something like this:

  1. Client sends a request to sync with params like maxFrameSize, time range etc.
  2. Server acks the params that it can support sync for.
  3. Client would then start the negentropy sync based on the agreed params, and then client and server would continue as they do now.

@SionoiS In this case, we may want the server side to also create a new negentropy instance for each sync request it handles, because the frameSize is set during creation of the negentropy instance (similar to how we are doing for each sync method call on the client side). The storage would remain the same across all instances.
But this may not be required to be implemented immediately, as the first usage would be for TWN where we would expect this param to be hard-coded for the network.

@chaitanyaprem

Yes but there's 2 lists; hashes I want, hashes the other node needs.

But this doesn't need to be communicated separately, right? During the negentropy sync procedure itself, both client and server can know what hashes they need and the other has. This can be achieved by simply using the reconcile method that takes in have and need hashes on the server side as well. We don't need to communicate this separately from client to server.

@chaitanyaprem

chaitanyaprem commented Mar 21, 2024

I've been thinking of adding a new protocol. Let's call it StoreSync. it would be useful in the case when Sync returns have hashes. ATM, we can't send or tell the other node about those. Also, we can't negotiate sync range or frame size between nodes.

I think the StoreSync protocol should define something like below:

  • Sync protocol would be used for syncing between store nodes periodically and sync-time should not be more than x hours
  • Any node that falls behind more than x hours should not use the sync protocol, rather use historyQuery to fetch the messages and then in-turn start using sync protocol for periodic syncing. Note that once incentivization of Store protocol is implemented, then historyQuery would involve payment to the node from which history is being queried.
  • It can define parameters like how many store nodes to sync with periodically. It can keep track of various sync scopes and periodicity for each of them in future (like which shards to be synced with which nodes etc).
  • It would also manage negentropy storage i.e should be maintained on x time period basis and delete it.
  • It would also define when sync for a certain time-range should be initiated so that we don't duplicate work done at gossip layer where nodes ask each other for messages.
  • It should also maintain some sort of a state as to what was the last point that was synced (with multiple nodes), i.e. no need to sync a particular range etc.
  • rate-limiting sync requests to handle (Maybe this would be part of sync protocol itself)

@SionoiS
Author

SionoiS commented Mar 21, 2024

Yes but there's 2 lists; hashes I want, hashes the other node needs.

But this doesn't need to be communicated separately right. During the negentropy sync procedure itself both client and server can know what hashes they need and the other have. This can be achieved by simple using the reconcile method that takes in have and needHashes on server-side as well. We don't need to communicate this separately from client to server.

Are you sure? From what I can see of the code, only the initiator can use reconcile with lists as output. This restriction seems arbitrary, but there must be a good reason for it...

Maybe we could try making some modification to the original code?

@SionoiS
Author

SionoiS commented Mar 21, 2024

What I see is that if we want to verify the messages we receive, we can't just use Sync and Store; it fits neither scope of responsibility. That is why I was suggesting a new protocol.

Maybe RLN-Relay could expose a proc that can be used to verify any messages? We can't verify them without RLN anyway.

@chaitanyaprem

What I see is that if we want to verify the messages we receive, we can't just use Sync and Store; it fits neither scope of responsibility. That is why I was suggesting a new protocol.

Maybe RLN-Relay could expose a proc that can be used to verify any messages? We can't verify them without RLN anyway.

I too agree we need a StoreSync protocol a layer above sync protocol. I tried to list down its functions here.

But I am wondering if message verification can be part of the Store protocol, as any client requesting messages from a Store node can verify if needed. But then again, if a client (other than a Store node) is requesting, it can validate using application logic as well. So maybe we can keep it restricted to StoreSync to start with.

@chaitanyaprem

chaitanyaprem commented Mar 22, 2024

Yes but there's 2 lists; hashes I want, hashes the other node needs.

But this doesn't need to be communicated separately right. During the negentropy sync procedure itself both client and server can know what hashes they need and the other have. This can be achieved by simple using the reconcile method that takes in have and needHashes on server-side as well. We don't need to communicate this separately from client to server.

Are you sure? From what I can see of the code, only the initiator can use reconcile with lists as output. This restriction seems arbitrary, but there must be a good reason for it...

Maybe we could try making some modification to the original code?

Ok, I see it now in the code. But I'm wondering what the reason could be to differentiate initiator and non-initiator in this case.

Looks like, as per negentropy, the initiator is the one that informs the other side of elements the other side doesn't have.
Quoting from https://logperiodic.com/rbsr.html#negentropy

The initiator notices that it was missing one element, and adds it to its set (illustrated by colouring blue). It also notices that it has two elements that were not sent by the remote side, so it sends these elements and the remote side adds them to its set:

Maybe it is better not to modify this and rather take the approach you have suggested, i.e. the sync protocol can then inform the peer that it has haveIds which the peer doesn't have.

@SionoiS
Author

SionoiS commented Mar 25, 2024

  • Sync protocol would be used for syncing between store nodes periodically and sync-time should not be more than x hours

Yes, but the x in this case is also a factor of the chosen archive policy. If the archive keeps only the last 30m then Sync can't cover 1h.
We can either assume Store keeps more than x hours of messages OR negotiate the time range before syncing.

  • Any node that falls behind more than x hours should not use the sync protocol, rather use historyQuery to fetch the messages and then in-turn start using sync protocol for periodic syncing. Note that once incentivization of Store protocol is implemented, then historyQuery would involve payment to the node from which history is being queried.

Yes, when a Store node starts it should check how long it's been offline. Maybe Store should be responsible for that?

  • It can define parameters like how many store nodes to sync with periodically. It can keep track of various sync scopes and periodicity for each of them in future (like which shards to be synced with which nodes etc).

Yes, but this seems lower priority.

  • It would also manage negentropy storage i.e should be maintained on x time period basis and delete it.

Sync should talk to Store and keep the set of messages the same. On the technical side we should use 1 DB if possible. Having an in-memory storage for Sync seems unnecessary. Let's store the negentropy tree in the DB instead.

  • It would also define when sync for a certain time-range should be initiated so that we don't duplicate work done at gossip layer where nodes ask each other for messages.

If gossip were reliable we would not have to create Sync. Work duplication seems inevitable?

  • It should also maintain some sort of a state as to what was the last point that was synced (with multiple nodes), i.e. no need to sync a particular range etc.

Yes, but for now let's just sync with a random peer at a certain interval.

  • rate-limiting sync requests to handle (Maybe this would be part of sync protocol itself)

To rate-limit Sync requests, preventing the same peer from touching the same time range would be a good thing. I feel this and the one above should be the responsibility of Sync.

@chaitanyaprem WDYT?

@SionoiS
Author

SionoiS commented Mar 25, 2024

For time range negotiation, I feel the simplest would be to always pick the smallest range.

As for the closing Sync payload, it should include the server's "need" hashes.

  1. client payload:
time range start
time range end
max frame size
  2. server payload (same as above)
  3. client uses the smallest range and initiates sub-range storage.
  4. client creates the first payload with the smallest frame size.
  5. round trips of negentropy payloads
  6. client sync is complete; the client now sends a last payload containing the server's "need" hashes, if any, then is done.
  7. server receives the hashes it needs from the client, if any, and then is done.
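A sketch of the payloads this flow implies; the names are hypothetical and `store_request` is a stand-in for whatever Store query call ends up being used:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SyncHandshake:
    # steps 1-2: exchanged in both directions before reconciliation starts
    range_start: int
    range_end: int
    max_frame_size: int

@dataclass
class SyncDone:
    # step 6: final client payload carrying the hashes the server is missing
    server_need_hashes: list[bytes] = field(default_factory=list)

def on_sync_done(msg: SyncDone, store_request: Callable[[list[bytes]], None]) -> None:
    # step 7, server side: fetch the messages it is missing from the client
    if msg.server_need_hashes:
        store_request(msg.server_need_hashes)
```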

@chaitanyaprem

@chair28980 This should fall under Sync epic, not the Store v3 message hashes.

@chaitanyaprem

chaitanyaprem commented Mar 26, 2024

  • Sync protocol would be used for syncing between store nodes periodically and sync-time should not be more than x hours

Yes, but the x in this case is also a factor of the chosen archive policy. If archive keep the last 30m then Sync can't be 1h. We can either assume Store keep more than x hours of messages OR negotiate the time range before syncing.

Agreed, it should definitely consider the archive policy and should not overshoot it.

  • Any node that falls behind more than x hours should not use the sync protocol, rather use historyQuery to fetch the messages and then in-turn start using sync protocol for periodic syncing. Note that once incentivization of Store protocol is implemented, then historyQuery would involve payment to the node from which history is being queried.

Yes when a Store node start it should check how long it's been offline. Maybe Store should be responsible for that?

I think this should be the responsibility of StoreSync. I am looking at StoreSync as a protocol that helps store nodes be in sync with each other. So it would indicate how Store nodes can stay in sync considering various scenarios such as this one.

  • It can define parameters like how many store nodes to sync with periodically. It can keep track of various sync scopes and periodicity for each of them in future (like which shards to be synced with which nodes etc).

Yes, but this seems lower priority.

Agreed, this doesn't need to be part of the first iteration of StoreSync. But I see that this would be required eventually.

  • It would also manage negentropy storage i.e should be maintained on x time period basis and delete it.

Sync should talk to Store and keep the set of messages the same. On the technical side we should use 1 DB if possible. Having an in-memory storage for Sync seems unnecessary. Let's store the negentropy tree in the DB instead.

  • It would also define when sync for a certain time-range should be initiated so that we don't duplicate work done at gossip layer where nodes ask each other for messages.

If gossip were reliable we would not have to create Sync. Work duplication seems inevitable?

To start with, maybe we can just let Sync run in parallel to gossip and see. But I see that this would just add redundant traffic in the network, which is why I think we should try to optimize/improve this so that Sync doesn't kick in for messages that are in flight due to gossipsub's own IWANT/IHAVE interaction. But then again, this is more of an optimization and can be taken up in a second iteration.

  • It should also maintain some sort of a state as to what was the last point that was synced (with multiple nodes), i.e. no need to sync a particular range etc.

Yes, but for now lets just sync with a random peer at certain interval.

To start with this is fine. Maybe as part of next iteration we can include such capabilities.

  • rate-limiting sync requests to handle (Maybe this would be part of sync protocol itself)

To rate-limit Sync requests, preventing the same peer from touching the same time range would be a good thing. I feel, this and the one above should be the responsibility of Sync.

I agree that rate-limiting can be part of the sync protocol implementation itself.
But wrt maintaining the state of sync with a particular peer, I think it is better for that to be part of a higher-level protocol like StoreSync. Will think a little more on this though.

@chaitanyaprem WDYT?

@chaitanyaprem

  • It would also manage negentropy storage i.e should be maintained on x time period basis and delete it.

Sync should talk to Store and keep the set of messages the same. On the technical side we should use 1 DB if possible. Having an in-memory storage for Sync seems unnecessary. Let's store the negentropy tree in the DB instead.

Did not get this. Do you mean that when Store purges/deletes messages they should be deleted from the sync storage as well?
I don't know how that is currently done, but I think @Ivansete-status is working on changing this using partitions, wherein instead of deleting each message once the time limit is crossed, a partition is removed. I was thinking we should follow something similar for the Sync storage as well, i.e. maintain storage separately for different time ranges and keep deleting older storage once it crosses the time we want to sync.

Wrt storing sync data in memory, I agree that we should switch to using LMDB since it is already supported.
Since Store/Archive uses PostgreSQL, I don't think it is a good idea for Sync to use Postgres, because Sync uses a B-tree for which a key-value kind of database is better suited. So we can start by using LMDB for storing sync data and at a later stage think about which DB to use. RLN also uses another DB to store its data (which is also a tree); maybe we can use a single DB to store both sync data and RLN (not sure, this needs some study).

@chaitanyaprem

For time range nego. I feel the simplest would be to always pick the smallest range.

As for the closing Sync payload, it should include the server "need" hashes.

1. client payload
time range start
time range end
max frame size
2. server payload (same as above)

3. client use the smallest range and initiate sub-range storage.

4. client create first payload with smallest frame size.

5. round-trips of negentropy payloads

6. client sync is complete and now send last payload containing server "needs" hashes if any then done.

7. server receive hashes it needs from the client if any and then done.

Agree with this flow; I too kind of envisioned something like this, but with some differences.
We need to evaluate whether to use a sub-range for syncing or maintain different storage(s) for different time ranges.

Will analyze the pros and cons of the approaches of using a subrange vs separate storage per time period for syncData.

@chaitanyaprem

chaitanyaprem commented Mar 27, 2024

I have given some thought to using SubRanges vs using separate storages, and I think it is best we follow the approach of using separate storages.
Below is the simplest approach using separate storages and its pros.

Approach:

  1. Each node shall always create a new storage based on duration (e.g. maintain a separate storage for every 1 hour), i.e. a node shall keep creating new storages for, let's say, 12PM, 1PM etc.
  2. When a node starts for the first time, it can sync with other nodes in the network to get the data for the current hour (e.g. if a node is started at 12:10, it would sync with a peer to get messages from 12PM).
  3. We can follow an approach similar to partitioning, wherein future storages are created in advance. Messages are stored in the appropriate storage based on timestamp.

Pros of this approach:

  1. A storage can be deleted once its corresponding archive partition is removed or once it crosses the threshold for sync. (There can be cases where a node wants to archive 30 days of data but only support syncing of the most recent 2 hours of data.)
  2. We can avoid usage of sub-ranges completely with this approach, as the server side of negentropy already supports sub-ranges inherently (based on the initiator's selected range).
  3. If we take a sub-range approach, a single storage would be maintained and purging of older elements can only be done one by one, which will be time-consuming and less performant.

One change we would have to make to the flow of the Sync protocol with this approach is that the range will always be based on 1 storage duration. This is under the assumption that 2 nodes will not have to sync using a sub-range at all. This should be good to start with. Later we can enhance with the below if required:

  1. Allow a node to sync more than 1 range (i.e. more than 1 storage or 1 hour of data)
  2. Allow a node to sync less than 1 hour/storage of data (for this a sub-range would be required)

I had thought briefly about the alternate approach of using sub-ranges and I see the following challenges:

  1. If we maintain a single storage and use a sub-range every time for syncing, then deleting items from storage would be costly as each item has to be deleted separately. This would be a challenge because we need to know the hash and timestamp of each item to remove it. Currently the archive deletes items from the DB with a single time-range query, not item by item.
  2. It would lead to more complex flows to handle, which may not be required to begin with for StoreSync.
  3. In future, when we want to store based on pubsubTopic and/or contentTopic, this approach would not be useful as separate storages would be required anyway at that time. This is because sync would be done for a specific pubsubTopic or a specific list of contentTopics.

Hence let us proceed with the simplified approach of having a mechanism to keep creating new storages and deleting old ones (once they go out of the sync duration). We can also restrict sync to a single storage size per session.
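A minimal sketch of the rotation described above, using plain in-memory sets as stand-ins for the per-hour negentropy storages (the names and the 2-hour retention are illustrative assumptions):

```python
import time
from typing import Optional

class RotatingSyncStorage:
    # One storage bucket per wall-clock hour; dropping an hour of history is a
    # single deletion instead of erasing message hashes one by one.
    def __init__(self, retention_hours: int = 2) -> None:
        self.retention_hours = retention_hours
        self.buckets: dict[int, set[bytes]] = {}  # hour epoch -> message hashes

    @staticmethod
    def _hour(timestamp: float) -> int:
        return int(timestamp) // 3600

    def insert(self, msg_hash: bytes, timestamp: float) -> None:
        self.buckets.setdefault(self._hour(timestamp), set()).add(msg_hash)

    def prune(self, now: Optional[float] = None) -> None:
        # Drop whole buckets that fell out of the sync window.
        cutoff = self._hour(now if now is not None else time.time()) - self.retention_hours
        for hour in [h for h in self.buckets if h < cutoff]:
            del self.buckets[hour]

    def bucket_for(self, timestamp: float) -> set[bytes]:
        # A sync session is restricted to the single bucket covering its range.
        return self.buckets.get(self._hour(timestamp), set())
```

Deleting an hour of history then costs one dictionary deletion, mirroring how dropping an archive partition avoids per-message deletes.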

@chaitanyaprem

chaitanyaprem commented Mar 27, 2024

We can follow an approach similar to partition where-in the future storages are created in advance. Messages are stored in appropriate storage based on timestamp.

On this note, maybe we should also change the archive partitioning logic to be in sync with this, wherein partitions are created based on absolute hours of time such as 1PM, 2PM etc. rather than relative to the node start time. Then we can get events from the archive when a partition is being deleted, and its associated sync storage can also be deleted.

cc @NagyZoltanPeter @Ivansete-status

@SionoiS
Author

SionoiS commented Mar 27, 2024

I am assuming we can trust them since they are meant to be validly published during that epoch for which a user has permission to send.

Good point! I'll read more on RLN; I don't know what exactly it verifies wrt timing.

edit: https://github.com/vacp2p/rfc-index/blob/main/vac/32/rln-v1.md#external-nullifier

@SionoiS
Author

SionoiS commented Apr 22, 2024

I've been thinking, and talked with @jm-clius, about sub-range vs separate Negentropy storages, and I'm not sure separate storages are the best course of action anymore.

If we keep many separate storages we can't merge them, and so we will need to initiate multiple syncs (one per storage).

We will always have a "roll-over" period where at least 2 syncs must be done.

The cost of local computation for removing elements from the storage is small compared to the bandwidth and network latency of an extra sync.

I think we should use the sub-range functionality of Negentropy and remove old message hashes. A store query could be used to get the hashes and timestamps of messages older than 1h.

We could also benefit from adding new functionality to the storage to prune elements based on timestamps.

@chaitanyaprem

If we keep many separate storage we can't merge them and so will need to initiate multiple sync (one per storage).

Do we see a need to merge these storages? This is only used for syncing, and if the sync duration is going to be 1 hour, then why would these need merging?

The cost of local computation for removing elements from the storage is small compared to the bandwidth and network latency of an extra sync.

Considering the explanation below, would the cost of removing elements still be small? As per the negentropy spec, the max round trips to perform a sync over 1 million entries is 3, right? I am wondering if that really adds so much latency in comparison to removing all old entries.

The cost of local computation for removing elements from the storage is small compared to the bandwidth and network latency of an extra sync.

It is not just the cost of removing elements from storage; we need to identify which elements are to be removed, which means we need to know the list of hashes for all elements to be removed, which would be an overhead. Hence, having separate storages would address this problem, wherein a storage can just be removed.

@SionoiS
Author

SionoiS commented Apr 22, 2024

Do we see a need to merge these storages? This is only used for syncing and if sync duration is going to be 1 hour, then why would these need merging?

No matter the time range of each storage, at some point the last 1h will have to cross into another storage no?

IMO it would be simpler to, at intervals, do a store query for message hashes older than 1h and remove those from the storage.

It's all speculation at this point. I guess we could try both methods.

@jm-clius
Contributor

Am I right in my understanding that the main issue here is actually the difficulty of removing old items from a negentropy storage? My understanding is you basically need a list of timestamps and hashes to remove after which you can proceed to erase each item individually.

I guess it will be interesting to understand the relative costs/performance impact of:

  • creating new Negentropy storages dynamically
  • creating SubRanges dynamically from existing Negentropy storage

To me it conceptually makes sense to have a longer-ranged Negentropy storage with occasional maintenance (to get rid of old items) and then to consider using a new SubRange storage for every regular sync. Unless we do some further secondary caching, I think we already know that we're likely to use SubRange as we never want to consider the latest messages (up to 20s in the past) during a sync. We expect some "real time" jitter for ~20s, while everything older than 20s can be considered settled. We can even ensure that Store Sync can only happen on pre-agreed 20s boundaries 3 times per minute so Sync nodes know exactly what subranges to be prepared for (i.e. at hh:mm:00, hh:mm:20, hh:mm:40). Now, what size SubRange is feasible to create three times per minute from a larger Negentropy storage? If it's less than 1 hour, should our regular sync be over an even shorter time duration, say 10min? Is there a way to do a partial sync over a Negentropy storage other than SubRange?

If it's not feasible to create dynamic SubRanges, we may need our own in-memory list of hashes and timestamps, a queue if you will, which we use as index to both add and erase items from a single, managed Negentropy storage to ensure it only contains exactly the items we want to sync on.

I'd also be interested in what the cost is to create ad-hoc larger Negentropy storages for e.g. multi-hour sync requests by querying the archive DB. This won't be a regular sync use case, but could be useful for Waku Store providers doing internal sync over longer time periods for a limited number of Store nodes.
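To make the pre-agreed boundary idea above concrete, a small sketch; the 20 s slot, the 20 s settling lag and the 10 min span are the values floated in the comment and purely illustrative:

```python
def sync_window(now: int, slot: int = 20, lag: int = 20, span: int = 600) -> tuple[int, int]:
    # Align the end of the range to a pre-agreed `slot`-second boundary, keep
    # it `lag` seconds behind real time so in-flight messages can settle, and
    # cover the previous `span` seconds.
    end = (now - lag) // slot * slot
    return end - span, end
```

Because every node computes the same boundaries from wall-clock time, peers know in advance which sub-ranges they may be asked to serve.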

@SionoiS
Author

SionoiS commented Apr 25, 2024

Am I right in my understanding that the main issue here is actually the difficulty of removing old items from a negentropy storage? My understanding is you basically need a list of timestamps and hashes to remove after which you can proceed to erase each item individually.

Yes, or we could change the API Negentropy provides. The tree already stores elements in order; it would be trivial to implement "erase elements older than X" IMO.

  • creating new Negentropy storages dynamically

The cost is multiple syncs and the logic to handle storage juggling.

  • creating SubRanges dynamically from existing Negentropy storage

The cost is either doing a store query for the list of timestamps and hashes to remove, or implementing a new prune function in Negentropy.

Unless we do some further secondary caching, I think we already know that we're likely to use SubRange as we never want to consider the latest messages (up to 20s in the past) during a sync. We expect some "real time" jitter for ~20s, while everything older than 20s can be considered settled.

Even if we don't do that, syncing a couple of extra messages is not a problem.

Is there a way to do a partial sync over a Negentropy storage other than SubRange?

No, sub-range is the intended solution.

If it's not feasible to create dynamic SubRanges

No way it would be that inefficient (famous last words).

I'd also be interested in what the cost is to create ad-hoc larger Negentropy storages for e.g. multi-hour sync requests by querying the archive DB.

It would take longer to query the DB than to build the tree I would bet.
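To illustrate the "erase elements older than X" idea from the first answer above, a sketch over a sorted Python list rather than the actual negentropy B-tree; since elements are ordered by timestamp, everything to prune sits in one contiguous prefix:

```python
import bisect

class OrderedHashStore:
    # Elements kept sorted by (timestamp, message hash).
    def __init__(self) -> None:
        self.items: list[tuple[int, bytes]] = []

    def insert(self, timestamp: int, msg_hash: bytes) -> None:
        bisect.insort(self.items, (timestamp, msg_hash))

    def erase_older_than(self, cutoff: int) -> int:
        # The prefix [0, idx) holds every element with timestamp < cutoff, so
        # a single slice removes them all; a tree-backed storage could drop
        # the same prefix without visiting elements one by one.
        idx = bisect.bisect_left(self.items, (cutoff, b""))
        removed = idx
        self.items = self.items[idx:]
        return removed
```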

@chaitanyaprem

chaitanyaprem commented May 1, 2024

The tree already store elements in order it would be trivial to impl. "erase element older than X" IMO.

I too had this thought initially, but considering it is a B-tree, I don't think it would be that trivial, as removal of entries requires rebalancing of the tree and I am not sure pruning would be as simple.
From a first look at the B-tree implementation in negentropy, it looks like removal of elements will have to be done one by one, which then feels very similar to invoking erase from storage for all elements that we want to remove.

Also, another thing to note is that once we move to a DB-backed storage such as LMDB, each node will have to be removed separately, which would be its own operation, and I am not sure there would be a simple way to prune the tree.

But then again, there could be an algorithm to prune a B-tree without removing elements one by one. Could not find anything in a quick search. Will spend some more time on this to see if there is any other alternative.

@SionoiS
Author

SionoiS commented May 1, 2024

We could build our own storage implementation if the deletion of KVs is too costly. Some kind of growable ring buffer maybe?

@chaitanyaprem

chaitanyaprem commented May 2, 2024

We could build our own storage impl. if the deleting of KVs is too costly. Some kind of growable ring buffer maybe?

Each delete operation on a B-tree is supposed to be O(log n) complexity. But if we delete one by one in sequence, it will just result in a lot of tree rebalances. That, coupled with the additions that would happen to the tree in parallel due to messages being received by the node, would definitely require synchronization of these operations; i.e. during a prune, which will be a slightly longer operation, additions either have to be blocked or deletion will have to happen in batches where additions are queued during deletion of a batch. This looks like it will lead to more complexity.

Also, considering our final solution will be to use a database (either LMDB or something else), having to delete entries would definitely be more costly than maintaining multiple storages.

How about an alternative approach of mixed use of multiple storages and SubRanges? What if we maintain a separate storage for a longer duration (maybe a few hours) and use SubRanges for syncing against the latest storage? This would have the following benefits:

  1. We don't need to prune storages separately; rather, we just delete that specific storage, which is a single operation as it removes the entire tree.
  2. The problem of 2 syncs will happen only at the boundary, which would not be that bad.
  3. We don't need to worry about synchronization of the Sync state (i.e. the tree), because pruning or deletion will happen on the older storage only and addition of new entries will happen to the latest one.
  4. It would also address the point @jm-clius was trying to make, i.e. for longer-term sync we can have nodes that maintain a longer Sync state just because they have more disk space.

Also, I was taking a relook at the code of the StorageManager (which rotates and manages storages) and it doesn't look too complex to maintain.

Thoughts @SionoiS , @jm-clius ?

@SionoiS
Author

SionoiS commented May 2, 2024

The problem is that the storage impl. is not adapted to our use case. The good news is that RBSR is agnostic to how elements are stored. Instead of mitigating the problem at a higher layer, let's actually fix the problem.

The first and easiest thing to implement would be to delete elements one by one, so that the POC is somewhat usable in a long-running node (inefficient as it may be) soon. We then follow with storage rotation to mitigate the problem, but we also work on an actual fix in the form of a new storage implementation.

At that point, we could maybe benefit from swapping the C++ code for Rust or Nim, since we have to implement a new storage and it would make interop easier too.

@jm-clius
Contributor

jm-clius commented May 2, 2024

It does indeed seem like the storage implementation is too limited. However, if we can get the simplest POC solution implemented (even if it does a fairly expensive erase operation) we can evaluate the actual impact of the three avenues available:

  1. fix/maintain/rewrite the storage implementation for our use
  2. use some mitigating strategy, such as storage rotation (if this is significantly easier than (1) I think this is a reasonable approach)
  3. discard negentropy and implement some other sync structure

@chaitanyaprem

RBSR is agnostic to how elements are stored.

Not sure if that is completely true. I mean, from a protocol design standpoint it is. IIRC, it also considers a form of hashing that is tree-friendly, so that adding a single element to the tree doesn't result in recomputing the hashes for the whole tree all over again.
We would have to design or come up with another data structure that fits this bill and is not tree-like if we have to support pruning (AFAIK all tree-based data structures would have the same problem).

The underlying persistence of the tree is flexible in the sense that we can easily store the tree in a Postgres DB. I had noticed that this is going to be a very easy job after going through the tree implementation. But changing the data structure itself used for storage is not going to be an easy task, I think.

The first and easiest thing to impl. would be to delete elements 1 by 1 so that the POC is somewhat usable in a long running node (inefficiently it may be) soon. We then follow with storage rotation to mitigate the problem but we also work on a actual fix in the form of a new storage impl.

Agreed, and this would be easier done from the top level so that any synchronization can be done at the application level (i.e. in this case the StoreSync level).

@SionoiS
Author

SionoiS commented Aug 30, 2024

In the context of the recent focus on reliability, I've been thinking of more use cases for RBSR.

We could replace light push and reliability checks with syncing. Since we need to verify that messages have been sent correctly, a sync session could be used to do the "push" and the verification at the same time. Node A would put a new message in its own local cache, then initiate a sync with node B. This would verify that the previous messages were sent correctly and also send the new message at the same time. Node B would have to keep a sub-set for each of its light push clients.

The same logic and flow could also be used to replace the Filter protocol. A sync could be initiated from node B when it has received one or more messages that are interesting to one of its light clients (i.e. part of their sub-set).

If we assume that nodes, light and "full", continuously sync with each other at opportune moments, and that each keeps a set of the messages it is interested in, the same code can be used everywhere.

With the current implementation of Waku Sync this would be possible but awkward, as it is not possible to efficiently create sub-sets; each cache would have to be managed on its own and duplicate data. A better implementation would allow the creation of lightweight sub-sets, or syncing with sets sorted by more than just time (probably content topic).

This would preserve the status quo in terms of privacy and increase bandwidth slightly while reducing overall complexity.
