Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update research docs #227

Merged
merged 1 commit into from
Oct 21, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/research/benchmarks/postgres-adoption.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,7 @@ Notice that the two `nwaku` nodes run the very same version, which is compiled l

#### Comparing archive SQLite & Postgres performance in [nwaku-b6dd6899](https://github.com/waku-org/nwaku/tree/b6dd6899030ee628813dfd60ad1ad024345e7b41)

The next results were obtained by running the docker-compose-manual-binaries.yml from [test-waku-query-c078075](https://github.com/waku-org/test-waku-query/tree/c07807597faa781ae6c8c32eefdf48ecac03a7ba) in the sandbox machine (metal-01.he-eu-hel1.wakudev.misc.statusim.net.)
The next results were obtained by running the docker-compose-manual-binaries.yml from [test-waku-query-c078075](https://github.com/waku-org/test-waku-query/tree/c07807597faa781ae6c8c32eefdf48ecac03a7ba) in the sandbox machine (metal-01.he-eu-hel1.wakudev.misc.status.im.)

**Scenario 1**

Expand Down Expand Up @@ -155,7 +155,7 @@ In this case, the performance is similar regarding the timings. The store rate i

This nwaku commit is after a few **Postgres** optimizations were applied.

The next results were obtained by running the docker-compose-manual-binaries.yml from [test-waku-query-c078075](https://github.com/waku-org/test-waku-query/tree/c07807597faa781ae6c8c32eefdf48ecac03a7ba) in the sandbox machine (metal-01.he-eu-hel1.wakudev.misc.statusim.net.)
The next results were obtained by running the docker-compose-manual-binaries.yml from [test-waku-query-c078075](https://github.com/waku-org/test-waku-query/tree/c07807597faa781ae6c8c32eefdf48ecac03a7ba) in the sandbox machine (metal-01.he-eu-hel1.wakudev.misc.status.im.)

**Scenario 1**

Expand Down
73 changes: 28 additions & 45 deletions docs/research/research-and-studies/capped-bandwidth.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,63 +2,46 @@
title: Capped Bandwidth in Waku
---

This issue explains i) why The Waku Network requires a capped bandwidth per shard and ii) how to solve it by rate limiting with RLN by daily requests (instead of every x seconds), which would require RLN v2, or some modifications in the current circuits to work. It also explains why the current rate limiting RLN approach (limit 1 message every x seconds) is not practical to solve this problem.
This post explains i) why The Waku Network requires a capped bandwidth per shard and ii) how to achieve it by rate limiting with RLN v2.

## Problem

First of all, lets begin with the terminology. We have talked in the past about "predictable" bandwidth, but a better name would be "capped" bandwidth. This is because it is totally fine that the waku traffic is not predictable, as long as its capped. And it has to be capped because otherwise no one will be able to run a node.
First of all, let's begin with the terminology. We have talked in the past about "predictable" bandwidth, but a better name would be "capped" bandwidth. This is because it is totally fine that the waku traffic is not predictable, as long as it is capped. And it has to be capped because otherwise, no one will be able to run a node.

Since we aim that everyone is able to run a full waku node (at least subscribed to a single shard) its of paramount importance that the bandwidth requirements (up/down) are i) reasonable to run with a residential internet connection in every country and ii) limited to an upper value, aka capped. If the required bandwidth to stay up to date with a topic is higher than what the node has available, then it will start losing messages and won't be able to stay up to date with the topic messages. And not to mention the problems this will cause to other services and applications being used by the user.
Since we aim that everyone can run a full waku node (at least subscribed to a single shard) it is of paramount importance that the bandwidth requirements (up/down) are i) reasonable to run with a residential internet connection in every country and ii) limited to an upper value, aka capped. If the required bandwidth to stay up to date with a topic is higher than what the node has available, then it will start losing messages and won't be able to stay up to date with the topic messages. And not to mention the problems this will cause to other services and applications being used by the user.

The main problem is that one can't just chose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no thing such as "partially contribute" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer. The network sets the pace.
The main problem is that one can't just choose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no thing such as "partially contributing" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer and potentially be disconnected. The network sets the pace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The main problem is that one can't just choose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no thing such as "partially contributing" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer and potentially be disconnected. The network sets the pace.
The main problem is that one can't just choose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no such thing as "partially contributing" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer and potentially be disconnected. The network sets the pace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The main problem is that one can't just choose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no thing such as "partially contributing" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer and potentially be disconnected. The network sets the pace.
The main problem is that one can't just choose the bandwidth it allocates to `relay`. One could set the maximum bandwidth willing to allocate to `store` but this is not how `relay` works. The required bandwidth is not set by the node, but by the network. If a pubsub topic `a` has a traffic of 50 Mbps (which is the sum of all messages being sent multiplied by its size, times the D_out degree), then if a node wants to stay up to date in that topic, and relay traffic in it, then it will require 50 Mbps. There is no thing such as "partially contributing" to the topic (with eg 25Mbps) because then you will be losing messages, becoming an unreliable peer, and can potentially be disconnected. The network sets the pace.


So waku needs an upper boundary on the in/out bandwidth (mbps) it consumes. Just like apps have requirements on cpu and memory, we should set a requirement on bandwidth, and then guarantee that if you have that bandwidth, you will be able to run a node without any problem. And this is the tricky part.
So waku needs an upper boundary on the in/out bandwidth (mbps) it consumes. Just like apps have requirements on cpu and memory, we should set a requirement on bandwidth, and then guarantee that if you have that bandwidth, you will be able to run a node without any problem. And this is the tricky part. This metric is Waku's constraint, similar to the gas-per-block limit in blockchains.

## Current approach
## Previous Work

With the recent productisation effort of RLN, we have part of the problem solved, but not entirely. RLN offers an improvement, since now have a pseudo-identity (RLN membership) that can be used to rate limit users, enforcing a limit on how often it can send a message (eg 1 message every 10 seconds). We assume of course, that getting said RLN membership requires to pay something, or put something at stake, so that it can't be sibyl attacked.
Quick summary of the evolution to solve this problem:
* Waku started with no rate-limiting mechanism. The network was subject to DoS attacks.
* RLN v1 was introduced, which allowed to rate-limit in a privacy-preserving and anonymous way. The rate limit can be configured to 1 message every `y` seconds. However, this didn't offer much granularity. A low `y` would allow too many messages and a high `y` would make the protocol unusable (impossible to send two messages in a row).
* RLN v2 was introduced, which allows to rate-limit each user to `x` messages every `y` seconds. This offers the granularity we need. It is the current solution deployed in The Waku Network.
Comment on lines +21 to +22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should highlight that for both RLN v1 and v2 y is fixed for all users. Variable epoch will only be available in RLN v3 (if we ever need it).

Copy link
Contributor Author

@fryorcraken fryorcraken Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Rate limiting with RLN so that each entity just sends 1 message every x seconds indeed solves the spam problem but it doesn't per se cap the traffic. In order to cap the traffic, we would first need to cap the amount of memberships we allow. Lets see an example:
- We limit to 10.000 RLN memberships
- Each ones is rate limited to send 1 message/10 seconds
- Message size of 50 kBytes
## Current Solution (RLN v2)

Having this, the worst case bandwidth that we can theoretically have, would be if all of the memberships publish messages at the same time, with the maximum size, continuously. That is `10.000 messages/sec * 50 kBytes = 500 MBytes/second`. This would be a burst every 10 seconds, but enough to leave out the majority of the nodes. Of course this assumption is not realistic as most likely not everyone will continuously send messages at the same time and the size will vary. But in theory this could happen.
The current solution to this problem is the usage of RLN v2, which allows to rate-limit `x` messages every `y` seconds. On top of this, the introduction of [WAKU2-RLN-CONTRACT](https://github.com/waku-org/specs/blob/master/standards/core/rln-contract.md) enforces a maximum amount of messages that can be sent to the network per `epoch`. This is achieved by limiting the amount of memberships that can be registered. The current values are:
* `R_{max}`: 160000 mgs/epoch
* `r_{max}`: 600 msgs/epoch
* `r_{min}`: 20 msgs/epoch

A naive (and not practical) way of fixing this, would be to design the network for this worst case. So if we want to cap the maximum bandwidth to 5 MBytes/s then we would have different options on the maximum i) amount of RLN memberships and ii) maximum message size:
- `1.000` RLN memberships, `5` kBytes message size: `1000 * 5 = 5 MBytes/s`
- `10.000` RLN memberships, `500` Bytes message size: `10000 * 0.5 = 5 MBytes/s`
In other words, the contract limits the amount of memberships that can be registered from `266` to `8000` depending on which rate limit users choose.

In both cases we cap the traffic, however, if we design The Waku Network like this, it will be massively underutilized. As an alternative, the approach we should follow is to rely on statistics, and assume that i) not everyone will be using the network at the same time and ii) message size will vary. So while its impossible to guarantee any capped bandwidth, we should be able to guarantee that with 95 or 99% confidence the bandwidth will stay around a given value, with a maximum variance.
On the other hand [64/WAKU2-NETWORK](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/64/network.md) states that:
* `rlnEpochSizeSec`: 600. Meaning the epoch size is 600 seconds.
* `maxMessageSize`: 150KB. Meaning the maximum message size that is allowed. Note: recommended average of 4KB.

The current RLN approach of rate limiting 1 message every x seconds is not very practical. The current RLN limitations are enforced on 1 message every x time (called `epoch`). So we currently can allow 1 msg per second or 1 msg per 10 seconds by just modifying the `epoch` size. But this has some drawbacks. Unfortunately, neither of the options are viable for waku:
1. A small `epoch` size (eg `1 seconds`) would allow a membership to publish `24*3600/1=86400` messages a day, which would be too much. In exchange, this allows a user to publish messages right after the other, since it just have to wait 1 second between messages. Problem is that having an rln membership being able to publish this amount of messages, is a bit of a liability for waku, and hinders scalability.
2. A high `epoch` size (eg `240 seconds`) would allow a membership to publish `24*3600/240=360` messages a day, which is a more reasonable limit, but this won't allow a user to publish two messages one right after the other, meaning that if you publish a message, you have to way 240 seconds to publish the next one. Not practical, a no go.
Putting this all together and assuming:
* Messages are sent uniformly distributed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Messages are sent uniformly distributed.
* Messages are sent with uniform distribution.

* All users totally consumes its rate-limit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* All users totally consumes its rate-limit.
* All users totally consume its rate-limit.


But what if we widen the window size, and allow multiple messages within that window?
We can expect the following message rate and bandwidth for the whole network:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
We can expect the following message rate and bandwidth for the whole network:
We can expect the following message rate and bandwidth for the entire network:

* A traffic of `266 msg/second` on average (`160000/600`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A traffic of `266 msg/second` on average (`160000/600`)
* Traffic of `266 msg/second` on average (`160000/600`)

* A traffic of `6 MBps` on average (266 * 4KB * 6), where `4KB` is the average message size and `6` is the average gossipsub D-out degree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A traffic of `6 MBps` on average (266 * 4KB * 6), where `4KB` is the average message size and `6` is the average gossipsub D-out degree.
* Traffic of `6 MBps` on average (266 * 4KB * 6), where `4KB` is the average message size and `6` is the average gossipsub D-out degree.


## Solution

In order to fix this, we need bigger windows sizes, to smooth out particular bursts. Its fine that a user publishes 20 messages in one second, as long as in a wider window it doesn't publish more than, lets say 500. This wider window could be a day. So we could say that a membership can publish `250 msg/day`. With this we solve i) and ii) from the previous section.

Some quick napkin math on how this can scale:
- 10.000 RLN memberships
- Each RLN membership allow to publish 250 msg/day
- Message size of 5 kBytes

Assuming a completely random distribution:
- 10.000 * 250 = 2 500 000 messages will be published a day (at max)
- A day has 86 400 seconds. So with a random distribution we can say that 30 msg/sec (at max)
- 30 msg/sec * 5 kBytes/msg = 150 kBytes/sec (at max)
- Assuming D_out=8: 150 kBytes/sec * 8 = 1.2 MBytes/sec (9.6 Mbits/sec)

So while its still not possible to guarantee 100% the maximum bandwidth, if we rate limit per day we can have better guarantees. Looking at these numbers, considering a single shard, it would be feasible to serve 10.000 users considering a usage of 250 msg/day.

TODO: Analysis on 95%/99% interval confidence on bandwidth given a random distribution.

## TLDR

- Waku should guarantee a capped bandwidth so that everyone can run a node.
- The guarantee is a "statistical guarantee", since there is no way of enforcing a strict limit.
- Current RLN approach is to rate limit 1 message every x seconds. A better approach would be x messages every day, which helps achieving such bandwidth limit.
- To follow up: Variable RLN memberships. Eg. allow to chose tier 1 (100msg/day) tier 2 (200msg/day) etc.
And assuming a uniform distribution of traffic among 8 shards:
* `33 msg/second` per shard.
* `0.75 MBps` per shard.
4 changes: 2 additions & 2 deletions docs/research/research-and-studies/maximum-bandwidth.md
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ The **trade-off is clear**:
So it's about where to draw this line.

Points to take into account:
- **Relay contributes to bandwidth the most**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just won't work.
- **Relay contributes to bandwidth the most**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just wont work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Relay contributes to bandwidth the most**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just wont work.
- **Relay consumes the most bandwidth**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just wont work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- **Relay contributes to bandwidth the most**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just wont work.
- **Relay contributes to bandwidth the most**: Relay is the protocol that mostly contributes to bandwidth usage, and it can't choose to allocate fewer bandwidth resources like other protocols (eg `store` can choose to provide less resources and it will work). In other words, the network sets the relay bandwidth requirements, and if the node can't meet them, it just won't work.

- **Upload and download bandwidth are the same**: Due to how gossipsub works, and hence `relay`, the bandwidth consumption is symmetric, meaning that upload and download bandwidth is the same. This is because of `D` and the reciprocity of the connections, meaning that one node upload is another download.
- **Nodes not meeting requirements can use light clients**. Note that nodes not meeting the bandwidth requirements can still use waku, but they will have to use light protocols, which are a great alternative, especially on mobile, but with some drawbacks (trust assumptions, less reliability, etc)
- **Waku can't take all the bandwidth:** Waku is meant to be used in conjunction with other services, so it shouldn't consume all the existing bandwidth. If Waku consumes `x Mbps` and someone bandwidth is `x Mpbs`, the UX won't be good.
Expand All @@ -80,4 +80,4 @@ Coming up with a number:

**Conclusion:** Limit to `10 Mbps` each waku shard. How? Not trivial, see https://github.com/waku-org/research/issues/22#issuecomment-1727795042

*Note:* This number is not set in stone and is subject to modifications, but it will most likely stay in the same order of magnitude if changed.
*Note:* This number is not set in stone and is subject to modifications, but it will most likely stay in the same order of magnitude if changed.
Loading