diff --git a/README.md b/README.md
index 248d2f4..9981b32 100644
--- a/README.md
+++ b/README.md
@@ -48,8 +48,7 @@ This repository contains specifications for the Waku suite of protocols.
 |[WAKU2-NOISE](standards/application/noise.md)| Noise Protocols for Waku Payload Encryption |
 |[TOR-PUSH](standards/application/tor-push.md)| Waku Tor Push |
 |[RLN-KEYSTORE](standards/application/rln-keystore.md)| Waku RLN Keystore JSON schema |
-|[TRANSPORT-RELIABILITY](standards/application/transport-reliability.md)| Waku Transport Reliability |
-|[REQ-RES-RELIABILITY](standards/application/req-res-reliability.md)| Reliability for request-response protocols |
+|[P2P-RELIABILITY](standards/application/p2p-reliability.md)| Waku P2P Reliability |
 
 ### Informational
 
diff --git a/standards/application/req-res-reliability.md b/standards/application/req-res-reliability.md
deleted file mode 100644
index 3a10cd2..0000000
--- a/standards/application/req-res-reliability.md
+++ /dev/null
@@ -1,146 +0,0 @@
---
title: REQ-RES-RELIABILITY
name: Reliability for request-response protocols
category: Standards Track
tags: [reliability, application]
editor: Oleksandr Kozlov
contributors:
- Prem Chaitanya Prathi
- Danish Arora
- Oleksandr Kozlov
---

## Abstract
This RFC describes a set of instructions used across different [WAKU2](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md) implementations for improved reliability during usage of request-response protocols by a light node:
- [WAKU2-LIGHTPUSH](../standards/core/lightpush.md) - is used for sending messages;
- [WAKU2-FILTER](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md) - is used for receiving messages;

## Design Requirements
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”,
“RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in [RFC2119](https://www.ietf.org/rfc/rfc2119.txt).

### Definitions
- Service node - provides services to other nodes, such as relaying messages sent via LightPush to the network or serving messages from the network through Filter; usually serves responses;
- Light node - connects to and uses one or more service nodes via the LightPush and/or Filter protocols; usually sends requests;
- Service node failure - can mean various things depending on the protocol in use:
  - generic protocol failure - the request timed out or failed without error codes;
  - LightPush specific failure - refer to the [error codes](../standards/core/lightpush.md#examples-of-possible-error-codes) and consider a request a failure when it is clear that the service node cannot serve any future request, for example when the service node does not have any peers to relay to and returns `NO_PEERS_TO_RELAY`;
  - Filter specific failure - we consider a service node failing when it cannot serve a [subscribe](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) or [ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) request with an OK status;

## Motivation

Specifications of the mentioned protocols do not define some of the real-world use cases that are often observed in unreliable network environments from the perspective of light nodes that are consumers of the LightPush and/or Filter protocols.
Such use cases can be: recovery from an offline state, decreasing the number of missed messages, increasing the probability of messages being broadcast within the network, and mitigating unreliability of the service node in use.

## Recommendations

### Node health

Node health is a metric meant to determine the connectivity state of a light node and its present ability to reliably send and receive messages from the network.
We consider this reliability to be dependent on the number of simultaneous connections to responsive service nodes.
Unfortunately, the more connections a light node establishes, the more bandwidth is consumed.
To address this we RECOMMEND the following states:
- unhealthy:
  - no connections to service nodes are available, regardless of protocol;
- minimally healthy:
  - the Filter protocol has one service node connection AND the LightPush protocol has one service node connection;
- sufficiently healthy:
  - Filter has at least 2 connections available to service nodes AND LightPush has at least 2 connections available to service nodes;

Overall node health SHOULD be considered unhealthy or minimally healthy, respectively, if one of the protocols is unhealthy or minimally healthy while the other is sufficiently healthy.

### Peers and connection management

#### Pool of reliable service nodes
Light nodes SHOULD maintain a pool of reliable service nodes for each protocol and keep a connection to them.
In case a service node [fails](./req-res-reliability.md#definitions) to serve a protocol request,
the light node MAY drop the connection to it and replace it with a new service node in the pool.
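
A minimal sketch of how a light node might track such a pool is shown below; the `discover_new_service_node` helper, the pool sizes, and the failure thresholds are illustrative assumptions of this sketch, not part of the specification:

```python
class ServiceNodePool:
    """Tracks the service nodes a light node uses for one protocol."""

    def __init__(self, protocol, target_size=2, failure_threshold=1):
        self.protocol = protocol            # e.g. "lightpush" or "filter"
        self.target_size = target_size
        self.failure_threshold = failure_threshold
        self.failures = {}                  # peer_id -> consecutive failure count

    def fill(self):
        # Top the pool up to the desired number of connected service nodes.
        while len(self.failures) < self.target_size:
            peer_id = discover_new_service_node(self.protocol)  # hypothetical helper
            self.failures[peer_id] = 0

    def record_success(self, peer_id):
        self.failures[peer_id] = 0

    def record_failure(self, peer_id):
        # Once a node crosses the threshold, drop it and refill the pool.
        self.failures[peer_id] = self.failures.get(peer_id, 0) + 1
        if self.failures[peer_id] >= self.failure_threshold:
            self.failures.pop(peer_id)
            self.fill()
```

A LightPush pool could use `failure_threshold=1` and a Filter pool `failure_threshold=2`, matching the replacement criteria recommended below.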

We RECOMMEND replacing the service node used for LightPush right after the first failure in case:
- the connection to it is lost or the request timed out;
- its response contains one of the [error codes](../standards/core/lightpush.md#examples-of-possible-error-codes) `UNSUPPORTED_PUBSUB_TOPIC`, `INTERNAL_SERVER_ERROR` or `NO_PEERS_TO_RELAY`;
- the request failed but without an error message returned;

For Filter, we RECOMMEND replacing the service node when:
- a [request for subscription](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscribe) fails;
- a [ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) failed 2 times in a row;

#### Selection of discovered service nodes
During discovery, a light node SHOULD filter out service nodes based on its preferences before establishing a connection.
These preferences MAY include:
- the [libp2p multiaddresses](https://github.com/libp2p/specs/blob/master/addressing/README.md) of a service node;
- the Waku or libp2p protocols that a service node implements;
- the Waku shards that a service node is part of;

More details about discovery can be found in the [WAKU2 Discovery domain](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/10/waku2.md#discovery-domain) or [RELAY-SHARDING Discovery](https://github.com/waku-org/specs/blob/master/standards/core/relay-sharding.md#discovery).

Examples of filtering:
- When a light node discovers service nodes that implement the needed Waku protocols, it SHOULD prioritize those that implement the most recent version of the protocol;
- A light node MUST connect only to those service nodes that participate in the needed shard and cluster;
- A light node MUST use only those service nodes that implement the needed transport protocols;
- When [Circuit V2](https://github.com/libp2p/specs/blob/master/relay/circuit-v2.md) multiaddresses are discovered by a light node, it SHOULD prefer other service nodes that can be connected to directly, if possible;

#### Continuous discovery
Light nodes MUST keep information about service nodes up to date.
For example, when a service node is discovered a second time,
its connection information kept in the Peer Store needs to be updated.

The following information MUST be up to date:
- [ENR](../standards/core/enr.md) information;
- [Libp2p multiaddresses](https://github.com/libp2p/specs/blob/master/addressing/README.md);

### LightPush

#### Sending with redundancy
To improve the chances of message delivery, a light node can attempt to send the same message via LightPush to 2 or more service nodes at the same time.
While doing so, it is important to note that bandwidth consumption increases proportionally to the number of additional service nodes used.
We RECOMMEND using 2 service nodes at a time.

#### Retry on failure
When a light node sends a message, it MUST await the LightPush response from the service node and check it for [possible error codes](../standards/core/lightpush.md#examples-of-possible-error-codes).
In case the request failed without an error code, timed out, or the response contains errors that can be temporary for the service node (e.g. `TOO_MANY_REQUESTS`),
the light node SHOULD try to re-send the message after some interval and continue doing so until an OK response is received or the send is canceled.
The interval can be arbitrary, but we RECOMMEND starting with 1 second and increasing it on each failure during a LightPush send.
It is important to note that, [per another recommendation](./req-res-reliability.md#pool-of-reliable-service-nodes), the light node SHOULD replace the failing service node with another one within the pool of service nodes used by LightPush.

#### Retry missing messages
A light node can verify that the network it currently uses has seen the messages that were sent via LightPush earlier.
In order to do that, the light node SHOULD use the [Store protocol](../standards/core/store.md) or the [Filter protocol](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md) with a different service node than the one used for LightPush.

By using the Store protocol, a light node can query any service node that implements the Store protocol and check whether the messages sent in the past period were seen.
Due to [Store message eligibility](https://github.com/waku-org/specs/blob/master/standards/core/store.md#waku-message-store-eligibility), only some of the messages will be stored, so there is a limit as to which messages can be verified by Store queries.
We RECOMMEND doing periodic Store queries once per 30 seconds.

By using the Filter protocol's active [subscription](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#filter-push), a light node can verify that a message sent through LightPush was seen by another service node in the network.
The Filter protocol has no such limitation on the type of messages received through a subscription,
but an active subscription does not allow seeing messages exchanged in the network while the light node was offline.

In case some of the messages were not verified by any of the previous methods, they SHOULD be re-sent via LightPush using a different service node.

### Filter

#### Regular pings
To ensure that a subscription is maintained by a service node and not closed, the light node SHOULD do recurring [pings](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping).
We RECOMMEND that a light node send ping requests once per minute.
In case the light node does not receive an OK response, or the ping times out 2 times, such a service node SHOULD be replaced as part of maintaining the [pool of reliable service nodes](./req-res-reliability.md#pool-of-reliable-service-nodes).
Right after such a replacement, the light node MUST create a new subscription to the newly connected service node as described in the [Filter specification](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md).

#### Redundant subscriptions for message loss mitigation
To mitigate the possibility of messages not being delivered by a service node, we RECOMMEND considering multiple Filter subscriptions.
A light node can initiate two subscriptions to the same content topic but to different service nodes.
While receiving messages through two subscriptions, duplicates MUST be dropped by using [deterministic message hashing](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/14/message.md#deterministic-message-hashing).
Note that such an approach increases bandwidth consumption proportionally to the number of extra subscriptions established and SHOULD be used with caution.
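
A Python-style sketch of such deduplication is shown below; the `waku.filter.subscribe` call, the `deterministic_message_hash` helper, and the peer identifiers are illustrative assumptions rather than a normative API:

```python
seen_message_hashes = set()

def on_filter_push(pubsub_topic, message):
    # deterministic_message_hash is assumed to implement the deterministic
    # message hashing scheme referenced above; duplicates arriving through
    # the redundant subscription are dropped here.
    digest = deterministic_message_hash(pubsub_topic, message)
    if digest in seen_message_hashes:
        return
    seen_message_hashes.add(digest)
    application.handle(message)

# Two subscriptions to the same content topic via two different service nodes.
for peer_id in (filter_service_node_a, filter_service_node_b):
    waku.filter.subscribe(peer_id, pubsub_topic, [content_topic], on_filter_push)
```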

#### Offline recoverability
The network state SHOULD be monitored by the light node, and in case it goes offline, [regular pings](./req-res-reliability.md#regular-pings) MUST be stopped.
When the network connection returns, the light node SHOULD initiate a [Filter ping](https://github.com/vacp2p/rfc-index/blob/7b443c1aab627894e3f22f5adfbb93f4c4eac4f6/waku/standards/core/12/filter.md#subscriber_ping) to the service nodes in use.
In case those pings fail, the light node MUST replace the service nodes, following the advice in [pool of reliable service nodes](./req-res-reliability.md#pool-of-reliable-service-nodes), without waiting for multiple failures.
Note that [HistoryQuery](../standards/core/store.md) can be used if a light node wants to retrieve messages circulated in the network while it was offline.

## Security/Privacy Considerations

See [WAKU2-ADVERSARIAL-MODELS](https://github.com/waku-org/specs/blob/master/informational/adversarial-models.md).

## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).
diff --git a/standards/application/transport-reliability.md b/standards/application/transport-reliability.md
deleted file mode 100644
index bde101a..0000000
--- a/standards/application/transport-reliability.md
+++ /dev/null
@@ -1,196 +0,0 @@
---
title: TRANSPORT-RELIABILITY
name: Waku Transport Reliability
category: Standards Track
tags: [reliability, application]
editor: Kaichao Sun
contributors:
  - Richard Ramos
---

## Abstract

Waku provides an efficient transport layer for p2p communications.
It defines protocols like [Relay](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/11/relay.md) and [Lightpush](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/19/lightpush.md) / [Filter](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/12/filter.md) for routing messages in decentralised networks.
However, there is no guarantee that a message broadcast in a Waku network will reach its destination.

For example, the receiver in a chat application using Waku as p2p transport may miss messages when a network issue happens at either the sender or the receiver side.

In general, a message in a Waku network may be in one of 3 states from the sender's perspective:

- **outgoing**, the message is posted by the sender but there are no confirmations from other nodes yet
- **sent**, the message is received by any other node in the network
- **delivered**, the message is acknowledged on the application layer by the intended recipient

Applications like Status already use [MVDS](https://github.com/vacp2p/rfc-index/blob/main/vac/2/mvds.md) for e2e acknowledgement in direct messages and group chats. There is also an ongoing [discussion](https://forum.vac.dev/t/end-to-end-reliability-for-scalable-distributed-logs/293) about a more general and bandwidth-efficient solution for e2e reliability.

In other words, an application defines a payload over Waku and is interested in e2e delivery between application users. Waku provides a pub/sub broadcast transport, which is interested in reliably routing a message to all participants in the broadcast group.

Before we have a complete design for e2e reliability, we need to compose existing protocols to increase the reliability of the transport protocol. This document proposes a few options for such composition.
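
The three message states above can be modelled, purely as an illustration, as a simple enumeration on the application side; the type and value names below are not defined by this document:

```python
from enum import Enum

class MessageStatus(Enum):
    OUTGOING = "outgoing"    # posted by the sender, no confirmation from other nodes yet
    SENT = "sent"            # seen by at least one other node in the network
    DELIVERED = "delivered"  # acknowledged by the intended recipient at the application layer
```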

## Motivation

The [Store protocol](https://github.com/waku-org/specs/blob/master/standards/core/store.md) provides a way for nodes in the network to query the existence of messages or fetch specific messages based on search criteria.

**Query criteria with message hash**

For nodes that may have connection issues when **publishing** messages via the transport layer, this query criterion can be used to check whether a message has been populated in the network or not. A message that exists in the store node can be marked from `outgoing` to `sent` by the application. If the message is not found in the store node, the application can resend the message.

**Search criteria with topics and time range**

For nodes that may have connection issues when **receiving** messages via the transport layer, these query criteria can be used to periodically fetch missing messages from store nodes after the network resumes.

In summary, by leveraging store nodes to provide such query services, applications are able to mitigate reliability issues on the Waku transport layer.
This approach also introduces new limitations like centralized points of failure in Store nodes, diminished privacy, and lower scalability.
It should be viewed as a temporary solution and deprecated when an e2e reliability solution is ready.

## Implementation Suggestions

### Query with Message Hash

For outgoing messages, the processing flow can look like this:
- create a buffer for all "outgoing" message hashes
- send the message via the relay or lightpush protocol
- add the message hash to the buffer
- keep a copy of the message locally with status "outgoing"
- check the buffer periodically
- query the store node with the message hashes in the buffer whose send attempt was more than a few seconds ago
- if the message exists, update its status to "sent" in the local data store and remove the message hash from the buffer
- if the message does not exist, resend the message
- if the message is still missing in the store node after a period of time, trigger the message-failed-to-send workflow and remove the message hash from the buffer

The implementation in Python may look like this:

```python
outgoingMessageHashes = []

class Message:
    hash: str
    postTime: int
    status: str
    content: str

def send(message):
    # send message via relay or lightpush protocol, here using relay as an example
    waku.relay.post(message)
    outgoingMessageHashes.append(message.hash)

    message.status = 'outgoing'
    database.saveMessage(message)

def checkOutgoingMessages(peerID):
    # iterate over a copy so hashes can be removed from the buffer while looping
    for messageHash in list(outgoingMessageHashes):
        message = database.getMessage(messageHash)
        # only query the store node for outgoing messages posted more than 3 seconds ago
        if message.status == 'outgoing' and time.now() - message.postTime > 3:
            response = waku.store.queryMessage(peerID, messageHash)
            if response.exists():
                database.updateMessageStatus(messageHash, 'sent')
                outgoingMessageHashes.remove(messageHash)
            elif time.now() - message.postTime > 10:
                # resend the message if it is not stored in the store node after 10 seconds
                waku.relay.post(message)
```

Function `checkOutgoingMessages` is called periodically, most likely every few seconds. Message hashes can be queried in batches to reduce the number of requests to store nodes; the size of a batch should not exceed the maximum size supported by the store node.
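
A sketch of how the same check could be batched, in the same pseudocode style as above; `MAX_BATCH_SIZE` and the exact shape of the batch store query are assumptions to be adapted to the store node in use:

```python
MAX_BATCH_SIZE = 50  # illustrative; keep below the store node's advertised limit

def checkOutgoingMessagesInBatches(peerID):
    # snapshot the hashes that are old enough to be checked
    pending = [h for h in outgoingMessageHashes
               if time.now() - database.getMessage(h).postTime > 3]

    for start in range(0, len(pending), MAX_BATCH_SIZE):
        batch = pending[start:start + MAX_BATCH_SIZE]
        # one store request per batch instead of one request per message hash
        response = waku.store.queryMessagesByHash(peerID, batch)
        storedHashes = {message.hash for message in response.messages()}

        for messageHash in batch:
            message = database.getMessage(messageHash)
            if messageHash in storedHashes:
                database.updateMessageStatus(messageHash, 'sent')
                outgoingMessageHashes.remove(messageHash)
            elif time.now() - message.postTime > 10:
                # still not seen by the store node after 10 seconds, resend
                waku.relay.post(message)
```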

The store node can be set and updated directly by the application, or selected from peers which are discovered by protocols like [discv5](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/33/discv5.md) or [peer exchange](https://github.com/waku-org/specs/blob/master/standards/core/peer-exchange.md).

The store node may only support specific pubsub topics, and the application should group message hashes by pubsub topic before sending the request.

When a persistent network issue happens, the application may not want to resend the failed messages indefinitely; it should have a mechanism to clean the cache of failed message hashes and trigger other retry logic after a few attempts.

### Query with Topics and Time Range

An application could use different pubsub topics and content topics, for example a community may have its own pubsub topic, and each channel may have its own content topic. To fetch all missing messages in a specific channel, the application can query the store node with the provided pubsub topic, content topic and time range.

For incoming messages, the processing flow can look like this:
- subscribe to the interested pubsub and content topics
- periodically query the store node with the interested topics and time range for message hashes
- check whether each received message hash already exists in the local database; if not, add the missing message hash to a buffer
- batch-fetch the full messages corresponding to the missing message hashes in the buffer from the store node
- process the messages

The implementation in Python may look like this:

```python
from typing import List

class FetchRecord:
    pubsubTopic: str
    lastFetch: int

class QueryParams:
    pubsubTopic: str
    contentTopics: List[str]
    fromTime: int
    toTime: int

def fetchMissingMessages(peerID, queryParams):
    missingMessageHashes = []

    # get missing message identifiers first in order to reduce the data transfer
    response = waku.store.queryMessageHashes(peerID, queryParams)
    while not response.isComplete():
        # process each message hash in the response
        for messageHash in response.messages():
            message = queryDbMessageByHash(messageHash)
            if message.exists():
                continue
            missingMessageHashes.append(messageHash)

        # process the next page of the response
        response.Next()

    # fetch missing messages by hash in batch
    response = waku.store.queryMessagesByHash(peerID, missingMessageHashes)
    for message in response.messages():
        processMessage(message)

    updateFetchRecord(queryParams.pubsubTopic, queryParams.toTime)
```

`QueryParams` includes all the necessary information to fetch missing messages. The application should iterate over all the interested pubsub topics, along with their content topics, to construct the `QueryParams`.

Function `fetchMissingMessages` runs periodically, for example every 1 minute. It first fetches all the message hashes in the specified time range, checks whether each message exists in the local database, and if not, fetches the missing messages in batch. The batch size should be bounded to avoid large data transfers or exceeding the maximum size supported by the store node.

When it finishes fetching the missing messages, the application should update the last fetch time in `FetchRecord`. The last fetch time can be used to calculate the time range for the next fetch and avoid fetching the same messages again.


### Unified Query

There are cases where both outgoing and incoming messages are queried in a similar fashion, for example at the same interval.
The application can combine the above two workflows into one unified query for better overall performance.

The workflow can look like this:
- create an outgoing buffer for all "outgoing" messages
- create an incoming buffer for all recently received message hashes
- periodically query the store node based on the interested topics and time range for message hashes
- check the outgoing buffer against the returned message hashes; if a hash is included, mark the message as `sent`, otherwise resend it if needed
- check the incoming buffer against the returned message hashes; if a hash is not included, fetch the missing message by its hash

## Security and Performance Considerations

The message query request exposes the metadata of clients to the store nodes, and a store node is capable of associating the messages with the interested clients.

The query requests add a fair amount of load to store nodes, and it increases linearly as more users are onboarded. Store nodes should be able to scale up and down by monitoring or predicting the workload.

The store node can also be a target for DDoS attacks. The store node should have a mechanism to prevent such attacks.

The application should provide options to configure different store nodes for its users; such nodes can either be self-hosted or public nodes with a better reputation.


## Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).

## References

1. [Relay Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/11/relay.md)
2. [Light Push Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/19/lightpush.md)
3. [Filter Protocol](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/12/filter.md)
4. [MVDS - Minimum Viable Data Synchronization](https://github.com/vacp2p/rfc-index/blob/main/vac/2/mvds.md)
5. [End-to-end reliability for scalable distributed logs](https://forum.vac.dev/t/end-to-end-reliability-for-scalable-distributed-logs/293)
6. [Waku Store Query](https://github.com/waku-org/specs/blob/master/standards/core/store.md)
7. [Waku v2 Discv5 Ambient Peer Discovery](https://github.com/vacp2p/rfc-index/blob/main/waku/standards/core/33/discv5.md)
8. [Waku v2 Peer Exchange](https://github.com/waku-org/specs/blob/master/standards/core/peer-exchange.md)