Do not crash collator when all relay chain RPCs drop. #4278

skunert · 2024-04-25T07:43:39Z

When starting a collator node with --relay-chain-rpc-urls, we expect that at least one of the specified nodes is available at all times.

We could look into accepting occasional disconnects and continue running. This would probably lead to error prints on the parachain side, but that should be acceptable in this case.

The text was updated successfully, but these errors were encountered:

# Description Adds retry logic that makes the RPC relay chain interface more reliable for the cases of a collator connecting to external RPC servers. Closes #5514 Closes #4278 Final solution still debated on #5514 , what this PR addresses might change (e.g. #4278 might require a more advanced approach). ## Integration Users that start collators should barely observe differences based on this logic, since the retry logic applies only in case the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before starting collators, and the issue isn't visible. ## Review Notes The added retry logic is for retrying the connection to the RPC servers (which can be multiple). It is at the level of the cumulus/client/relay-chain-rpc-interface module, but more specifically relevant to the RPC clients logic (`ClientManager`). The retry logic is not configurable, it tries to connect to the RPC client for 5 times, with an exponential backoff in between each iteration starting with 1 second wait time and ending with 16 seconds. The same logic is applied in case an existing connection to an RPC is dropped. There is a `ReconnectingWebsocketWorker` who ensures there is connectivity to at least on RPC node, and the retry logic makes this stronger by insisting on trying connections to the RPC servers list for 5 times. ## Testing - This was tested manually by starting zombienet natively based on [006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml) and observing collators don't fail anymore: ```bash zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml ``` - Added a unit test that exercises the retry logic for a client connection to a server that comes online in 10 seconds. The retry logic can wait for as long as 30 seconds, but thought that it is too much for a unit test. Just being conscious of CI time if it runs this test, but I am happy to see suggestions around it too. I am not that sure either it runs in CI, haven't figured it out entirely yet. The test can be considered an integration test too, but it exercises crate internal implementation, not the public API. Collators example logs after the change: ``` 2024-08-29 14:28:11.730 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/" 2024-08-29 14:28:12.737 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/" 2024-08-29 14:28:12.739 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/" 2024-08-29 14:28:12.755 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/" 2024-08-29 14:28:14.758 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/" 2024-08-29 14:28:14.759 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/" 2024-08-29 14:28:14.760 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/" 2024-08-29 14:28:18.766 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/" 2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/" 2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/" 2024-08-29 14:28:26.770 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/" ``` --------- Signed-off-by: Iulian Barbu <[email protected]> Co-authored-by: Sebastian Kunert <[email protected]>

skunert added T0-node This PR/Issue is related to the topic “node”. I5-enhancement An additional feature request. labels Apr 25, 2024

skunert added this to SDK Node Apr 25, 2024

github-project-automation bot moved this to backlog in SDK Node Apr 25, 2024

skunert mentioned this issue Apr 25, 2024

Local Collator Panic - Connecting to Rococo with Validator RPC #4167

Closed

2 tasks

skunert mentioned this issue Aug 29, 2024

Failure in relay chain rpc interface's connection attempt to external RPC servers #5514

Closed

iulianbarbu mentioned this issue Aug 29, 2024

cumulus/client: added external rpc connection retry logic #5515

Merged

skunert closed this as completed in #5515 Sep 3, 2024

github-project-automation bot moved this from backlog to done in SDK Node Sep 3, 2024

iulianbarbu self-assigned this Sep 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not crash collator when all relay chain RPCs drop. #4278

Do not crash collator when all relay chain RPCs drop. #4278

skunert commented Apr 25, 2024

Do not crash collator when all relay chain RPCs drop. #4278

Do not crash collator when all relay chain RPCs drop. #4278

Comments

skunert commented Apr 25, 2024