Failure in relay chain rpc interface's connection attempt to external RPC servers #5514
That this test always fails is a bit strange; it means that none of the three relay chain nodes given as connection targets came up. Did you check the relay chain node logs? Maybe they face some issue and that is why the collators cannot connect? Strange that it fails consistently for you; I ran this on my local machine just fine (and it's also running in CI).
For this, there is this issue: #4278
Both validators and RPC nodes look fine (no errors/warnings); it is just that the RPC nodes' RPC API comes online a bit later than when both collators attempt to connect to them. E.g. for the error I attached in the issue description, both collators attempted connections at roughly similar times (~3-4 seconds apart), while the RPC nodes' API came online ~20-30 seconds later.
Given there are no errors/warnings in the logs, do you suggest my reproductions point to deeper issues that degrade the start path of the RPC nodes and are visible only on my machine? To me it feels like plain slowness on the RPC nodes' start path (reproducible for some reason on my machine, maybe because it is slower than others), but nothing serious, based on the logs for the RPC nodes and validators.
I see. Looks like retry logic is also relevant for validators/RPC nodes going offline. The RPC WS client currently tries reconnecting to the next external RPC server, but at worst it will iterate through the entire external RPC server list once and then stop if all connections fail. The current issue is more related to the collator startup path, which would ideally be more insistent at the start through basic retry logic, but collators can benefit from the same retry logic when RPC nodes go offline (making the reconnections slightly better). One thing I'm curious about: do you have a rough idea how frequently all external RPC servers set for collators go offline for users in general? Asking just to get a sense of how nuanced this issue is, and whether configurable retry logic is a better fit than my initial suggestion of limited, hardcoded retry logic (which I thought was a good fit if the problem is mostly relevant to testing scenarios rather than production). A minimal sketch of the idea follows.
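For reference, here is a minimal sketch of the difference between the current single pass over the server list and a bounded retry with exponential backoff. This is not the actual cumulus code; `Connection`, `connect`, `try_connect_any`, and `connect_with_retries` are illustrative names, and the attempt count and backoff values mirror the numbers in the PR description below.

```rust
use std::time::Duration;

/// Hypothetical connection handle; stands in for a WebSocket RPC client.
struct Connection;

/// Placeholder for the real WebSocket dial; always fails in this sketch.
async fn connect(_url: &str) -> Result<Connection, ()> {
    Err(())
}

/// Current behaviour: one pass over the external RPC server list, returning
/// the first server that accepts a connection, or `None` if all of them fail.
async fn try_connect_any(urls: &[String]) -> Option<(usize, Connection)> {
    for (index, url) in urls.iter().enumerate() {
        if let Ok(conn) = connect(url).await {
            return Some((index, conn));
        }
    }
    None
}

/// Proposed behaviour: repeat the pass a bounded number of times, sleeping
/// with exponential backoff (1s, 2s, 4s, 8s, 16s) after each failed pass.
async fn connect_with_retries(urls: &[String]) -> Option<(usize, Connection)> {
    const RETRY_ATTEMPTS: u32 = 5;
    for attempt in 0..RETRY_ATTEMPTS {
        if let Some(found) = try_connect_any(urls).await {
            return Some(found);
        }
        tokio::time::sleep(Duration::from_secs(1 << attempt)).await;
    }
    None
}
```

With this shape, a collator keeps probing the same list for roughly 30 seconds in total before giving up, instead of aborting after a single pass.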
# Description

Adds retry logic that makes the RPC relay chain interface more reliable for the case of a collator connecting to external RPC servers.

Closes #5514
Closes #4278

The final solution is still being debated in #5514, so what this PR addresses might change (e.g. #4278 might require a more advanced approach).

## Integration

Users that start collators should barely observe any difference from this logic, since the retry logic applies only when the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before the collators start, so the issue isn't visible.

## Review Notes

The added retry logic retries the connection to the RPC servers (of which there can be multiple). It lives in the cumulus/client/relay-chain-rpc-interface module, specifically in the RPC client logic (`ClientManager`). The retry logic is not configurable: it attempts to connect to the RPC servers 5 times, with exponential backoff between iterations, starting at a 1 second wait time and ending at 16 seconds. The same logic applies when an existing connection to an RPC server is dropped. There is a `ReconnectingWebsocketWorker` which ensures there is connectivity to at least one RPC node, and the retry logic strengthens this by insisting on connection attempts across the RPC server list for 5 iterations.

## Testing

- Tested manually by starting zombienet natively based on [0006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml) and observing that the collators don't fail anymore:

  ```bash
  zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml
  ```

- Added a unit test that exercises the retry logic for a client connection to a server that comes online after 10 seconds. The retry logic can wait as long as ~30 seconds in total, but that felt too long for a unit test. I am being conscious of CI time in case this test runs there, though I have not yet confirmed whether it does; I am happy to see suggestions around it. The test can be considered an integration test too, but it exercises crate-internal implementation, not the public API.

Example collator logs after the change:

```
2024-08-29 14:28:11.730 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"
```

---------

Signed-off-by: Iulian Barbu <[email protected]>
Co-authored-by: Sebastian Kunert <[email protected]>
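As an aside, a test along the lines described above could look roughly like this. This is an assumed shape using a plain TCP listener, not the PR's actual test; the port and the function name are arbitrary.

```rust
use std::time::Duration;
use tokio::net::{TcpListener, TcpStream};

/// Sketch of a delayed-server retry test: the "server" only starts
/// listening after 10 seconds, and a client retrying with exponential
/// backoff (1s, 2s, 4s, 8s, 16s) should still reach it.
#[tokio::test]
async fn retrying_client_reaches_late_server() {
    let addr = "127.0.0.1:9988"; // arbitrary port for the sketch

    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(10)).await;
        let listener = TcpListener::bind(addr).await.unwrap();
        // Accept a single connection, then let the task end.
        let _ = listener.accept().await;
    });

    let mut connected = false;
    for attempt in 0u32..5 {
        if TcpStream::connect(addr).await.is_ok() {
            connected = true;
            break;
        }
        tokio::time::sleep(Duration::from_secs(1 << attempt)).await;
    }
    // Retries land at roughly t = 1, 3, 7, 15 seconds; the t = 15 attempt
    // is the first one after the listener comes up at t = 10.
    assert!(connected);
}
```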
I am trying to start zombienet natively based on https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml. The collators always fail on my machine with a connection error when trying to reach the external relay chain RPC servers.
There are ways to fix this in zombienet through better coordination when starting the nodes, but I feel that can lead to overengineering: e.g. it asks for starting nodes based on node dependencies, which other providers (k8s/podman) support in various ways, but which I personally still don't find straightforward.
The alternative is to add retry logic to the cumulus/client/relay-chain-rpc-interface logic that is sufficiently reliable for most cases; if it becomes relevant for cases other than zombienet-related testing, we could expose it in the node configuration for users to set as they need, as sketched below.
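If we do end up exposing it, the knob could be as small as a config struct carrying the attempt count and initial backoff. This is purely illustrative; none of these names exist in the codebase.

```rust
use std::time::Duration;

/// Purely illustrative retry configuration for the relay chain RPC
/// interface; field names and defaults are assumptions, not existing API.
#[derive(Clone, Debug)]
pub struct RpcRetryConfig {
    /// How many passes over the external RPC server list before giving up.
    pub max_attempts: u32,
    /// Backoff before the first retry; doubled after each failed pass.
    pub initial_backoff: Duration,
}

impl Default for RpcRetryConfig {
    fn default() -> Self {
        // Mirrors the hardcoded behaviour discussed above:
        // 5 attempts with backoff growing from 1s to 16s.
        Self { max_attempts: 5, initial_backoff: Duration::from_secs(1) }
    }
}
```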