Failure in relay chain rpc interface's connection attempt to external RPC servers #5514
That this test always fails is a bit strange; it means that none of the three relay chain nodes given as connection targets came up. Did you check the relay chain node logs? Maybe they face some issue and that is why the collators cannot connect? Strange that it fails consistently for you; I ran this on my local machine just fine (and it's also running in CI).
For this, there is this issue: #4278
Both validators and RPC nodes look fine (no errors/warnings); it is just that the RPC nodes' RPC API comes online a bit later than when both collators attempt to connect to them. E.g. for the error I attached in the issue description, both collators attempted connections at roughly similar times (~3-4 seconds apart), while the RPC nodes' API came online ~20-30 seconds later.
Given there are no errors/warnings in the logs, do you suggest my reproductions point to deeper issues that degrade the start path of the RPC nodes and are visible only on my machine? To me it feels like plain slowness on the RPC nodes' start path (reproducible for some reason on my machine, maybe because it is slower than others), but nothing serious, based on the logs for the RPC nodes and validators.
I see. Looks like retry logic is also relevant for validators/RPC nodes going offline. The RPC WS client currently tries reconnecting to the next external RPC server, but at worst it will iterate through the entire external RPC server list once and then stop if all connections fail. The current issue is more related to the collator startup path, which would ideally be more insistent at the start through basic retry logic, but collators can benefit from the same retry logic when RPC nodes go offline (making the reconnections slightly better). One thing I'm curious about: do you have a rough idea how frequently all external RPC servers set for collators go offline for users in general? Asking just to get a sense of how nuanced this issue is, and whether configurable retry logic is a better fit than my initial suggestion of limited, hardcoded retry logic (which I thought was a good fit if the problem is mostly relevant to testing scenarios rather than production). A minimal sketch of the idea follows.
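For reference, here is a minimal sketch of the difference between the current single pass over the server list and a bounded retry with exponential backoff. This is not the actual cumulus code; `Connection`, `connect`, `try_connect_any`, and `connect_with_retries` are illustrative names, and the attempt count and backoff values mirror the numbers in the PR description below.

```rust
use std::time::Duration;

/// Hypothetical connection handle; stands in for a WebSocket RPC client.
struct Connection;

/// Placeholder for the real WebSocket dial; always fails in this sketch.
async fn connect(_url: &str) -> Result<Connection, ()> {
    Err(())
}

/// Current behaviour: one pass over the external RPC server list, returning
/// the first server that accepts a connection, or `None` if all of them fail.
async fn try_connect_any(urls: &[String]) -> Option<(usize, Connection)> {
    for (index, url) in urls.iter().enumerate() {
        if let Ok(conn) = connect(url).await {
            return Some((index, conn));
        }
    }
    None
}

/// Proposed behaviour: repeat the pass a bounded number of times, sleeping
/// with exponential backoff (1s, 2s, 4s, 8s, 16s) after each failed pass.
async fn connect_with_retries(urls: &[String]) -> Option<(usize, Connection)> {
    const RETRY_ATTEMPTS: u32 = 5;
    for attempt in 0..RETRY_ATTEMPTS {
        if let Some(found) = try_connect_any(urls).await {
            return Some(found);
        }
        tokio::time::sleep(Duration::from_secs(1 << attempt)).await;
    }
    None
}
```

With this shape, a collator keeps probing the same list for roughly 30 seconds in total before giving up, instead of aborting after a single pass.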
# Description

Adds retry logic that makes the RPC relay chain interface more reliable for the case of a collator connecting to external RPC servers.

Closes #5514
Closes #4278

The final solution is still being debated in #5514, so what this PR addresses might change (e.g. #4278 might require a more advanced approach).

## Integration

Users that start collators should barely observe any difference from this logic, since the retry logic applies only when the collators fail to connect to the RPC servers. In practice I assume the RPC servers are already live before the collators start, so the issue isn't visible.

## Review Notes

The added retry logic retries the connection to the RPC servers (of which there can be multiple). It lives in the cumulus/client/relay-chain-rpc-interface module, specifically in the RPC client logic (`ClientManager`). The retry logic is not configurable: it attempts to connect to the RPC servers 5 times, with exponential backoff between iterations, starting at a 1 second wait time and ending at 16 seconds. The same logic applies when an existing connection to an RPC server is dropped. There is a `ReconnectingWebsocketWorker` which ensures there is connectivity to at least one RPC node, and the retry logic strengthens this by insisting on connection attempts across the RPC server list for 5 iterations.

## Testing

- Tested manually by starting zombienet natively based on [0006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml) and observing that the collators don't fail anymore:

  ```bash
  zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml
  ```

- Added a unit test that exercises the retry logic for a client connection to a server that comes online after 10 seconds. The retry logic can wait as long as ~30 seconds in total, but that felt too long for a unit test. I am being conscious of CI time in case this test runs there, though I have not yet confirmed whether it does; I am happy to see suggestions around it. The test can be considered an integration test too, but it exercises crate-internal implementation, not the public API.

Example collator logs after the change:

```
2024-08-29 14:28:11.730 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770 INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"
```

---------

Signed-off-by: Iulian Barbu <[email protected]>
Co-authored-by: Sebastian Kunert <[email protected]>
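As an aside, a test along the lines described above could look roughly like this. This is an assumed shape using a plain TCP listener, not the PR's actual test; the port and the function name are arbitrary.

```rust
use std::time::Duration;
use tokio::net::{TcpListener, TcpStream};

/// Sketch of a delayed-server retry test: the "server" only starts
/// listening after 10 seconds, and a client retrying with exponential
/// backoff (1s, 2s, 4s, 8s, 16s) should still reach it.
#[tokio::test]
async fn retrying_client_reaches_late_server() {
    let addr = "127.0.0.1:9988"; // arbitrary port for the sketch

    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_secs(10)).await;
        let listener = TcpListener::bind(addr).await.unwrap();
        // Accept a single connection, then let the task end.
        let _ = listener.accept().await;
    });

    let mut connected = false;
    for attempt in 0u32..5 {
        if TcpStream::connect(addr).await.is_ok() {
            connected = true;
            break;
        }
        tokio::time::sleep(Duration::from_secs(1 << attempt)).await;
    }
    // Retries land at roughly t = 1, 3, 7, 15 seconds; the t = 15 attempt
    // is the first one after the listener comes up at t = 10.
    assert!(connected);
}
```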
I am trying to start zombienet natively based on https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml. The collators always fail on my machine with a connection error when trying to reach the external relay chain RPC servers.
There are ways to fix this in zombienet through better coordination when starting the nodes, but I feel that can lead to overengineering: e.g. it asks for starting nodes based on node dependencies, which other providers (k8s/podman) support in various ways, but which I personally still don't find straightforward.
The alternative is to add retry logic to the cumulus/client/relay-chain-rpc-interface logic that is sufficiently reliable for most cases; if it becomes relevant for cases other than zombienet-related testing, we could expose it in the node configuration for users to set as they need, as sketched below.
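If we do end up exposing it, the knob could be as small as a config struct carrying the attempt count and initial backoff. This is purely illustrative; none of these names exist in the codebase.

```rust
use std::time::Duration;

/// Purely illustrative retry configuration for the relay chain RPC
/// interface; field names and defaults are assumptions, not existing API.
#[derive(Clone, Debug)]
pub struct RpcRetryConfig {
    /// How many passes over the external RPC server list before giving up.
    pub max_attempts: u32,
    /// Backoff before the first retry; doubled after each failed pass.
    pub initial_backoff: Duration,
}

impl Default for RpcRetryConfig {
    fn default() -> Self {
        // Mirrors the hardcoded behaviour discussed above:
        // 5 attempts with backoff growing from 1s to 16s.
        Self { max_attempts: 5, initial_backoff: Duration::from_secs(1) }
    }
}
```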