Do not crash collator when all relay chain RPCs drop. #4278

Closed
skunert opened this issue Apr 25, 2024 · 0 comments · Fixed by #5515
Labels
I5-enhancement An additional feature request. T0-node This PR/Issue is related to the topic “node”.

skunert commented Apr 25, 2024

When starting a collator node with --relay-chain-rpc-urls, we expect that at least one of the specified nodes is available at all times.

We could look into tolerating occasional disconnects and continuing to run. This would probably lead to error prints on the parachain side, but that should be acceptable in this case.

@skunert skunert added T0-node This PR/Issue is related to the topic “node”. I5-enhancement An additional feature request. labels Apr 25, 2024
@skunert skunert added this to SDK Node Apr 25, 2024
@github-project-automation github-project-automation bot moved this to backlog in SDK Node Apr 25, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 3, 2024
# Description

Adds retry logic that makes the RPC relay chain interface more reliable
for the cases of a collator connecting to external RPC servers.

Closes #5514 
Closes #4278

The final solution is still being debated in #5514, so what this PR
addresses might change (e.g. #4278 might require a more advanced approach).

## Integration

Users who start collators should barely notice a difference, since the
retry logic applies only when the collators fail to connect to the RPC
servers. In practice I assume the RPC servers are already live before
the collators start, so the issue isn't visible.

## Review Notes

The added retry logic retries the connection to the RPC servers (there
can be several). It lives in the
cumulus/client/relay-chain-rpc-interface module, specifically in the RPC
client logic (`ClientManager`). The retry logic is not configurable: it
tries to connect to the RPC servers up to 5 times, with an exponential
backoff between iterations, starting with a 1 second wait and ending
with 16 seconds. The same logic applies when an existing connection to
an RPC server is dropped. There is a `ReconnectingWebsocketWorker` that
ensures connectivity to at least one RPC node, and the retry logic
strengthens this by going through the list of RPC servers up to 5 times.
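
As a rough illustration of the behaviour described above, here is a
minimal, synchronous sketch. It is not the actual `ClientManager` code:
the names `MAX_ITERATIONS`, `backoff_delay` and `connect_with_retries`
are hypothetical, the `connect` and `sleep` callbacks stand in for the
real websocket connection and async timer, and sleeping after the last
pass is an assumption made to match the "ending with 16 seconds"
wording.

```rust
use std::time::Duration;

/// Hypothetical constant: how many passes are made over the whole URL list.
const MAX_ITERATIONS: u32 = 5;

/// Wait time after pass `iteration` (0-based): 1s, 2s, 4s, 8s, 16s.
fn backoff_delay(iteration: u32) -> Duration {
    Duration::from_secs(1u64 << iteration)
}

/// Tries every URL in order; if a whole pass fails, waits with an
/// exponentially growing delay and starts the next pass.
/// Returns `None` once `MAX_ITERATIONS` passes have failed.
///
/// `connect` and `sleep` are injected so the sketch stays free of any
/// networking or async runtime (an assumption made for illustration).
fn connect_with_retries<C, E>(
    urls: &[&str],
    mut connect: impl FnMut(&str) -> Result<C, E>,
    mut sleep: impl FnMut(Duration),
) -> Option<C> {
    for iteration in 0..MAX_ITERATIONS {
        for url in urls {
            if let Ok(client) = connect(url) {
                return Some(client);
            }
        }
        // No server answered during this pass: back off before the next one.
        sleep(backoff_delay(iteration));
    }
    None
}
```

With these assumptions, the delays add up to 1 + 2 + 4 + 8 + 16 = 31
seconds in the worst case, which is in line with the roughly 30 seconds
mentioned in the testing notes. Passing `std::thread::sleep` as the
`sleep` argument gives real waiting; a test can pass a closure that only
records the requested delays.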

## Testing

- This was tested manually by starting zombienet natively based on
[0006-rpc_collator_builds_blocks.toml](https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml)
and observing that the collators no longer fail:

```bash
zombienet -l text --dir zbn-run -f --provider native spawn polkadot-sdk/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml
```

- Added a unit test that exercises the retry logic for a client
connection to a server that comes online after 10 seconds. The retry
logic can wait for as long as 30 seconds, but I thought that was too
much for a unit test. I am just being conscious of CI time if it runs
this test, and I am happy to hear suggestions. I am also not sure
whether it runs in CI; I haven't figured that out entirely yet. The test
could be considered an integration test too, but it exercises
crate-internal implementation, not the public API.

Example collator logs after the change:
```
2024-08-29 14:28:11.730  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=0 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:12.737  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:12.739  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:12.755  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=1 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:14.758  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:14.759  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:14.760  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=2 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:18.766  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=0 url="ws://127.0.0.1:43617/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=1 url="ws://127.0.0.1:37965/"
2024-08-29 14:28:18.768  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=3 index=2 url="ws://127.0.0.1:37427/"
2024-08-29 14:28:26.770  INFO tokio-runtime-worker reconnecting-websocket-client: [Parachain] Trying to connect to next external relaychain node. current_iteration=4 index=0 url="ws://127.0.0.1:43617/"
```

---------

Signed-off-by: Iulian Barbu <[email protected]>
Co-authored-by: Sebastian Kunert <[email protected]>
@github-project-automation github-project-automation bot moved this from backlog to done in SDK Node Sep 3, 2024
@iulianbarbu iulianbarbu self-assigned this Sep 3, 2024
x3c41a pushed a commit that referenced this issue Sep 4, 2024