Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make supervisor more resilient to node going down #903

Merged
merged 16 commits into from
May 6, 2021

Conversation

romac
Copy link
Member

@romac romac commented May 6, 2021

Follow-up to #895
See also #871

Description

The supervisor should now be more resilient to a node going down temporarily.
Instead of sitting there waiting for events via the subscription, the supervisor is
now notified that something went wrong, while the event monitor will attempt to
reconnect for a limited time (max retries with a delay between attempts).

Errors yielded by the client and packet workers are now caught at the top-level run loop of the worker,
and printed to the console rather than causing the worker to exits. This is quite brittle still
and will need more work and thought put into for the next milestone.

Tested with

  1. setup the chains, clients, connections, and channels:
❯ ./scripts/dev-env ibc-0 ibc-1 ibc-2
❯ hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer -o unordered
❯ hermes create channel ibc-1 ibc-2 --port-a transfer --port-b transfer -o unordered
  1. in a new console, with the debug log level in the config:
❯ hermes start-multi
  1. send some packets if you want:
❯ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 1000 -n 5
...
  1. kill one of the node, eg. ibc-1:
❯ ps aux | rg gaiad | rg ibc-1 | awk '{ print $2 }' | xargs -I{} kill -9 {}
  1. watch the output of start-multi, it should show some errors about the WebSocket connection being down and perhaps some RPC queries failing but keep going and retrying to connect to the WebSocket.

  2. start the nodes again (will automatically kill the remaining ones):

❯ ./scripts/dev-env ibc-0 ibc-1 ibc-2
❯ hermes create channel ibc-0 ibc-1 --port-a transfer --port-b transfer -o unordered
❯ hermes create channel ibc-1 ibc-2 --port-a transfer --port-b transfer -o unordered
  1. after a while it should be able to reconnect and you can send some packets
❯ hermes tx raw ft-transfer ibc-1 ibc-0 transfer channel-0 9999 1000 -n 5

For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@romac romac marked this pull request as ready for review May 6, 2021 07:55
@romac romac requested a review from ancazamfir as a code owner May 6, 2021 07:55
@romac romac requested a review from adizere May 6, 2021 07:55
relayer/src/event/monitor.rs Outdated Show resolved Hide resolved
relayer/src/chain/handle.rs Outdated Show resolved Hide resolved
@ancazamfir
Copy link
Collaborator

I still see an error when starting with 3 chains in config but only two gaia processes. The output is:

May 06 15:22:30.864  INFO ibc_relayer_cli::commands: Using default configuration from: '.hermes/config.toml'
May 06 15:22:30.879 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 06 15:22:30.880 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 06 15:22:30.881 DEBUG ibc_relayer::event::monitor: subscribed to all queries
May 06 15:22:30.881  INFO ibc_relayer::event::monitor: starting event monitor chain.id=ibc-0
May 06 15:22:30.881 TRACE ibc_relayer::registry: spawned chain runtime for chain identifier ibc-0
May 06 15:22:30.885 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'Tx'
May 06 15:22:30.886 DEBUG ibc_relayer::event::monitor: subscribing to query: tm.event = 'NewBlock'
May 06 15:22:30.887 DEBUG ibc_relayer::event::monitor: subscribed to all queries
May 06 15:22:30.887 TRACE ibc_relayer::registry: spawned chain runtime for chain identifier ibc-1
May 06 15:22:30.887  INFO ibc_relayer::event::monitor: starting event monitor chain.id=ibc-1
Error: RPC error to endpoint http://127.0.0.1:26457/: error trying to connect: tcp connect error: Connection refused (os error 61) (code: 0)

I believe the error comes from here (see ?):
https://github.com/informalsystems/ibc-rs/blob/ecbdb07a8bbed69e9cd9f88778a4b50edf0f5567/relayer/src/supervisor.rs#L332-L334

We should ignore the error...maybe create subscription as part of spawn_workers() below since we iter the chains there anyway.

@romac
Copy link
Member Author

romac commented May 6, 2021

@ancazamfir Should be fixed in 493191b.

@ancazamfir
Copy link
Collaborator

Another nice to have is this: If there are 3 chains in config and only two chains/ gaiad nodes are up then I should see the same behavior regardless on how this state was reached. Right now:

  • when hermes starts with 2 nodes up and 3 chains, it tries to init 3rd chain and then it relays only for the two chains. It doesn't retry to reconnect with the third chain.
  • when hermes starts with 3 nodes and then one is killed, it relays over the two chains but keeps retrying to reconnect with the third.

@ancazamfir
Copy link
Collaborator

Also the trying to reconnect to WebSocket end... and error when reconnecting are too often maybe especially because they are warn/ error and happen every 5 sec (exponential backoff would be nice here) or maybe give up after a while.

Copy link
Member

@adizere adizere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should merge this.

Will open a follow-up PR to address Anca's comment regarding the backoff for the worker (error when reconnecting).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants