Make supervisor more resilient to node going down #903
I still see an error when starting with 3 chains in config but only two gaia processes. The output is:
I believe the error comes from here (see We should ignore the error...maybe create subscription as part of |
@ancazamfir Should be fixed in 493191b.
Another nice-to-have is this: if there are 3 chains in the config and only two chains/gaiad nodes are up, then I should see the same behavior regardless of how this state was reached. Right now:
Also the |
I think we should merge this.
Will open a follow-up PR to address Anca's comment regarding the backoff for the worker (error when reconnecting).
Follow-up to #895
See also #871
Description
The supervisor should now be more resilient to a node going down temporarily.
Instead of sitting there waiting for events via the subscription, the supervisor is
now notified that something went wrong, while the event monitor will attempt to
reconnect for a limited time (max retries with a delay between attempts).
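As a rough illustration of the "max retries with a delay between attempts" idea, here is a minimal sketch. It is not the code from this PR: `retry_with_delay`, `max_attempts`, and the `connect` closure are hypothetical names, and a real event monitor would presumably reconnect a WebSocket client rather than call a plain closure.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not part of the relayer): calls `connect` until it
/// succeeds or `max_attempts` attempts have been made, sleeping `delay`
/// between failed attempts. At least one attempt is always made.
/// Returns the first success, or the last error once attempts are exhausted.
pub fn retry_with_delay<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut connect: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match connect() {
            Ok(value) => return Ok(value),
            // Out of attempts: give up and surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            // Otherwise wait a bit and try again.
            Err(_) => {
                attempt += 1;
                thread::sleep(delay);
            }
        }
    }
}
```

Returning the last error, rather than blocking forever, is what lets a caller such as the supervisor learn that the reconnection ultimately failed.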
Errors yielded by the client and packet workers are now caught at the top-level run loop of the worker,
and printed to the console rather than causing the worker to exit. This is still quite brittle
and will need more work and thought put into it for the next milestone.
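The pattern of catching errors at the top-level run loop can be sketched roughly as follows. This is only an illustration under assumptions: `run_worker`, `commands`, and `handle` are hypothetical names and do not mirror the actual worker types in the relayer.

```rust
/// Hypothetical worker loop (not the relayer's actual code): feeds each
/// command to `handle`, printing any error to the console instead of
/// returning early. Returns how many commands failed.
pub fn run_worker<I>(
    commands: I,
    mut handle: impl FnMut(I::Item) -> Result<(), String>,
) -> usize
where
    I: IntoIterator,
{
    let mut failures = 0;
    for cmd in commands {
        if let Err(e) = handle(cmd) {
            // Log and keep going: one failed command must not stop the worker.
            eprintln!("worker error (continuing): {}", e);
            failures += 1;
        }
    }
    failures
}
```

The trade-off is the brittleness mentioned above: errors are only logged, so a persistent failure repeats on every command until some backoff or escalation policy is added.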
Tested with
ibc-1:
watch the output of start-multi, it should show some errors about the WebSocket connection being down and perhaps some RPC queries failing, but keep going and retrying to connect to the WebSocket.
start the nodes again (will automatically kill the remaining ones):