Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event monitor crashes when the RPC node goes away #863

Closed
5 tasks
ancazamfir opened this issue Apr 28, 2021 · 8 comments · Fixed by #895
Closed
5 tasks

Event monitor crashes when the RPC node goes away #863

ancazamfir opened this issue Apr 28, 2021 · 8 comments · Fixed by #895
Assignees
Labels
A: bug Admin: something isn't working E: gravity External: related to Gravity DEX I: logic Internal: related to the relaying logic
Milestone

Comments

@ancazamfir
Copy link
Collaborator

Crate

relayer

Summary of Bug

This was discovered by the DEX team during load testing and happens on hermes multi path startup.

Trace is:

The application panicked (crashed).
Message: called `Result::unwrap()` on an `Err` value: Error { code: ClientInternalError, message: "Client internal error", data: Some("failed to hear back from WebSocket driver") }
Location: relayer/src/chain/cosmos.rs:378
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it. Run with RUST_BACKTRACE=full to include source snippets.

Version

Steps to Reproduce

Acceptance Criteria


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@ancazamfir ancazamfir added this to the 04.2021 milestone Apr 28, 2021
@adizere adizere added A: bug Admin: something isn't working I: logic Internal: related to the relaying logic labels Apr 29, 2021
@adizere adizere self-assigned this Apr 29, 2021
@nodebreaker0-0
Copy link

Hello, I'm bug informant.

To reproduce this crashe, see

  1. Set up at least 3 chains

  2. Assuming chain-a, chain-b, chain-c,

  3. connect a and b , connect a and c
    hermes -c config.toml create channel chain-a chain-b --port-a transfer --port-b transfer -o unordered
    hermes -c config.toml create channel chain-a chain-c --port-a transfer --port-b transfer -o unordered

  4. Run Hermes with start-multi.

  5. Exit the chain-c

Then the crashe will be reproduced.

@andynog andynog added the E: gravity External: related to Gravity DEX label Apr 29, 2021
@adizere
Copy link
Member

adizere commented Apr 29, 2021

Thank you for the feedback, @nodebreaker0-0.

I would like to know what would be the required behavior in your opinion, if running hermes start-multi and one of the chains crashes. When that chain crashes, then Hermes cannot continue relaying packets for all paths involving that chain. Now I can imagine two options:

  1. The hermes relayer process could exit with an error.
  2. The hermes relayer process could continue operating if there are other paths that are alive (paths which still have the chains at both ends alive).

Any thoughts?

@ancazamfir
Copy link
Collaborator Author

I would say 2.
Also, it is possible that the chain is fine (continues to produce blocks) but the full node "crashes". In this case, once the full node is operational, would hermes in case 2. (and with your fix) be able to resume relaying over the paths spanning that chain?

@romac
Copy link
Member

romac commented Apr 29, 2021

We could perhaps introduce a thread to monitor the state of all configured chains and trigger a reload/restart when a crashed node comes back online.

@ancazamfir
Copy link
Collaborator Author

As an initial fix, maybe we do 2 without resuming relaying for the affected chain?

@gamarin2
Copy link

I confirm that 2. is much preferred.

@nodebreaker0-0
Copy link

In my opinion, number two is appropriate.

It is necessary to separate the rpc connection for each chain independently.

For example

Fundamental Problem Solving

  1. connect a and b , connect a and c

  2. Chain-c is dead

  3. Terminates all relay processes associated with chain-c.

  4. But the relay process of a and b must be alive.

If you think this is very complicated development.

Temporary Measure

Multiple rpc nodes in config.toml are required.

Then, if node 1 is unable to communicate rpc, you can try to connect to node 2 (or up to number 3 and 4).

In conclusion, however, the fundamental problem must be solved.

@ancazamfir
Copy link
Collaborator Author

What are the minimum requirements for recovery when the full node comes back? Is manual restart ok, similar to when a new chain is added? Or need something like @romac proposes here #863 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: bug Admin: something isn't working E: gravity External: related to Gravity DEX I: logic Internal: related to the relaying logic
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants