[node] Add checks for RPC node health #475

Open
wants to merge 1 commit into
base: master

AntiD2ta

Why are these changes needed?

EigenDA relies on a healthy JSON-RPC endpoint for many of its operations. Currently, when that endpoint is unhealthy, EigenDA reports errors, but they do not indicate whether the JSON-RPC endpoint is the cause. The EigenDA operator is left to guess whether the problem is an EigenDA bug, an issue with the Dispersers, the JSON-RPC endpoint, networking on the machine, etc.

If EigenDA runs health checks against the JSON-RPC node (by calling the eth_syncing method), troubleshooting becomes considerably easier.

This PR adds such health checks in the following places:

  • When initializing the EthClient
  • As a goroutine periodically checking if the JSON-RPC endpoint is healthy
  • Optionally, when a significant operation fails (ValidateBatch), to check whether the failure was caused by an unhealthy endpoint
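
For illustration only, a health check based on eth_syncing might look like the sketch below. It uses just the Go standard library and a stub server, so it is not the code in this PR; `checkSynced`, `syncedFromResult` and `rpcResponse` are hypothetical names. It assumes the standard JSON-RPC behaviour in which eth_syncing returns `false` when the node is not syncing and a progress object otherwise.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// rpcResponse is the subset of a JSON-RPC 2.0 response this sketch needs.
type rpcResponse struct {
	Result json.RawMessage `json:"result"`
	Error  *struct {
		Message string `json:"message"`
	} `json:"error"`
}

// syncedFromResult interprets the "result" field of eth_syncing: the literal
// false means the node is not syncing; an object means a sync is in progress.
func syncedFromResult(result []byte) (bool, error) {
	trimmed := bytes.TrimSpace(result)
	if len(trimmed) == 0 || string(trimmed) == "null" {
		return false, fmt.Errorf("empty eth_syncing result")
	}
	var syncing bool
	if err := json.Unmarshal(trimmed, &syncing); err == nil {
		return !syncing, nil
	}
	var progress map[string]json.RawMessage
	if err := json.Unmarshal(trimmed, &progress); err != nil {
		return false, fmt.Errorf("unexpected eth_syncing result %s: %w", trimmed, err)
	}
	return false, nil // a progress object: the node is still syncing
}

// checkSynced (hypothetical helper) posts eth_syncing to the given endpoint.
func checkSynced(ctx context.Context, client *http.Client, url string) (bool, error) {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_syncing","params":[]}`)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return false, fmt.Errorf("RPC endpoint unreachable: %w", err)
	}
	defer resp.Body.Close()
	var out rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, fmt.Errorf("decoding eth_syncing response: %w", err)
	}
	if out.Error != nil {
		return false, fmt.Errorf("eth_syncing failed: %s", out.Error.Message)
	}
	return syncedFromResult(out.Result)
}

func main() {
	// Stub endpoint standing in for a real JSON-RPC node that is fully synced.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, `{"jsonrpc":"2.0","id":1,"result":false}`)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	synced, err := checkSynced(ctx, &http.Client{Timeout: 5 * time.Second}, srv.URL)
	fmt.Printf("synced=%t err=%v\n", synced, err)
}
```

Distinguishing "unreachable", "RPC error" and "still syncing" in the returned error is what lets the operator tell an endpoint problem apart from the other suspects listed above.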

The following logs show these changes in action (tested on Nethermind's Holesky EigenDA node):

JSON-RPC node unhealthy

2024/04/11 18:35:02 Initializing Node
time=2024-04-11T18:35:03.953Z level=INFO source=/app/common/geth/instrumented_client.go:52 msg="Checking if eth client is online" online=false err=<nil>
2024/04/11 18:35:03 application failed: cannot create chain.Client: the RPC node is not synced. The node will not be able to process batches successfully until it is synced

Goroutine check

time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:469 msg="Start checkRPCNodeSynced goroutine in background to periodically check if the RPC node is synced and online" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:450 msg="Start checkCurrentNodeIp goroutine in background to detect the current public IP of the operator node" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:252 msg="Start expireLoop goroutine in background to periodically remove expired batches on the node" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:427 msg="Start checkRegisteredNodeIpOnChain goroutine in background to subscribe the operator socket change events onchain" component=Node
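
As a rough sketch of how a periodic check such as checkRPCNodeSynced could be structured (not the code in this PR), the loop below runs a check on a ticker until its context is cancelled. `watchRPCNode`, `syncCheck` and `runDemo` are hypothetical names, and the check itself is a stub that always fails.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// syncCheck reports whether the RPC node is synced (hypothetical signature).
type syncCheck func(ctx context.Context) (bool, error)

// watchRPCNode calls check every interval and hands the outcome to report.
// It returns when ctx is cancelled; in a node it would run as a goroutine.
func watchRPCNode(ctx context.Context, interval time.Duration, check syncCheck, report func(synced bool, err error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if ctx.Err() != nil {
				return // cancelled while a tick was already pending
			}
			synced, err := check(ctx)
			report(synced, err)
		}
	}
}

// runDemo drives the watcher with a stub check and stops after n reports.
func runDemo(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	stub := func(context.Context) (bool, error) {
		return false, errors.New("connection refused")
	}
	reports := 0
	watchRPCNode(ctx, time.Millisecond, stub, func(synced bool, err error) {
		reports++
		fmt.Printf("check %d: synced=%t err=%v\n", reports, synced, err)
		if reports == n {
			cancel()
		}
	})
	return reports
}

func main() {
	fmt.Println("reports:", runDemo(3))
}
```

Passing the check in as a function keeps the loop independent of how the endpoint is queried, which also makes it straightforward to exercise in unit tests.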

This fork uses Bump v0.6.1 (#458) as a stable reference.

This PR also introduces a chain ID check at EigenDA initialization, so that Operators can verify that their target JSON-RPC endpoint points to the intended network:

2024/04/11 18:09:55 Initializing Node
time=2024-04-11T18:09:57.489Z level=INFO source=/app/node/node.go:108 msg="Detected network of configured RPC Node" network=Mainnet
2024/04/11 18:09:57 application failed: no contract code at given address
time=2024-04-11T18:09:57.727Z level=ERROR source=/app/core/eth/tx.go:750 msg="Failed to fetch DelegationManager address" component=Transactor err="no contract code at given address"

In the logs above, EigenDA is meant to run on Holesky, but a Mainnet JSON-RPC node was configured. The node is synced and healthy, but the Transactor fails because it is on Mainnet instead of Holesky. The detected-network log line helps the Operator identify the issue quickly.
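
A minimal sketch of such a network check, assuming the chain ID arrives as the hex string returned by the eth_chainId method ("0x1" for Mainnet, "0x4268", i.e. 17000, for Holesky). The helper names and the `knownNetworks` table are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// knownNetworks maps chain IDs to display names (illustrative subset).
var knownNetworks = map[uint64]string{
	1:     "Mainnet",
	17000: "Holesky",
}

// parseChainID converts a hex quantity such as "0x4268" into a number.
func parseChainID(hexID string) (uint64, error) {
	id, err := strconv.ParseUint(strings.TrimPrefix(hexID, "0x"), 16, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid chain ID %q: %w", hexID, err)
	}
	return id, nil
}

// networkName returns a readable name for a chain ID.
func networkName(id uint64) string {
	if name, ok := knownNetworks[id]; ok {
		return name
	}
	return fmt.Sprintf("unknown (chain ID %d)", id)
}

// verifyNetwork fails fast when the RPC node is on a different network than
// the one the operator intended to use.
func verifyNetwork(hexID string, expected uint64) error {
	id, err := parseChainID(hexID)
	if err != nil {
		return err
	}
	if id != expected {
		return fmt.Errorf("RPC node is on %s, expected %s", networkName(id), networkName(expected))
	}
	return nil
}

func main() {
	fmt.Println(verifyNetwork("0x1", 17000))    // Mainnet endpoint, Holesky expected
	fmt.Println(verifyNetwork("0x4268", 17000)) // Holesky endpoint, Holesky expected
}
```

Comparing against an expected chain ID at startup would turn the indirect "no contract code at given address" failure into an explicit network mismatch message.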

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@jianoaix
Contributor

Thank you for the contribution, @AntiD2ta. From what I understand, the periodic health check of the chain RPC should be handled by #502.
Operators can set up monitoring/alerting on error logs containing "Failed to query chain RPC for...", which are logged when the periodic check fails.
