[node] Add checks for RPC node health #475

AntiD2ta · 2024-04-11T18:54:27Z

Why are these changes needed?

EigenDA relies on a healthy JSON-RPC endpoint for many of its operations. Today, if that endpoint is unhealthy, EigenDA throws errors, but those errors do not indicate whether the JSON-RPC endpoint is the cause. The EigenDA operator is left to guess whether the problem is an EigenDA bug, an issue on the Dispersers, the JSON-RPC endpoint, a networking issue on the machine, etc.

If EigenDA conducts health checks on the JSON-RPC node (by calling the eth_syncing method), troubleshooting becomes considerably easier.
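As an illustration of what an eth_syncing based check has to decide, here is a minimal sketch that interprets a raw JSON-RPC response body. It is not the code in this PR: `rpcResponse` and `isSynced` are hypothetical names, and the sketch only assumes the standard behavior of eth_syncing, which returns the literal `false` when the node is not syncing and a progress object while it is.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// rpcResponse models only the JSON-RPC 2.0 response fields this sketch needs.
// (Hypothetical type, not taken from the EigenDA code base.)
type rpcResponse struct {
	Result json.RawMessage `json:"result"`
	Error  *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// isSynced reports whether an eth_syncing response body says the node has
// finished syncing. (Hypothetical helper for illustration.)
func isSynced(body []byte) (bool, error) {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode eth_syncing response: %w", err)
	}
	if resp.Error != nil {
		return false, fmt.Errorf("eth_syncing returned error %d: %s",
			resp.Error.Code, resp.Error.Message)
	}
	if len(resp.Result) == 0 {
		return false, fmt.Errorf("eth_syncing response has no result field")
	}
	// The literal false means "not syncing"; anything else (a progress
	// object with startingBlock/currentBlock/highestBlock) means the node
	// is still catching up.
	return bytes.Equal(bytes.TrimSpace(resp.Result), []byte("false")), nil
}
```

Treating an RPC-level error as "unknown" rather than "unsynced" keeps the two failure modes distinguishable in logs, which is the point of the health check.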

This PR adds such health checks in the following places:

- When initializing the EthClient
- In a goroutine that periodically checks whether the JSON-RPC endpoint is healthy
- Optionally, when a significant operation (ValidateBatch) fails, to check whether the failure was due to the endpoint being unhealthy

The following logs show these changes in action (tested on Nethermind's Holesky EigenDA node):

JSON-RPC node unhealthy

2024/04/11 18:35:02 Initializing Node
time=2024-04-11T18:35:03.953Z level=INFO source=/app/common/geth/instrumented_client.go:52 msg="Checking if eth client is online" online=false err=<nil>
2024/04/11 18:35:03 application failed: cannot create chain.Client: the RPC node is not synced. The node will not be able to process batches successfully until it is synced

Goroutine check

time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:469 msg="Start checkRPCNodeSynced goroutine in background to periodically check if the RPC node is synced and online" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:450 msg="Start checkCurrentNodeIp goroutine in background to detect the current public IP of the operator node" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:252 msg="Start expireLoop goroutine in background to periodically remove expired batches on the node" component=Node
time=2024-04-11T18:36:38.050Z level=INFO source=/app/node/node.go:427 msg="Start checkRegisteredNodeIpOnChain goroutine in background to subscribe the operator socket change events onchain" component=Node

This fork uses Bump v0.6.1 (#458) as a stable reference.

This PR also introduces a Chain ID check at EigenDA initialization so that Operators can double-check that their target JSON-RPC endpoint points to the proper network:

2024/04/11 18:09:55 Initializing Node
time=2024-04-11T18:09:57.489Z level=INFO source=/app/node/node.go:108 msg="Detected network of configured RPC Node" network=Mainnet
2024/04/11 18:09:57 application failed: no contract code at given address
time=2024-04-11T18:09:57.727Z level=ERROR source=/app/core/eth/tx.go:750 msg="Failed to fetch DelegationManager address" component=Transactor err="no contract code at given address"

In the above logs, EigenDA is intended to run on Holesky, but a Mainnet JSON-RPC node was configured. The node is synced and healthy, but the Transactor fails because the endpoint is on Mainnet instead of Holesky. The detected-network log line tells the Operator quickly what the issue is.

Checks

- I've made sure the lint is passing in this PR.
- I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.

Testing Strategy
- Unit tests
- Integration tests
- This PR is not tested :(

jianoaix · 2024-04-19T22:24:11Z

Thank you for the contribution, @AntiD2ta. From what I understand, the periodic health check of the Chain RPC should be handled via this: #502
Operators can create monitoring/alerting for error logs containing "Failed to query chain RPC for...", which will be logged if the periodic check fails.

[node] Add checks for RPC node health

ada9799

shrimalmadhur requested review from jianoaix and ian-shim April 11, 2024 18:59
