Discussion: RPC server automatic recovery #38

hiveuprss · 2023-05-19T19:25:40Z

Wanted to start a discussion to see if anyone has ideas on how to improve the reliability of the RPC server.

The recent issues with batch requests left the RPC server in a state where it stops responding and does not recover on its own. PM2 does not detect the problem and does not restart the service automatically, so the node operator has to notice the outage and run pm2 restart by hand. Not ideal when it happens in the middle of the night.

How can we make it more resilient? Can the node detect the problem and restart itself? Currently, the death of the RPC server does not crash the whole service, so PM2/Docker has no way of knowing that a restart is needed.

This probably needs more investigation to understand why the server stops responding without throwing an exception.
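
The self-detection idea above could be sketched roughly as follows. This is a minimal illustration, not code from the node: `FailureTracker`, `probeWithTimeout`, `startWatchdog`, and all the default intervals are hypothetical names and values. The probe and the reaction are passed in as functions, so the sketch assumes nothing about how the RPC server is actually implemented.

```typescript
// Hypothetical watchdog sketch; none of these names come from the node's codebase.

type Probe = () => Promise<boolean>;

/** Counts consecutive failed probes and reports when the threshold is reached. */
class FailureTracker {
  private consecutiveFailures = 0;

  constructor(private readonly maxFailures: number) {}

  /** Records one probe result; returns true when a restart should be requested. */
  record(ok: boolean): boolean {
    this.consecutiveFailures = ok ? 0 : this.consecutiveFailures + 1;
    return this.consecutiveFailures >= this.maxFailures;
  }
}

/** Runs a probe, treating a rejection, a throw, or a timeout as a failure. */
function probeWithTimeout(probe: Probe, timeoutMs: number): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    // A hung server never answers, so the timer is what catches the silent case.
    const timer = setTimeout(() => resolve(false), timeoutMs);
    Promise.resolve()
      .then(probe)
      .then(
        (ok) => { clearTimeout(timer); resolve(ok); },
        () => { clearTimeout(timer); resolve(false); },
      );
  });
}

/** Starts the periodic check; returns a function that stops it. */
function startWatchdog(
  probe: Probe,
  onUnhealthy: () => void,
  opts = { intervalMs: 10000, timeoutMs: 3000, maxFailures: 3 }, // assumed defaults
): () => void {
  const tracker = new FailureTracker(opts.maxFailures);
  let running = false;
  const handle = setInterval(async () => {
    if (running) return; // skip this tick if the previous probe is still in flight
    running = true;
    try {
      const ok = await probeWithTimeout(probe, opts.timeoutMs);
      if (tracker.record(ok)) onUnhealthy();
    } finally {
      running = false;
    }
  }, opts.intervalMs);
  return () => clearInterval(handle);
}
```

In practice the probe might send a small request to the local RPC port, and `onUnhealthy` might log the event and call `process.exit(1)`, so that PM2 (which restarts exited processes by default) or Docker with a restart policy can bring the service back. Requiring several consecutive failures avoids restarting on a single slow response. This only covers the symptom; it does not explain why the server hangs in the first place.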

Rishi556 · 2023-05-20T01:44:05Z

I'd prefer something other than automated restarts. That doesn't seem like a fix, just a band-aid.

forkyishere · 2023-05-20T05:29:16Z

Has the reason for the RPC dying been identified? That might be the more important problem to solve. I agree that plain restarts might not be a good path. They usually mean less attention gets paid to the node itself...

Monitoring of the RPC dying could be improved, so that pm2 knows which components of an app have died... not sure if this is easy to do within a single app on pm2... sort of like how systemctl allows for process dependencies and other services.
