
Node randomly stops syncing, after restart it's fine (for some time) #1962

Closed
5 tasks
Tracked by #2405
bb4L opened this issue Dec 15, 2022 · 26 comments
Assignees
Labels
scope: comet-bft, type: bug (Issues that need priority attention -- something isn't working)

Comments

@bb4L

bb4L commented Dec 15, 2022

Summary of Bug

I'm running a Cosmos node and occasionally (now at least once a day) it just stops syncing. In the logs I can see entries like

2:11PM ERR Connection failed @ sendRoutine conn={"Logger":{}} err="pong timeout" module=p2p peer={"id":"5dc6a28f2caff8e61c47c1c9b658e7b1ea5fbfd9","ip":"5.9.42.116","port":26656}

and

2:11PM ERR Stopping peer for error err=EOF module=p2p peer={"Data":{},"Logger":{}}

It doesn't recover by itself; the only way to get it back in sync is to restart it (the container).

EDIT:
A restart doesn't always help immediately; I get the same logs for the connections.

I also just tried with a newly downloaded addrbook.json.

Version

v7.1.0

Steps to Reproduce

I'm just running a node with gaiad start --x-crisis-skip-assert-invariants


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
  • Is a spike necessary to map out how the issue should be approached?
@joslee7410

Same here. It seems the node stops getting new blocks and the Tendermint RPC (default 26656) goes down as well, while the REST API (default 1317) keeps working.

@yj0x0x

yj0x0x commented Jan 2, 2023

Sync status is not stable, and syncing is very slow.

@Daniel1984

Same issue here. I tried using addrbook.json in the config dir, then assigning a dedicated list of seeds in config/config.toml; both end up running fine for some time after a restart, and then the node just stops accepting RPC calls and becomes unresponsive.

@Daniel1984

@bb4L have you managed to resolve the issue? Not only did it happen in my prod setup, but I also ran into the same issue on a fresh install.

@bb4L
Author

bb4L commented Jan 3, 2023

@bb4L have you managed to resolve the issue? Not only did it happen in my prod setup, but I also ran into the same issue on a fresh install.

Unfortunately no, it's still happening quite often. I have two different nodes, one of them being a fresh install (let's say a couple of weeks old...)

@nddeluca

We also ran into this issue -- our node would never make it 24 hours without halting syncing.

We resolved it by switching to rocksdb, using the address book at https://polkachu.com/addrbooks/cosmos, and increasing the number of outbound peers to 200. The node has now been stable for 4+ days.
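For reference, the database and peer settings mentioned above live in config.toml. A minimal sketch of applying these mitigations, assuming the default ~/.gaia home directory and a gaiad binary built with RocksDB support (paths and values are illustrative, not a recommendation):

    # Sketch only: home directory and values are assumptions; restart gaiad afterwards.
    CONFIG="$HOME/.gaia/config/config.toml"

    # Switch the Tendermint database backend to rocksdb
    # (only works if gaiad was compiled with rocksdb support).
    sed -i 's/^db_backend = .*/db_backend = "rocksdb"/' "$CONFIG"

    # Raise the outbound peer limit from the default of 10.
    sed -i 's/^max_num_outbound_peers = .*/max_num_outbound_peers = 200/' "$CONFIG"

    # Drop in a freshly downloaded address book (e.g. from
    # https://polkachu.com/addrbooks/cosmos), then restart the node:
    # cp /path/to/addrbook.json "$HOME/.gaia/config/addrbook.json"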

@mpoke mpoke added the type: bug (Issues that need priority attention -- something isn't working) label Jan 20, 2023
@mpoke mpoke added this to Cosmos Hub Jan 20, 2023
@github-project-automation github-project-automation bot moved this to 🩹 Triage in Cosmos Hub Jan 20, 2023
@mpoke mpoke moved this from 🩹 Triage to 📥 Todo in Cosmos Hub Jan 20, 2023
@mpoke mpoke moved this from 📥 Todo to 🩹 Triage in Cosmos Hub Jan 20, 2023
@Daniel1984

That didn't help, and according to their dev team it should not be the root cause of the issue. The only way to keep it running was a bash script that periodically queries the :1317/cosmos/base/tendermint/v1beta1/blocks/latest resource and, when that times out, kills gaiad and restarts it.
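A minimal sketch of such a watchdog, assuming gaiad runs under systemd as gaiad.service and the REST API listens on localhost:1317 (service name, log path, and intervals are illustrative; for a containerized node, docker restart <container> would take the place of the systemctl call):

    #!/usr/bin/env bash
    # Watchdog sketch: restart gaiad when the latest-block REST endpoint stops answering.
    ENDPOINT="http://localhost:1317/cosmos/base/tendermint/v1beta1/blocks/latest"

    while true; do
        # --max-time bounds the whole request, so a hung node makes curl exit non-zero.
        if ! curl --silent --fail --max-time 15 "$ENDPOINT" > /dev/null; then
            echo "$(date -Is) latest-block query failed; restarting gaiad" >> /var/log/gaiad-watchdog.log
            systemctl restart gaiad.service
            sleep 120   # give the node time to come back before probing again
        fi
        sleep 30
    done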

@mmulji-ic
Contributor

@thanethomson is this a known issue for Tendermint?

@joslee7410

Sorry to disturb, but has nobody found a solution yet?

Is it possible that using port 1317 (the API port) might cause the service to stop syncing?

Thank you.

@mpoke
Contributor

mpoke commented Mar 9, 2023

@bb4L could you please let us know if the issue still exists with v8.0.1?

@mpoke mpoke moved this from 🩹 Triage to 🛑 Blocked in Cosmos Hub Mar 9, 2023
@bb4L
Author

bb4L commented Mar 9, 2023

@bb4L could you please let us know if the issue still exists with v8.0.1?

@mpoke yes, it's still the same

@mpoke mpoke moved this from 🛑 Blocked to 📥 Todo in Cosmos Hub Mar 9, 2023
@mmulji-ic
Contributor

Hi @bb4L we're checking with the Tendermint team. I'll get back to you once I have some more info.

@adizere

adizere commented Mar 20, 2023

Hi, can someone provide a minimal way to reproduce this issue? We'd be glad to look into this, but we need that first. The config.toml with peers, gaia version, etc. Many thanks!

@mmulji-ic
Contributor

@adizere would you recommend doing a tendermint debug dump?

@mpoke mpoke moved this from 📥 Todo to 🏗 In progress in Cosmos Hub Mar 20, 2023
@bb4L
Author

bb4L commented Mar 21, 2023

@mmulji-ic
thanks for the information, let me know if you need anything from my side

@adizere

Hi, can someone provide a minimal way to reproduce this issue? We'd be glad to look into this, but we need that first. The config.toml with peers, gaia version, etc. Many thanks!

  • gaia version: 7.1.0, 8.0.1 as well as 9.0.0 (as written in the issue/other comments)

  • config.toml has no peers configured (at least mine doesn't);
    output of cat config.toml | grep peer:

    # If true, query the ABCI app on connecting to a new peer
    filter_peers = false
    # Address to advertise to peers for them to dial
    persistent_peers = ""
    # Maximum number of inbound peers
    max_num_inbound_peers = 40
    # Maximum number of outbound peers to connect to, excluding persistent peers
    max_num_outbound_peers = 10
    unconditional_peer_ids = ""
    # Maximum pause when redialing a persistent peer (if zero, exponential backoff is used)
    persistent_peers_max_dial_period = "0s"
    # Set true to enable the peer-exchange reactor
    # peers. If another node asks it for addresses, it responds and disconnects.
    # Does not work if the peer-exchange reactor is disabled.
    # Comma separated list of peer IDs to keep private (will not be gossiped to other peers)
    private_peer_ids = ""
    # Toggle to disable guard against peers connecting from the same ip.
    # Maximum size of a batch of transactions to send to a peer
    # snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
    # peer (default: 1 minute).
    peer_gossip_sleep_duration = "100ms"
    peer_query_maj23_sleep_duration = "2s"
    
  • minimal way to reproduce: I guess just try to run a node 🤷🏽‍♂️

@Daniel1984

I have 2 nodes running gaiad 9.0.1, and on both, Tendermint seems to stop working frequently and requests to http://localhost:1317/cosmos/base/tendermint/v1beta1/blocks/latest start to time out. I run a bash script that restarts gaiad when this resource times out. One of the 2 nodes (4 CPU / 16 GB RAM) had been running fine since last Tuesday up until yesterday evening. Since yesterday it seems to die every ~3-5 minutes. Standard config.

@adizere

adizere commented Mar 27, 2023

Thanks @bb4L @Daniel1984, we're actively monitoring our node but have not seen the behavior you're reporting yet.

Do you monitor your node? Wondering if the problem is not under-resourcing, i.e., the virtual machine on which your node is running might be unable to keep up with the network. It would be good to check what the CPU/memory profile looks like to eliminate that potential root cause!

@adizere would you recommend doing a tendermint debug dump?

Would be very helpful indeed!
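For anyone who wants to rule out under-resourcing, a quick check with standard Linux tooling might look like the sketch below (the process name gaiad, the ~/.gaia data path, and the use of Docker are assumptions):

    # CPU / memory / uptime of the gaiad process itself
    ps -o pid,%cpu,%mem,rss,etime,cmd -p "$(pgrep -x gaiad)"

    # If the node runs in a container, per-container usage instead
    docker stats --no-stream

    # Disk usage and I/O pressure are common reasons for falling behind
    df -h "$HOME/.gaia"
    iostat -x 5 3   # from the sysstat package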

@bb4L
Author

bb4L commented Mar 28, 2023

Do you monitor your node? Wondering if the problem is not under-resourcing, i.e., the virtual machine on which your node is running might be unable to keep up with the network. It would be good to check what the CPU/memory profile looks like to eliminate that potential root cause!

CPU / memory looks fine on my instance(s).

@mmulji-ic
Contributor

@Daniel1984 we discussed this a bit in the Telegram channel, and @MSalopek did a bit of an investigation. @nddeluca @bb4L it would be good to check the number of incoming REST/gRPC calls and which endpoints they are using, and to see whether pagination limits have a beneficial effect, as @MSalopek noted below.

Reporting that after disabling REST and gRPC the node functions as expected without hiccups.
Best advice I can give is to set up a load balancer/proxy (such as nginx or Cloudflare) and set up a rate-limiting system for your production nodes. It's known that an RPC server can be brought down with expensive queries - that is not necessarily a gaia issue; it's possible on all cosmos-sdk based networks.

@adizere could you also replicate with some heavy calls to the REST endpoints and see how the performance is impacted?
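As a rough illustration of that advice, an nginx reverse proxy with per-IP rate limiting in front of the REST port might look like the sketch below (server name, rate, and burst values are assumptions and will need tuning per deployment):

    # Illustrative nginx sketch: rate-limit public access to the REST API on 1317.
    limit_req_zone $binary_remote_addr zone=gaia_rest:10m rate=5r/s;

    server {
        listen 80;
        server_name rest.example.com;

        location / {
            # Allow short bursts; excess requests are rejected (503 by default)
            # instead of piling expensive queries onto the node.
            limit_req zone=gaia_rest burst=10 nodelay;
            proxy_pass http://127.0.0.1:1317;
        }
    }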

@mmulji-ic mmulji-ic moved this from 🏗 In progress to 🛑 Blocked in Cosmos Hub Apr 4, 2023
@bb4L
Author

bb4L commented Apr 5, 2023

@Daniel1984 we discussed this a bit in the Telegram channel, and @MSalopek did a bit of an investigation. @nddeluca @bb4L it would be good to check the number of incoming REST/gRPC calls and which endpoints they are using, and to see whether pagination limits have a beneficial effect, as @MSalopek noted below.

Reporting that after disabling REST and gRPC the node functions as expected without hiccups.
Best advice I can give is to set up a load balancer/proxy (such as nginx or Cloudflare) and set up a rate-limiting system for your production nodes. It's known that an RPC server can be brought down with expensive queries - that is not necessarily a gaia issue; it's possible on all cosmos-sdk based networks.

@adizere could you also replicate with some heavy calls to the REST endpoints and see how the performance is impacted?

For me the effect also occurs on nodes which aren't used by applications (so it can't be only a load-related issue).

@adizere

adizere commented Apr 14, 2023

For me the effect also occurs on nodes which aren't used by applications (so it can't be only a load-related issue).

Is there a way to trigger the problem? It seems like one way to reproduce the problem is by increasing the RPC (REST/gRPC) load on the node. Without that kind of pressure, are there other means to trigger this issue?

@adizere could you also replicate with some heavy calls to the REST endpoints and see how the performance is impacted?

I think it's already known that the JSON/RPC endpoints that CometBFT exposes (also REST and gRPC, which are at the app level) are not meant to be exposed without protection against abuse by front-end users. See "DoS Exposure and Mitigation" here. I'll think about whether there's additional context we can add to that section of the docs. Any thoughts/specific feedback welcome!

@bb4L
Author

bb4L commented Apr 17, 2023

Is there a way to trigger the problem? It seems like one way to reproduce the problem is by increasing the RPC (REST/gRPC) load on the node. Without that kind of pressure, are there other means to trigger this issue?

Can't tell, since it's happening without me doing anything / without a high RPC load...

@gaia

gaia commented Apr 18, 2023

I can assure you that I have noticed this same issue on other Cosmos SDK chains (Secret and Terra2) several times. This is not gaia-specific; there is something else upstream. It started happening a couple of months ago. I'm sorry I haven't been able to narrow it down beyond the timeframe and the chains on which we've seen this exact issue happening.

@staking-explorer

We have the same problem when running chihuahuad (based on the Cosmos SDK). The problem occurs for us only when the REST API is enabled and some application tries to download all accounts using the endpoint "cosmos/auth/v1beta1/accounts" (paginated). At that moment we see this output in our node logs:
May 04 09:01:23 chihuahua chihuahuad[674]: 9:01AM ERR Connection failed @ sendRoutine conn={"Logger":{}} err="pong timeout" module=p2p peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:26 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:27 chihuahua chihuahuad[674]: 9:01AM ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{}}
May 04 09:01:30 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Data":{},"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:44 chihuahua systemd[1]: node.service: Main process exited, code=killed, status=9/KILL
May 04 09:01:44 chihuahua systemd[1]: node.service: Failed with result 'signal'.

Disabling the API fully solves the problem. Upscaling the VPS from 4 cores / 8 GB RAM to 16 cores / 64 GB RAM does not solve it. This issue affects all Cosmos SDK projects. Seems that the issue may be closed.
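For others hitting this on a gaia node, the REST and gRPC servers are controlled from app.toml. A small sketch for checking and disabling them, assuming the default ~/.gaia home directory:

    # Show whether the REST API and gRPC servers are currently enabled.
    grep -A 5 '^\[api\]'  "$HOME/.gaia/config/app.toml"
    grep -A 5 '^\[grpc\]' "$HOME/.gaia/config/app.toml"

    # To disable them, set `enable = false` under both the [api] and [grpc]
    # sections of app.toml, then restart gaiad.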

@mmulji-ic
Contributor

We have the same problem when running chihuahuad (based on the Cosmos SDK). The problem occurs for us only when the REST API is enabled and some application tries to download all accounts using the endpoint "cosmos/auth/v1beta1/accounts" (paginated). At that moment we see this output in our node logs:
May 04 09:01:23 chihuahua chihuahuad[674]: 9:01AM ERR Connection failed @ sendRoutine conn={"Logger":{}} err="pong timeout" module=p2p peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:26 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:27 chihuahua chihuahuad[674]: 9:01AM ERR Stopping peer for error err="pong timeout" module=p2p peer={"Data":{},"Logger":{}}
May 04 09:01:30 chihuahua chihuahuad[674]: 9:01AM INF service stop impl={"Data":{},"Logger":{}} module=p2p msg={} peer={"id":"28c227d31064e4bacb366055d796f0c3064c1db0","ip":"149.202.72.186","port":26613}
May 04 09:01:44 chihuahua systemd[1]: node.service: Main process exited, code=killed, status=9/KILL
May 04 09:01:44 chihuahua systemd[1]: node.service: Failed with result 'signal'.

Disabling the API fully solves the problem. Upscaling the VPS from 4 cores / 8 GB RAM to 16 cores / 64 GB RAM does not solve it. This issue affects all Cosmos SDK projects. Seems that the issue may be closed.

@bb4L this is the conclusion we ended up with: there's an interplay between network traffic and node performance. This is a Tendermint / Comet level issue that we think has been addressed in versions after v8. Currently, v8 / v9 are not supported in production, only for archive-related issues, therefore closing this issue. For future versions, we will ask the Comet team to include longer-term tests with heavy RPC / REST loads to confirm that there is no regression and that the performance characteristics are understood.

@github-project-automation github-project-automation bot moved this from 🛑 Blocked to ✅ Done in Cosmos Hub Jun 27, 2023
@mmulji-ic
Contributor

Related to issue #1962
