2.0.0 resynching too slowly #8464
It would be great if internal p2p nodes could sync with each other using 100% CPU. However, as someone who has 500 connected p2p peers on EOS Mainnet, it is also important that the protocol not abuse public p2p providers. So if there is a plan to speed up transfers between trusted peers on a local network (good idea), please also make sure it doesn't mean that public p2p providers could get abused by someone syncing from block #1 to current.
But it's the receiving node which is mostly busy, because it needs to process those blocks and evaluate every transaction. And it doesn't receive enough blocks to keep that work going.
I just started a 2.0 node from a snapshot, syncing blocks from a 2.0 peer. CPU was maxed on the node receiving the blocks, so I am not seeing this problem.
@matthewdarwin maybe it had an incoming p2p connection and synched against it quickly |
All my internal p2p connections are bi-directional (A connects to B and B connects to A), so it's hard to say whether it was an "incoming" or "outgoing" connection.
And thinking about it, I probably rarely have nodeos in a state where it needs to sync to get many older blocks because if I start a new node, then I am starting from a backup, a ZFS snapshot or a nodeos snapshot. |
Probably the live network conditions are different, because I cannot reproduce the issue. Here's a test with 2.0.0, and I'll do the same with 2.0.1 tomorrow:
4000 blocks synced in 2 minutes; the p2p peer is in the same datacenter.
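For context, the throughput reported above works out as follows. This is a quick back-of-the-envelope sketch; the only assumption beyond the numbers in this comment is the standard 0.5 s EOS block interval, and the 100k-block gap is just an illustrative figure:

```python
# Back-of-the-envelope sync-rate math for the test above.
# EOS produces one block every 0.5 s, i.e. ~2 new blocks/s keep
# arriving while a node is catching up.

BLOCK_INTERVAL_S = 0.5  # EOS block time

def sync_rate(blocks_synced: int, seconds: float) -> float:
    """Observed replay rate in blocks per second."""
    return blocks_synced / seconds

def catchup_eta_s(blocks_behind: int, rate_bps: float) -> float:
    """Seconds until the node reaches head, accounting for the
    1 / BLOCK_INTERVAL_S blocks/s the chain keeps producing."""
    net = rate_bps - 1.0 / BLOCK_INTERVAL_S
    if net <= 0:
        raise ValueError("node never catches up at this rate")
    return blocks_behind / net

rate = sync_rate(4000, 120)          # ~33.3 blocks/s in this test
eta = catchup_eta_s(100_000, rate)   # ~53 min to clear a 100k-block gap
```

At ~33 blocks/s the node gains only ~31 blocks/s net on a live chain, so even a modest backlog takes a long time to clear; this is why the burst/idle behaviour reported later in the thread hurts so much.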
I also took a snapshot from 2020-01-17, but still can't reproduce the issue between two 2.0.0 servers. It's probably related to the live network traffic and frequency of forks at that time.
Synching 2.0.0 against a remote 2.0.1 gives the same result. Now trying 2.0.1 from 2.0.1.
Same result for 2.0.1 vs. 2.0.1. It seems like the problem is related to the harsh network conditions at the time of submission.
closing, can't reproduce again |
I believe I have reproduced this issue.
Possible fixes: either of these:
I upgraded a dapp infrastructure that consisted of several 1.8 nodes on EOS mainnet. All servers are in the same datacenter. I deleted the old data, downloaded a recent snapshot from EOS Nation and EOS Sweden, then enabled

```
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
```

and started nodeos from the snapshot. Some nodes have the state history plugin enabled, others do not. The first upgrades went alright and synched against public p2p quite quickly, although not occupying the CPU at 100% all the time.
The last servers synched very slowly while I tried to use direct peers in the same datacenter. It looks like a 2.0 node synching from 2.0 is affected, whereas when the remote peer is 1.8 it synchronizes much faster.
So, even with a remote peer in the same datacenter over a gigabit link, and only one peer configured, the node goes to 100% CPU for a few seconds and then waits for something, keeping the CPU at 2% for about 10 seconds. During this idling time, the head is not advancing. Then it resumes for a few seconds and idles again. This results in a very slow resync.
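A stall pattern like this caps throughput regardless of raw CPU speed. A small sketch of the effective rate; the 3 s busy / 10 s idle durations are rough readings of the figures in this comment, and the 200 blocks/s burst rate is purely an assumed number for illustration:

```python
# Model the burst/idle sync pattern: ~100% CPU for a few seconds,
# then ~10 s of idling while the head does not advance.

def effective_rate(busy_s: float, idle_s: float, busy_bps: float) -> float:
    """Average blocks/s when blocks are only applied during bursts."""
    duty_cycle = busy_s / (busy_s + idle_s)
    return duty_cycle * busy_bps

# Rough figures from this report: ~3 s busy, ~10 s idle.
# Even if the node could apply 200 blocks/s while busy (assumed),
# the observed average collapses to roughly a quarter of that.
avg = effective_rate(3.0, 10.0, 200.0)  # ~46 blocks/s
```

In other words, fixing the idle gaps between bursts would matter far more than making block application itself faster.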
I tried various values for `sync-fetch-span` between 50 and 500, and it didn't change the picture.