
2.0.0 resynching too slowly #8464

Closed
cc32d9 opened this issue Jan 17, 2020 · 12 comments

Comments

@cc32d9
Contributor

cc32d9 commented Jan 17, 2020

I upgraded a dapp infrastructure that consisted of several 1.8 nodes on EOS mainnet. All servers are in the same datacenter. I deleted the old data, downloaded a recent snapshot from EOS Nation and EOS Sweden, then enabled wasm-runtime = eos-vm-jit and eos-vm-oc-enable = true, and started nodeos from the snapshot. Some nodes have the state history plugin enabled, others do not.
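For context, the relevant changes look roughly like this (a minimal sketch; the paths and the snapshot filename here are placeholders, not the exact ones used):

# config.ini: settings enabled during the upgrade
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true

# start nodeos from the downloaded snapshot
nodeos --data-dir /path/to/data --config-dir /path/to/etc --snapshot /path/to/snapshot.bin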

The first upgrades went fine and synced against public p2p peers quite quickly, although the CPU was not at 100% the whole time.

The last servers synced very slowly when I tried to use direct peers in the same datacenter. It looks like a 2.0 node syncing from 2.0 is affected; when the remote peer is 1.8, it synchronizes much faster.

Even with a remote peer in the same datacenter over a gigabit link, and only one peer configured, the node goes to 100% CPU for a few seconds and then waits for something, keeping the CPU at about 2% for roughly 10 seconds. During this idle time the head is not advancing. Then it resumes for a few seconds and idles again. This results in a very slow resync.
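As an aside (not part of the original report; assumes the sysstat package is installed and a single nodeos process), the bursty pattern is easy to see by sampling the CPU usage of the nodeos process once per second:

# print per-second CPU usage of nodeos; the idle stretches show up as ~2% samples
pidstat -p $(pgrep -x nodeos) 1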

I tried various values for sync-fetch-span between 50 and 500, and it didn't change the picture.

@matthewdarwin

It would be great if internal p2p nodes could sync with each other using 100% CPU. However, as someone who has 500 connected p2p peers on EOS mainnet, I also think it is important that the protocol not abuse public p2p providers.

So if there is a plan to speed up transfers between trusted peers on a local network (a good idea), please also make sure it doesn't mean that public p2p providers can be abused by someone syncing from block #1 to the current head.

@cc32d9
Contributor Author

cc32d9 commented Jan 17, 2020

But it's the receiving node that is mostly busy, because it needs to process those blocks and evaluate every transaction, and it doesn't receive enough blocks to keep that work going.

@matthewdarwin

I just started from a snapshot on 2.0, syncing blocks from a 2.0 peer, with wasm-runtime = eos-vm-jit and eos-vm-oc-enable = true. CPU was maxed out on the node receiving the blocks.

So I am not seeing this problem.

@cc32d9
Contributor Author

cc32d9 commented Jan 18, 2020

@matthewdarwin maybe it had an incoming p2p connection and synced against it quickly.

@matthewdarwin

All my internal p2p connections are bi-directional (A connects to B and B connects to A)... so it's hard to say whether it was an "incoming" or "outgoing" connection.
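For illustration only (hypothetical hostnames), "bi-directional" here just means each side lists the other in its config.ini:

# node A's config.ini
p2p-peer-address = node-b.internal:9876

# node B's config.ini
p2p-peer-address = node-a.internal:9876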

@matthewdarwin

And thinking about it, I probably rarely have nodeos in a state where it needs to sync many older blocks, because when I start a new node I start from a backup, a ZFS snapshot, or a nodeos snapshot.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

The live network conditions were probably different, because I cannot reproduce the issue now.

Here's a test with 2.0.0; I'll do the same with 2.0.1 tomorrow:

rm -rf /srv/api01/data

cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 100
#p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT

### eosio 2.0.0

/usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc --snapshot /var/local/snapshot-102299426.bin &

root@9A24AF1:~# cleos get info | grep head_block_num
  "head_block_num": 102299426,


kill %1


cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 100
p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT


nohup /usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc &
sleep 120; cleos get info
"head_block_num": 102303226,

kill %1
mv nohup.out log1


102303226-102299426 = 3800

cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 500
p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT

nohup /usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc &
sleep 120; cleos get info
"head_block_num": 102303426,

102303426-102299426 = 4000

kill %1
mv nohup.out log2

4000 blocks synced in 2 minutes; the p2p peer is in the same datacenter.
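For scale, some quick arithmetic (mine, assuming the usual 0.5-second block interval on EOS mainnet, i.e. 2 blocks produced per second):

4000 blocks / 120 s ≈ 33 blocks/s
33 blocks/s / 2 blocks/s ≈ 16x real time

So both sync-fetch-span values (100 and 500) gave roughly the same throughput in this test.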

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

I also took a snapshot from 2020-01-17, but still can't reproduce the issue between two 2.0.0 servers. It's probably related to the live network traffic and the frequency of forks at that time.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

Syncing 2.0.0 against a remote 2.0.1 gives the same result. Now trying 2.0.1 from 2.0.1.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

Same result for 2.0.1 vs. 2.0.1.

It seems the problem is related to the harsh network conditions at the time of submission.

@cc32d9
Contributor Author

cc32d9 commented Feb 4, 2020

Closing; I can't reproduce it anymore.

cc32d9 closed this as completed Feb 4, 2020
@matthewdarwin

I believe I have reproduced this issue.

  • 2.0.3, EOS-VM, OC enabled
  • add lots of peers (which may or may not be synced, but at least some of them are); my test has around 40
  • start from a snapshot taken at least a few hours ago
  • try to sync
    Result: nodeos CPU usage is around 5%; blocks arrive slowly.

Possible fixes (either of these):

  • reduce the number of peers (1 or 2 is good; see the config sketch below)
  • use the release/2.0.x branch (as of 2020-02-27 00:00)
    Result: in either case, nodeos CPU approaches 100% (always 100% with only 1-2 good local peers).
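For anyone trying the first workaround, a sketch of the trimmed peer list in config.ini (the addresses are placeholders; the rest of the config stays as in the examples above):

# keep only one or two known-good peers while catching up
p2p-peer-address = peer1.example.net:9876
p2p-peer-address = peer2.example.net:9876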
