
2.0.0 resynching too slowly #8464

Closed
cc32d9 opened this issue Jan 17, 2020 · 12 comments

Comments

@cc32d9
Contributor

cc32d9 commented Jan 17, 2020

I upgraded a dapp infrastructure that consisted of several 1.8 nodes on EOS mainnet. All servers are in the same datacenter. I deleted the old data, downloaded a recent snapshot from EOS Nation and EOS Sweden, then enabled wasm-runtime = eos-vm-jit and eos-vm-oc-enable = true, and started nodeos from the snapshot. Some nodes have the state history plugin enabled, others do not.
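For context, the relevant changes look roughly like this (a minimal sketch; the paths and the snapshot filename here are placeholders, not the exact ones used):

# config.ini: settings enabled during the upgrade
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true

# start nodeos from the downloaded snapshot
nodeos --data-dir /path/to/data --config-dir /path/to/etc --snapshot /path/to/snapshot.bin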

The first upgrades went fine and synced against public p2p peers quite quickly, although the CPU was not at 100% the whole time.

The last servers synced very slowly when I tried to use direct peers in the same datacenter. It looks like a 2.0 node syncing from 2.0 is affected; when the remote peer is 1.8, it synchronizes much faster.

Even with a remote peer in the same datacenter over a gigabit link, and only one peer configured, the node goes to 100% CPU for a few seconds and then waits for something, keeping the CPU at about 2% for roughly 10 seconds. During this idle time the head is not advancing. Then it resumes for a few seconds and idles again. This results in a very slow resync.
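As an aside (not part of the original report; assumes the sysstat package is installed and a single nodeos process), the bursty pattern is easy to see by sampling the CPU usage of the nodeos process once per second:

# print per-second CPU usage of nodeos; the idle stretches show up as ~2% samples
pidstat -p $(pgrep -x nodeos) 1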

I tried various values for sync-fetch-span between 50 and 500, and it didn't change the picture.

@matthewdarwin

It would be great if internal p2p nodes could sync with each other using 100% CPU. However, as someone who has 500 connected p2p peers on EOS mainnet, I also think it is important that the protocol not abuse public p2p providers.

So if there is a plan to speed up transfers between trusted peers on a local network (a good idea), please also make sure it doesn't mean that public p2p providers can be abused by someone syncing from block #1 to the current head.

@cc32d9
Contributor Author

cc32d9 commented Jan 17, 2020

But it's the receiving node that is mostly busy, because it needs to process those blocks and evaluate every transaction, and it doesn't receive enough blocks to keep that work going.

@matthewdarwin

I just started from a snapshot on 2.0, syncing blocks from a 2.0 peer, with wasm-runtime = eos-vm-jit and eos-vm-oc-enable = true. CPU was maxed out on the node receiving the blocks.

So I am not seeing this problem.

@cc32d9
Contributor Author

cc32d9 commented Jan 18, 2020

@matthewdarwin maybe it had an incoming p2p connection and synced against it quickly.

@matthewdarwin

All my internal p2p connections are bi-directional (A connects to B and B connects to A)... so it's hard to say whether it was an "incoming" or "outgoing" connection.
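For illustration only (hypothetical hostnames), "bi-directional" here just means each side lists the other in its config.ini:

# node A's config.ini
p2p-peer-address = node-b.internal:9876

# node B's config.ini
p2p-peer-address = node-a.internal:9876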

@matthewdarwin

And thinking about it, I probably rarely have nodeos in a state where it needs to sync many older blocks, because when I start a new node I start from a backup, a ZFS snapshot, or a nodeos snapshot.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

The live network conditions were probably different, because I cannot reproduce the issue now.

Here's a test with 2.0.0; I'll do the same with 2.0.1 tomorrow:

rm -rf /srv/api01/data

cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 100
#p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT

### eosio 2.0.0

/usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc --snapshot /var/local/snapshot-102299426.bin &

root@9A24AF1:~# cleos get info | grep head_block_num
  "head_block_num": 102299426,


kill %1


cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 100
p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT


nohup /usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc &
sleep 120; cleos get info
"head_block_num": 102303226,

kill %1
mv nohup.out log1


102303226-102299426 = 3800

cat >/srv/api01/etc/config.ini <<'EOT'
chain-state-db-size-mb = 65536
reversible-blocks-db-size-mb = 2048
wasm-runtime = eos-vm-jit
eos-vm-oc-enable = true
validation-mode = light
http-server-address = 127.0.0.1:8888
p2p-listen-endpoint = 127.0.0.1:9801
plugin = eosio::chain_plugin
plugin = eosio::chain_api_plugin
sync-fetch-span = 500
p2p-peer-address = mainnet.eosamsterdam.net:9876
EOT

nohup /usr/bin/nodeos --data-dir /srv/api01/data --config-dir /srv/api01/etc &
sleep 120; cleos get info
"head_block_num": 102303426,

102303426-102299426 = 4000

kill %1
mv nohup.out log2

4000 blocks synced in 2 minutes; the p2p peer is in the same datacenter.
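For scale, some quick arithmetic (mine, assuming the usual 0.5-second block interval on EOS mainnet, i.e. 2 blocks produced per second):

4000 blocks / 120 s ≈ 33 blocks/s
33 blocks/s / 2 blocks/s ≈ 16x real time

So both sync-fetch-span values (100 and 500) gave roughly the same throughput in this test.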

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

I also took a snapshot from 2020-01-17, but still can't reproduce the issue between two 2.0.0 servers. It's probably related to the live network traffic and the frequency of forks at that time.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

Syncing 2.0.0 against a remote 2.0.1 gives the same result. Now trying 2.0.1 from 2.0.1.

@cc32d9
Contributor Author

cc32d9 commented Jan 29, 2020

Same result for 2.0.1 vs. 2.0.1.

It seems the problem is related to the harsh network conditions at the time of submission.

@cc32d9
Contributor Author

cc32d9 commented Feb 4, 2020

Closing; I can't reproduce it anymore.

cc32d9 closed this as completed Feb 4, 2020
@matthewdarwin

I believe I have reproduced this issue.

  • 2.0.3, EOS-VM, OC enabled
  • add lots of peers (which may or may not be synced, but at least some of them are); my test has around 40
  • start from a snapshot taken at least a few hours ago
  • try to sync
    Result: nodeos CPU usage is around 5%; blocks arrive slowly.

Possible fixes (either of these):

  • reduce the number of peers (1 or 2 is good; see the config sketch below)
  • use the release/2.0.x branch (as of 2020-02-27 00:00)
    Result: in either case, nodeos CPU approaches 100% (always 100% with only 1-2 good local peers).
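For anyone trying the first workaround, a sketch of the trimmed peer list in config.ini (the addresses are placeholders; the rest of the config stays as in the examples above):

# keep only one or two known-good peers while catching up
p2p-peer-address = peer1.example.net:9876
p2p-peer-address = peer2.example.net:9876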
