Bootstrapping blocks slowly crawl to a halt #3294

Closed
DarrenAlex opened this issue May 17, 2021 · 29 comments
Labels
issues_fixed_by_ascending_bootstrap Issues that will become irrelevant once ascending bootstrap is in operation and the only bootstrap

Comments

@DarrenAlex

Summary

While bootstrapping, the number of blocks fetched/downloaded per second gradually goes down.
As an example, when I first started fetching blocks from other peers, I was downloading about 2,000 blocks/second.
After hitting 21.3 million blocks (60 million for someone else), it slowed down to 20-30 blocks/second.

Local device resources seem to be fine

Node version

V22.0

Build details

Docker container, nothing out of the ordinary.

OS and version

Linux 20.04

Steps to reproduce the behavior

  1. Follow the installation instructions in the docs
  2. Start bootstrapping
  3. Monitor the blocks/sec progress (a small monitoring sketch follows this list)
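
One rough way to do step 3 (a sketch, not from the docs; it assumes jq is installed and the RPC is enabled on 127.0.0.1:7076, as in the block_count examples below):

# Poll the block_count RPC once a minute and print the average blocks/sec gained.
PREV=0
while true; do
  COUNT=$(curl -s -d '{"action": "block_count"}' 127.0.0.1:7076 | jq -r '.count')
  if [ "$PREV" -gt 0 ]; then
    echo "$(date -u +%T)  $(( (COUNT - PREV) / 60 )) blocks/sec"
  fi
  PREV=$COUNT
  sleep 60
done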

Expected behavior

The software continues to sync blocks at a steady or increasing rate, not a decreasing one

Actual behavior

The software syncs blocks at a steadily decreasing rate

Possible solution

Honestly, I don't have any idea.

Supporting files

No response

@qwahzi
Collaborator

qwahzi commented May 17, 2021

Same issue for me with multiple nodes/bootstrapping attempts (CentOS 8). The slowdown happens around ~59-60M blocks for me, though. In the last 4 hours, I've only gained ~100 counted blocks.

$ cat startTime.txt
           Universal time: Thu 2021-05-13 20:49:04 UTC

$ curl -d '{"action": "block_count"}' 127.0.0.1:7076
{
    "count": "60680998",
    "unchecked": "60623590",
    "cemented": "37743403"
}

$ timedatectl
           Universal time: Mon 2021-05-17 16:15:28 UTC

Not sure if related, but I see a bunch of these messages in the logs:

[2021-May-17 16:12:33.072637]: Error initiating bootstrap connection to [::ffff:143.198.231.110]:54182: Connection reset by peer
[2021-May-17 16:12:33.839897]: 27416 accounts in pull queue
[2021-May-17 16:12:44.109434]: Error initiating bootstrap connection to [::ffff:194.195.243.169]:57930: No route to host

@DarrenAlex
Author

DarrenAlex commented May 17, 2021

I also receive those messages in the log file. I've uploaded the log file (only the lines that contain the word "bootstrap") to https://node.erawan.me/log.txt

The only difference is that the counted and cemented blocks are stuck at 1 for me.
If you sum them all up, the total is close to the network's total block count. Is your cemented block number going up?

@atomlab

atomlab commented May 20, 2021

The same problem!

@smithy9901

+1

@qwahzi
Collaborator

qwahzi commented May 20, 2021

My CentOS 8 node (RocksDB) suddenly caught up on count (from 70M to 117M) overnight. On V21.3 with LMDB, the same node was stuck at <60M count for over a month.

curl -d '{"action": "block_count"}' 127.0.0.1:7076
{
    "count": "117825631",
    "unchecked": "972886",
    "cemented": "52043280"
}

My Windows 10 RocksDB + pruning node is still stuck:

{
    "count": "51852881",
    "unchecked": "77711715",
    "cemented": "39891722",
    "full": "13068695",
    "pruned": "38784186"
}

@michaeljs1990

michaeljs1990 commented May 26, 2021

Running into the same issue.

{
    "count": "50864256",
    "unchecked": "65442030",
    "cemented": "33401959"
}

It gets to this point and then slows to a crawl from which it will never catch up.

I did catch this in the logs, but I'm not sure if it matters.

[2021-May-26 04:10:40.444040]: Starting legacy bootstrap attempt with ID auto_bootstrap_0
[2021-May-26 04:10:40.444231]: Bootstrap attempt stopped because there are no peers
[2021-May-26 04:10:40.444312]: Exiting legacy bootstrap attempt with ID auto_bootstrap_0
[2021-May-26 04:10:40.452590]: Beginning pending block search
[2021-May-26 04:10:40.452708]: IPC: server started

In addition, I'm only at 43% synced according to the Nano node stats monitor, and my database is already 100 GB. Is Nano's blockchain really already over 200 GB?

My node has had consistent memory, CPU, and disk usage throughout the entire sync; however, I'm not really sure what it's doing with all of the data it's writing. I started from scratch with a v22 node, so it shouldn't have to do any conversion between node database versions.

@qwahzi
Collaborator

qwahzi commented May 26, 2021

I had the same message in my logs when my Windows 10 node disconnected:


[2021-May-22 10:31:30.217942]: UPnP local address: 192.168.86.25, discovery: 0, IGD search: 1
[2021-May-22 10:32:34.046196]: Total recently pruned block count: 0
[2021-May-22 10:33:02.575300]: Beginning pending block search
[2021-May-22 10:37:54.012282]: Total recently pruned block count: 0
[2021-May-22 10:38:02.575701]: Beginning pending block search
[2021-May-22 10:41:25.958997]: Starting legacy bootstrap attempt with ID auto_bootstrap_85
[2021-May-22 10:41:25.958997]: Bootstrap attempt stopped because there are no peers
[2021-May-22 10:41:25.958997]: Exiting legacy bootstrap attempt with ID auto_bootstrap_85
[2021-May-22 10:43:02.576639]: Beginning pending block search
[2021-May-22 10:43:13.742616]: Total recently pruned block count: 0
[2021-May-22 10:45:16.239424]: UPnP local address: 192.168.86.25, discovery: 0, IGD search: 1
[2021-May-22 10:48:02.577391]: Beginning pending block search
[2021-May-22 10:48:33.333862]: Total recently pruned block count: 0
[2021-May-22 10:53:02.577717]: Beginning pending block search
[2021-May-22 10:53:54.688580]: Total recently pruned block count: 0
[2021-May-22 10:56:25.959882]: Starting legacy bootstrap attempt with ID auto_bootstrap_86
[2021-May-22 10:56:25.959882]: Bootstrap attempt stopped because there are no peers
[2021-May-22 10:56:25.959882]: Exiting legacy bootstrap attempt with ID auto_bootstrap_86
[2021-May-22 10:58:02.578748]: Beginning pending block search
[2021-May-22 10:59:02.242813]: UPnP local address: 192.168.86.25, discovery: 0, IGD search: 1
[2021-May-22 10:59:19.093811]: Total recently pruned block count: 0
[2021-May-22 11:03:02.579804]: Beginning pending block search
[2021-May-22 11:04:39.309938]: Total recently pruned block count: 0
[2021-May-22 11:08:02.581172]: Beginning pending block search

After restarting the node again and leaving it for another week, it seems to be making some progress:

{
    "count": "61293347",
    "unchecked": "67421914",
    "cemented": "51092686",
    "full": "12016287",
    "pruned": "49277060"
}

CentOS 8 node full sync status update:

{
    "count": "118174879",
    "unchecked": "902500",
    "cemented": "74119823",
    "full": "50116606",
    "pruned": "68058273"
}

@michaeljs1990

michaeljs1990 commented May 27, 2021

You can download the latest bootstrap database file from here and then start the node. It took me about an hour to download and extract the 7z file, and I was at 100% a few minutes after starting. This doesn't fix the underlying issue, but if you are trying to get a node up and running, it's the fastest way right now.
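
A rough sketch of that route for an LMDB node (the URL, data directory, and container name below are placeholders, not values from this comment; only use a snapshot from a source you trust, and note that the node still has to confirm/cement the blocks itself):

LEDGER_URL="https://example.com/ledger.7z"   # placeholder, not an official link
NANO_DATA_DIR="$HOME/Nano"                   # placeholder; Docker setups use their mounted volume

docker stop nano-node                        # placeholder container name; use however your node is run
wget -O /tmp/ledger.7z "$LEDGER_URL"
7z x /tmp/ledger.7z -o/tmp/ledger
cp /tmp/ledger/data.ldb "$NANO_DATA_DIR/data.ldb"   # replace the LMDB ledger file
docker start nano-node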

@DarrenAlex
Author

downloading the latest bootstrap database file from here

The whole point of bootstrapping is to not use the ledger download.

@regfaker

regfaker commented May 28, 2021

My CentOS 8 node (RocksDB) suddenly caught up on count (from 70M to 117M) overnight. On V21.3 with LMDB, the same node was stuck at <60M count for over a month

Same jump here: I was at 72M/70M count/cemented at 7:13 today and am now at 119M/77M count/cemented at 14:04. In the days before, it was crawling, with the count increasing by only a few hundred over several hours.

(RocksDB)

@michaeljs1990

michaeljs1990 commented May 28, 2021

downloading the latest bootstrap database file from here

The whole point of bootstrapping is to not use the ledger download.

I agree, but if your goal is to get a node running, that looked to be the only way to do it before the network started moving a bit faster. In the end, if the bootstrap file were bad, its blocks would be rejected by the network anyway.

@qwahzi
Collaborator

qwahzi commented May 30, 2021

CentOS 8 node status update, after 17 days:

$ cat startTime2.txt
           Universal time: Thu 2021-05-13 20:49:04 UTC

$ curl -d '{"action": "block_count"}' 127.0.0.1:7076
{
    "count": "118491518",
    "unchecked": "956586",
    "cemented": "98261377",
    "full": "33622456",
    "pruned": "84869062"
}

$ timedatectl
           Universal time: Sun 2021-05-30 14:11:20 UTC

Windows 10 node status update (still pretty stuck):

{
    "count": "63138031",
    "unchecked": "65450762",
    "cemented": "57874107",
    "full": "7837324",
    "pruned": "55300707"
}

@Joohansson

Joohansson commented Jun 2, 2021

Same here, on Windows. 14 days in, confirming at 30 cps.


@qwahzi
Collaborator

qwahzi commented Jun 2, 2021

Both of my nodes are now unstuck and getting pretty close:

CentOS 8:

{
    "count": "119027602",
    "unchecked": "791562",
    "cemented": "108980055",
    "full": "28467965",
    "pruned": "90559637"
}

Windows 10:

{
    "count": "118949710",
    "unchecked": "1785684",
    "cemented": "86111699",
    "full": "39533449",
    "pruned": "79416261"
}

@Joohansson

Joohansson commented Jun 3, 2021

Mine started to move now as well, at 5,000 bps. Block count is 92M.
Edit (day 16): still 31M unconfirmed.

@qwahzi
Collaborator

qwahzi commented Jun 6, 2021

My CentOS 8 node officially finished syncing from scratch last night (sometime in the last 12 hours). For some reason it kept restarting itself roughly once a day over the last week, but there were no details in the logs.

$ cat startTime.txt
           Universal time: Thu 2021-05-13 20:49:04 UTC
          
$ curl -d '{"action": "block_count"}' 127.0.0.1:7076
{
    "count": "120612378",
    "unchecked": "28987",
    "cemented": "120612370",
    "full": "25587436",
    "pruned": "95024942"
}

$ timedatectl
           Universal time: Sun 2021-06-06 13:10:32 UTC

$ du -xhs /nano
22G     /nano

$ docker ps -a
CONTAINER ID   IMAGE                     COMMAND                  CREATED       STATUS        PORTS                                                                                                                                      NAMES
XXXX nanocurrency/nano:V22.0   "/usr/bin/entry.sh n…"   2 weeks ago   Up 16 hours   127.0.0.1:7076->7076/tcp, 0.0.0.0:7075->7075/tcp, 0.0.0.0:7075->7075/udp, :::7075->7075/tcp, :::7075->7075/udp, 127.0.0.1:7078->7078/tcp   nano-container

My Windows 10 node is still going, but making progress:

{
    "count": "118992721",
    "unchecked": "2149929",
    "cemented": "95323574",
    "full": "34485057",
    "pruned": "84507664"
}

@Joohansson

I still have 21M blocks to confirm and am mostly doing 25-30 cps, so I'm giving up. It was just an experiment, but it's simply taking too long for me.

@zhyatt zhyatt added this to the Research for Future Release milestone Jun 22, 2021
@JohnGarnham

Same here. Syncing my node was taking way too long, and after a few days I just gave up and manually downloaded the ledger. Glad it's not just me!

@Exxenoz
Contributor

Exxenoz commented Jul 18, 2021

My experience with V22.1: it took a few days to request all the unchecked blocks, but cementing was stuck at around 1-2 blocks per second. After a week and around 2 million cemented blocks, I gave up and downloaded the MyNanoNinja ledger. My node was synchronized in less than two hours, even though it still had to request and cement a few hundred thousand blocks.

@ghost

ghost commented Sep 8, 2021

I think I was able to find some information on the bootstrapping problem.
My node started churning as well:

  • block_count's unchecked figure would go up, then down
  • frontier_count would go up slowly
  • the logs cycle like this:
27416 accounts in pull queue
...
0 accounts in pull queue
...
Flushing unchecked blocks
Finished flushing unchecked blocks
...
27416 accounts in pull queue

So I think what is happening is that RocksDB is garbage collecting the unchecked blocks after "accounts in pull queue" reaches zero, whereas LMDB only garbage collects when you explicitly ask for it.
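
A rough way to observe the cycle described above (the log path is an assumption; adjust it to wherever your node writes its logs):

# Print the count/unchecked figures and the latest pull-queue/flush log lines every minute.
while true; do
  curl -s -d '{"action": "block_count"}' 127.0.0.1:7076 | jq '{count, unchecked}'
  grep -hiE "accounts in pull queue|flushing unchecked" ~/Nano/log/log_*.log | tail -n 3
  sleep 60
done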

@zhyatt
Collaborator

zhyatt commented Sep 10, 2021

So I think what is happening is that RocksDB is garbage collecting the unchecked blocks after "accounts in pull queue" reaches zero, whereas LMDB only garbage collects when you explicitly ask for it.

@theohax @clemahieu Although this note from @coranos (thanks!) may be RocksDB-specific, I wanted to highlight it for you as you continue to look into bootstrapping activities.

@ghost

ghost commented Sep 12, 2021

I did some more investigation, and I believe I have found a clue.

Using frontiers, I was able to identify an account that took several hours to appear. It was taking a long time with both RocksDB and LMDB, and this is on Banano's v22, so YMMV.

These accounts were in a chunk of accounts that took 8 hours of 100% CPU churning to appear:
[
  "ban_1ah4xni57wd9m3ub9b6u59714ngntyugm5689gamthgibzr1nrqn1j15rbxr",
  "ban_1tyjsn7ra7z7qb6a9akfxrsonqtzfgnwggkzhodccnuss4ms5bieacnkeftk",
  "ban_3aegykcqtuggoc98ypm7q5aicx7t6s1exh7nxhyycocac1amg8j8noi4gxjr",
  "ban_3fb7gkrncm3ojo96fnpwhkrj1km4dh8gzo4sdxzjkmqcbtynqd94z7rua1nz",
  "ban_3fqcofp1yempyttg6pnouwd685w8t3aoo78trmqkkk9q6gr3xinastt5yo5s"
]

Most are 2-3 hops from faucets (and the faucets would be close to genesis), so they may be false positives.

The account ban_1tyjsn7ra7z7qb6a9akfxrsonqtzfgnwggkzhodccnuss4ms5bieacnkeftk
was fed from the account ban_1kshbcgqqzwwx8wmm994xke7bzrngt4thhxhhf49ybxth9tesgtj6acog63e,
which isn't a known faucet and has a high block height of 10993.
So it may be that ban_1tyjsn took a while because it is far from genesis.
I don't have exact telemetry to prove that, but that is what I have so far.

https://creeper.banano.cc/explorer/block/BA9FADD1F3FC347D32DA3FFCFCBC07C21E4D9FAA440D3F71236AB5C7A2B89251
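
For reference, the account_info RPC shows how long each of those chains is (just a sketch; the Banano RPC port used here, 7072, is an assumption):

for a in ban_1ah4xni57wd9m3ub9b6u59714ngntyugm5689gamthgibzr1nrqn1j15rbxr \
         ban_1tyjsn7ra7z7qb6a9akfxrsonqtzfgnwggkzhodccnuss4ms5bieacnkeftk; do
  curl -s -d "{\"action\": \"account_info\", \"account\": \"$a\"}" 127.0.0.1:7072 \
    | jq --arg acct "$a" '{account: $acct, block_count, open_block}'
done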

@jarviscodes

FYI:
My node was completely synced in less than 12 hours.

Start time: Sep 13 13:39
End time: Sep 14 11:58

I have no "legacy bootstrap attempts" in my logs.
It's probably worth mentioning that I bootstrapped v20 instead of v22.

@ghost

ghost commented Sep 17, 2021

Was the fork block resolution code changed in v22?

I did some tests on Banano, since it has a smaller list of frontiers. A $40/mo server with 8 GB of RAM and 4 cores syncs to 24M of 27M blocks in 24 hours.

If I then compare the frontiers of a fully synced node and work back through the link and previous blocks, I find hash 2405BCECCFECFDA6E72A72445BBAB2D0411C1B6315685E65D869734402647223 in the fully synced Banano node, but it is absent from the out-of-sync node.

If I try to process that block, I get a fork block on the out-of-sync node.

If I force the fork to resolve in favor of the 2405..7223 block (so that the out-of-sync node matches the fully synced node), it then syncs another ~1000 blocks.

I'm now rerunning the scan, looking for another fork block.

@ghost

ghost commented Sep 18, 2021

No more fork blocks so far, but I have found a handful of other missing blocks using the following algorithm:

  1. Get the frontiers from the fully synced node (1,078,223 accounts).
  2. Get the frontiers from the desynced node (1,064,159 accounts).
  3. Take every frontier that is in the fully synced set but not in the desynced set (~25,000 hashes).
  4. Using those frontiers as a set of hashes, pull each block from the fully synced node and look for its link and previous blocks on the desynced node.
  5. For any block whose link and previous are both on the desynced node, call 'process' to submit the fully synced block to the desynced node.
  6. Repeat steps 4-5, using the set of not-found link and previous hashes as the new set of hashes.

So I start at the gap blocks and try to work my way back towards genesis. (A rough script for this is sketched below.)
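
A rough shell sketch of that procedure (assumptions: both nodes expose RPC locally on the ports shown, jq is installed, the frontiers are state blocks, and step 6's recursion is left out; shown for a Nano node, so a Banano node would use its own ports and account prefix):

SYNCED=127.0.0.1:7076     # fully synced node (assumed port)
DESYNCED=127.0.0.1:7077   # out-of-sync node (assumed port)
BURN=nano_1111111111111111111111111111111111111111111111111111hifc8npp  # zero public key, so frontier iteration starts at the first account

# Steps 1-2: dump all frontier hashes from both nodes, sorted for comparison.
curl -s -d "{\"action\": \"frontiers\", \"account\": \"$BURN\", \"count\": \"2000000\"}" $SYNCED \
  | jq -r '.frontiers[]' | sort > synced.txt
curl -s -d "{\"action\": \"frontiers\", \"account\": \"$BURN\", \"count\": \"2000000\"}" $DESYNCED \
  | jq -r '.frontiers[]' | sort > desynced.txt

# Step 3: frontier hashes present on the synced node but missing from the desynced one.
comm -23 synced.txt desynced.txt > missing.txt

# Steps 4-5: for each missing hash, fetch the block from the synced node; if its
# previous and link already exist on the desynced node, submit it there.
while read -r hash; do
  block=$(curl -s -d "{\"action\": \"block_info\", \"json_block\": \"true\", \"hash\": \"$hash\"}" $SYNCED)
  prev=$(echo "$block" | jq -r '.contents.previous')
  link=$(echo "$block" | jq -r '.contents.link')
  if curl -s -d "{\"action\": \"block_info\", \"hash\": \"$prev\"}" $DESYNCED | jq -e '.contents' > /dev/null && \
     curl -s -d "{\"action\": \"block_info\", \"hash\": \"$link\"}" $DESYNCED | jq -e '.contents' > /dev/null; then
    echo "$block" | jq '{action: "process", json_block: "true", block: .contents}' | curl -s -d @- $DESYNCED
  fi
done < missing.txt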

@ghost

ghost commented Sep 20, 2021

Update from Banano: it appears we have resolved the problem on Banano.

The problem was this: we did not have over 50% of the rep weight on v22; most of it was still on v20.

The resolution was this: when over 50% of the rep weight moved to v22, a new node synced successfully with no perceptible slowdown, and the fork that was detected earlier did not occur. In addition, block cementing happened at a much more rapid pace.

Since Nano is on v22 and has been for months, someone may want to bootstrap Nano from genesis and see whether the issue still occurs or has resolved itself.
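
A quick way to see how much of the network has upgraded is the raw telemetry RPC, which reports each peer's version (a sketch; note this counts peers, not voting weight, which would also need representative data):

curl -s -d '{"action": "telemetry", "raw": "true"}' 127.0.0.1:7076 \
  | jq -r '.metrics[].major_version' | sort | uniq -c | sort -rn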

@dakshsingh

I had this issue too. I'm new to Nano and wanted to set up a representative using Docker. I started bootstrapping on 22.1 about 6 days ago, with plenty of RAM and cores and a fast NVMe drive. It reached about 50M cemented blocks in about 2 days and then slowed to a crawl, with intense disk activity continuing. It reached just 61M blocks over the next 4 days, and I decided to switch over to a downloaded ledger. Interestingly, the downloaded data.ldb was about 50 GB uncompressed, while the partially synced data.ldb it replaced was over 140 GB.

Currently cementing blocks at a high rate. Will post back once done.

@dakshsingh

Yes. Synced up in about an hour.

@qwahzi qwahzi added the issues_fixed_by_ascending_bootstrap Issues that will become irrelevant once ascending bootstrap is in operation and the only bootstrap label Dec 28, 2022
@qwahzi
Collaborator

qwahzi commented May 25, 2023

This appears to have been resolved by ascending bootstrap in V25. Let me know if you're still having issues and we can reopen the issue.

@qwahzi qwahzi closed this as completed May 25, 2023