
Client Problems: Max Listener Memory Leak / Log Output Format / Snappy Errors / Invalid Trie Sync Stops #1539

Closed
holgerd77 opened this issue Oct 22, 2021 · 7 comments

Comments

@holgerd77
Member

Various problems with our beloved client 🙂 making things a bit production-unready atm, some of which might have been introduced along the latest Merge work.

Not sure if it's wise to throw all this into one issue, but I also didn't want to blow things up too much; I guess (most) things can be fixed on the side (I will actually start directly after posting here).

1. Max Listener Memory Leak

We still have a memory leak caused by too many event listeners being added and not removed, this is from the server log:

[screenshot: server log with the max listener warning]

Should be easily fixable though.
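
For reference, a minimal sketch of the usual fix pattern (hypothetical names, not the actual client code): every listener registered for a peer/connection gets removed again when that connection ends, so the listener count stays bounded:

```typescript
import { EventEmitter } from 'events'

// Hypothetical sketch: pair every `on()` with a matching cleanup so the
// emitter's listener count stays bounded instead of growing with every
// new connection (which triggers Node's MaxListenersExceededWarning).
function attachPeerListeners(emitter: EventEmitter) {
  const onMessage = (msg: unknown) => {
    // ...handle the message
  }
  emitter.on('message', onMessage)
  emitter.once('close', () => {
    emitter.removeListener('message', onMessage)
  })
}
```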

2. Log Output Format Error

The normal log output (so running from the console with npm run client:start) now always appends a 39m at the end of the colored labels in log lines, see e.g.:

[screenshot: console log output with trailing 39m]

This was almost certainly introduced along the log file work in the latest Merge PR and should be located close to these changes. I had a look into this yesterday but couldn't completely locate and fix it yet.
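
For context, a guess at the mechanism (not a confirmed diagnosis): 39m is the tail of the ANSI sequence \x1b[39m that resets the foreground color after a colored label, so if a formatter mangles or strips part of the escape sequence, the remainder shows up as literal text:

```typescript
// '\x1b[32m...\x1b[39m' renders a label in green and then resets the
// foreground color. Stripping only the ESC (\x1b) control characters
// leaves the rest of the sequences visible as plain text.
const label = '\x1b[32mINFO\x1b[39m'
console.log(label)                      // green "INFO" in a terminal
console.log(label.replace(/\x1b/g, '')) // "[32mINFO[39m" - stray sequence text
```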

3. Snappy Errors

There are new snappy errors looking like this:

Oct 21 20:23:26 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:26] New sync target height number=13463136 hash=ca61a4a2...
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]: ERROR [10-21|20:23:31] Error: Invalid Snappy bitstream
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Object.uncompress (/home/ubuntu/ethereumjs-monorepo/node_modules/snappyjs/index.js:88:13)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Peer._handleBody (/home/ubuntu/ethereumjs-monorepo/packages/devp2p/src/rlpx/peer.ts:540:26)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Peer._onSocketData (/home/ubuntu/ethereumjs-monorepo/packages/devp2p/src/rlpx/peer.ts:601:18)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.emit (events.js:400:28)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.emit (domain.js:475:12)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at addChunk (internal/streams/readable.js:293:12)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at readableAddChunk (internal/streams/readable.js:267:9)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.Readable.push (internal/streams/readable.js:206:10)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]: WARN [10-21|20:23:31] Server error: Error - Invalid Snappy bitstream
Oct 21 20:23:38 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:38] Imported blocks count=50 number=128919 hash=b74600d5... hardfork=chainstart peers=9
Oct 21 20:23:38 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:38] Executed blocks count=50 first=128918 hash=7bac555a... td=298457505472177195 hardfork=chainstart last=128968 hash=8c881ae2... with txs=11

These don't seem to break the control flow, but they might have bad consequences under the hood (like a normally good peer getting disconnected or the like) and should therefore be examined. It might be that my latest Snappy fix in #1484 didn't fully address the problem (it was somewhat hacky in the first place).

The errors occur regularly, so this should be relatively easy to reproduce.
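
A minimal sketch of the kind of defensive handling this suggests (using the snappyjs uncompress call from the stack trace; this is not the actual peer.ts code):

```typescript
import * as snappy from 'snappyjs'

// Hedged sketch: tolerate a malformed Snappy payload from a single peer
// instead of letting the exception bubble up into the server error handler.
function safeUncompress(body: Buffer): Buffer | null {
  try {
    return snappy.uncompress(body)
  } catch (e) {
    // Invalid bitstream: drop this message (and possibly penalize the
    // peer) rather than failing the whole socket data handler.
    return null
  }
}
```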

4. Invalid Trie Sync Stops

Last thing, and likely the most severe/annoying one: sync (tested on mainnet on the server) breaks regularly (after a couple of 10,000 up to 100,000 blocks) with an invalid receipt trie, invalid tx trie, or similar message, see e.g.:

[screenshot: invalid trie error in the sync log]

This completely breaks the sync flow, and sync is never picked up after that (bad 🙃). The (somewhat) good thing is that sync will continue once the client is restarted. So this rather hints at something like processing a block in the wrong order, or occasionally handling a block or block Common in a way that breaks the block setup and/or structure.

Anyhow. To be analyzed.

Not totally easy, since this can only be triggered relatively irregularly (it does happen on every client run though, after at most 30-60 minutes).

I guess the best approach might be to carefully craft in some meaningful debugging output additions and then just let the client run on the side. Might give this a try during the day.
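
For illustration, such an addition could be a scoped logger via the debug package (namespace and call site are made up here):

```typescript
import { debug as createDebugLogger } from 'debug'

// Illustrative only: a scoped debug logger that can be switched on via
// e.g. `DEBUG=client:fetcher:store npm run client:start` without touching
// the normal log output.
const debugLog = createDebugLogger('client:fetcher:store')

function debugStore(first: number, count: number) {
  debugLog(`storing block range first=${first} count=${count}`)
}
```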

@holgerd77
Member Author

And the extended error messages from #1540 directly in action, in the context of issue 4:

[screenshot: extended error message output]

(even directly picking up the coloring 🙂)

@holgerd77
Member Author

The client is not yet stable - especially regarding 4. - so everyone who wants to debug, please try to sync a couple of thousand blocks (> 5,000 or so).

@holgerd77
Member Author

Also just dropping this here: another thing, and maybe it was a one-time occurrence, but since it was very spooky I don't want it to get lost. On one of the latest server runs I had:

[screenshot]

@ryanio
Contributor

ryanio commented Oct 26, 2021

Nice, the new errors are already coming in handy. Yes, it looks like we are inserting blocks out of order: this 79923 one seems to have jumped in 6 blocks early, and 80643 being executed after it is really strange. Going to take a deeper look at the blockfetcher.

INFO [10-25|18:02:53] Imported blocks count=50 number=79866 hash=cba85ba8... hardfork=chainstart peers=9
finished:  10
INFO [10-25|18:02:53] Executed blocks count=50 first=79865 hash=da065ba6... td=105538793080819005 hardfork=chainstart last=79915 hash=f310c295... with txs=37
INFO [10-25|18:02:53] Executed blocks count=2 first=79915 hash=f310c295... td=105541141406504755 hardfork=chainstart last=79917 hash=de1755a1... with txs=38
WARN [10-25|18:02:53] Error along storing received block or header result: Error: invalid uncle hash (block number=79923 hash=0x818faf022269da8c3f20ba3630a8b90e3b6ba2c1769e74509e40e325114d2112 hf=chainstart baseFeePerGas=none txs=0 uncles=1)
INFO [10-25|18:02:57] New sync target height number=13489962 hash=29820ffe...
WARN [10-25|18:02:57] Error along storing received block or header result: Error: could not find parent header (block header number=80643 hash=0x4fd6cce23237ea6377c86ac084d9ded97bda7cf773b36795fae315953d13d103 hf=chainstart baseFeePerGas=none)

I have some fixes for the snappy uncompress coming in a PR.


Here's another clue: this one was properly executed in order, but when I look up this block on Etherscan it says there is 1 tx, while below it says txs=0. So perhaps the body is returning with no result (or timing out) and still being inserted instead of being invalidated or the information re-requested. Will look into it at blockfetcher.ts:L42.

WARN [10-25|18:18:19] Error along storing received block or header result: Error: invalid transaction trie (block number=92789 hash=0x05d00c31f341c74e3103f7f49930dc5b8ef55dd3a5508d9be14399e09260b2fd hf=chainstart baseFeePerGas=none txs=0 uncles=0)
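
A quick sanity check along those lines (a sketch under the assumption that KECCAK256_RLP from ethereumjs-util is the empty-trie root; the function itself is hypothetical): if the header commits to a non-empty transactions trie, a body with zero transactions must be incomplete and should be re-requested rather than stored:

```typescript
import { KECCAK256_RLP } from 'ethereumjs-util'

// Hedged sketch: KECCAK256_RLP is the root of an empty trie. A header
// whose transactionsTrie differs from it promises at least one tx, so a
// body result with txCount === 0 cannot be complete.
function bodyLooksComplete(transactionsTrie: Buffer, txCount: number): boolean {
  const headerExpectsNoTxs = transactionsTrie.equals(KECCAK256_RLP)
  return headerExpectsNoTxs ? txCount === 0 : txCount > 0
}
```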

@holgerd77
Member Author

Issue 4 is still present, see e.g. the following screenshot from the latest client run on the server (I actually needed to fix the service script for this, which is now working):

[screenshot: invalid trie error on the latest client run]

Should get easier to debug with the Fetcher debug logger from #1544 activated.

@holgerd77
Member Author

Here is some debug output from another failure; the failing block at the end is 672044:

[screenshot: fetcher debug output leading up to block 672044]

The associated job is re-enqueued in the fetcher ("undefined or empty result set returned"). The re-executed request can also be seen in the screenshot above.

Then - somewhat further down the line - this happens:

[screenshot: block reported with 0 transactions]

The respective block is shown as a 0-transaction block - which is wrong, since according to Etherscan the block contains 1 transaction. This leads to the transaction trie being wrong (respectively the root not matching) (I guess a similar observation has already been made by Ryan @ryanio).

Hmm, still a bit hard to debug; it's not yet clear to me what the connection is between the re-request and the block falsely having (or being instantiated with) no transactions. 🤔
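
For illustration, a sketch of the guard these observations suggest (all names hypothetical, not the actual fetcher code): an undefined or empty result set should only ever re-enqueue the job, never feed into block construction:

```typescript
// Hypothetical sketch of the re-enqueue path: never turn an empty body
// result into a block; retry the job instead (ideally with another peer).
interface Job<R> {
  task: { first: number; count: number }
  result?: R[]
}

function handleBodyResult<R>(
  job: Job<R>,
  result: R[] | undefined,
  enqueue: (job: Job<R>) => void,
  store: (result: R[]) => void
) {
  if (result === undefined || result.length === 0) {
    enqueue(job) // "undefined or empty result set returned" -> retry
    return
  }
  store(result)
}
```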

@holgerd77
Member Author

All somewhat addressed, the client is syncing stably again, will close.
