
Client Problems: Max Listener Memory Leak / Log Output Format / Snappy Errors / Invalid Trie Sync Stops #1539

Closed
holgerd77 opened this issue Oct 22, 2021 · 7 comments

Comments

@holgerd77
Member

Various problems with our beloved client 🙂 making things a bit production-unready atm, some of which might have been introduced along the latest Merge work.

Not sure if it's wise to throw all this into one issue, but I also didn't want to blow things up too much; I guess (most) things can be fixed on the side (I will actually start directly after posting here).

1. Max Listener Memory Leak

We still have a memory leak caused by too many event listeners being added and not removed, this is from the server log:

[screenshot: server log with the max listener warning]

Should be easily fixable though.
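
For reference, a minimal sketch of the usual fix pattern (hypothetical names, not the actual client code): every listener registered for a peer/connection gets removed again when that connection ends, so the listener count stays bounded:

```typescript
import { EventEmitter } from 'events'

// Hypothetical sketch: pair every `on()` with a matching cleanup so the
// emitter's listener count stays bounded instead of growing with every
// new connection (which triggers Node's MaxListenersExceededWarning).
function attachPeerListeners(emitter: EventEmitter) {
  const onMessage = (msg: unknown) => {
    // ...handle the message
  }
  emitter.on('message', onMessage)
  emitter.once('close', () => {
    emitter.removeListener('message', onMessage)
  })
}
```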

2. Log Output Format Error

The normal log output (so running from the console with npm run client:start) now always appends a 39m at the end of the colored labels in log lines, see e.g.:

[screenshot: console log output with trailing 39m]

This was almost certainly introduced along the log file work in the latest Merge PR and should be located close to these changes. I had a look into this yesterday but couldn't completely locate and fix it yet.
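
For context, a guess at the mechanism (not a confirmed diagnosis): 39m is the tail of the ANSI sequence \x1b[39m that resets the foreground color after a colored label, so if a formatter mangles or strips part of the escape sequence, the remainder shows up as literal text:

```typescript
// '\x1b[32m...\x1b[39m' renders a label in green and then resets the
// foreground color. Stripping only the ESC (\x1b) control characters
// leaves the rest of the sequences visible as plain text.
const label = '\x1b[32mINFO\x1b[39m'
console.log(label)                      // green "INFO" in a terminal
console.log(label.replace(/\x1b/g, '')) // "[32mINFO[39m" - stray sequence text
```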

3. Snappy Errors

There are new snappy errors looking like this:

Oct 21 20:23:26 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:26] New sync target height number=13463136 hash=ca61a4a2...
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]: ERROR [10-21|20:23:31] Error: Invalid Snappy bitstream
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Object.uncompress (/home/ubuntu/ethereumjs-monorepo/node_modules/snappyjs/index.js:88:13)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Peer._handleBody (/home/ubuntu/ethereumjs-monorepo/packages/devp2p/src/rlpx/peer.ts:540:26)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Peer._onSocketData (/home/ubuntu/ethereumjs-monorepo/packages/devp2p/src/rlpx/peer.ts:601:18)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.emit (events.js:400:28)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.emit (domain.js:475:12)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at addChunk (internal/streams/readable.js:293:12)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at readableAddChunk (internal/streams/readable.js:267:9)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at Socket.Readable.push (internal/streams/readable.js:206:10)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]:     at TCP.onStreamRead (internal/stream_base_commons.js:188:23)
Oct 21 20:23:31 ip-10-0-101-22 nvm-exec[34634]: WARN [10-21|20:23:31] Server error: Error - Invalid Snappy bitstream
Oct 21 20:23:38 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:38] Imported blocks count=50 number=128919 hash=b74600d5... hardfork=chainstart peers=9
Oct 21 20:23:38 ip-10-0-101-22 nvm-exec[34634]: INFO [10-21|20:23:38] Executed blocks count=50 first=128918 hash=7bac555a... td=298457505472177195 hardfork=chainstart last=128968 hash=8c881ae2... with txs=11

These don't seem to break the control flow, but they might have bad consequences under the hood (like a normally good peer getting disconnected or the like) and should therefore be examined. It might be that my latest Snappy fix in #1484 didn't fully address the problem (it was somewhat hacky in the first place).

The errors occur regularly, so this should be relatively easy to reproduce.
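
A minimal sketch of the kind of defensive handling this suggests (using the snappyjs uncompress call from the stack trace; this is not the actual peer.ts code):

```typescript
import * as snappy from 'snappyjs'

// Hedged sketch: tolerate a malformed Snappy payload from a single peer
// instead of letting the exception bubble up into the server error handler.
function safeUncompress(body: Buffer): Buffer | null {
  try {
    return snappy.uncompress(body)
  } catch (e) {
    // Invalid bitstream: drop this message (and possibly penalize the
    // peer) rather than failing the whole socket data handler.
    return null
  }
}
```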

4. Invalid Trie Sync Stops

Last thing, and likely the most severe/annoying one: sync (tested on mainnet on the server) breaks regularly (after a couple of 10,000 up to 100,000 blocks) with an invalid receipt trie, invalid tx trie, or similar message, see e.g.:

[screenshot: invalid trie error in the sync log]

This completely breaks the sync flow, and sync is never picked up after that (bad 🙃). The (somewhat) good thing is that sync will continue once the client is restarted. So this rather hints at something like processing a block in the wrong order, or occasionally handling a block or block Common in a way that breaks the block setup and/or structure.

Anyhow. To be analyzed.

Not totally easy, since this can only be triggered relatively irregularly (it does happen on every client run though, after at most 30-60 minutes).

I guess the best approach might be to carefully craft in some meaningful debugging output additions and then just let the client run on the side. Might give this a try during the day.
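
For illustration, such an addition could be a scoped logger via the debug package (namespace and call site are made up here):

```typescript
import { debug as createDebugLogger } from 'debug'

// Illustrative only: a scoped debug logger that can be switched on via
// e.g. `DEBUG=client:fetcher:store npm run client:start` without touching
// the normal log output.
const debugLog = createDebugLogger('client:fetcher:store')

function debugStore(first: number, count: number) {
  debugLog(`storing block range first=${first} count=${count}`)
}
```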

@holgerd77
Member Author

And the extended error messages from #1540 directly in action, in the context of issue 4:

[screenshot: extended error message output]

(even directly picking up the coloring 🙂)

@holgerd77
Member Author

The client is not yet stable - especially regarding 4. - so everyone who wants to debug, please try to sync a couple of thousand blocks (> 5,000 or so).

@holgerd77
Member Author

Also just dropping this here: another thing, and maybe it was a one-time occurrence, but since it was very spooky I don't want it to get lost. On one of the latest server runs I had:

[screenshot]

@ryanio
Contributor

ryanio commented Oct 26, 2021

Nice, the new errors are already coming in handy. Yes, it looks like we are inserting blocks out of order: this 79923 one seems to have jumped in 6 blocks early, and 80643 being executed after it is really strange. Going to take a deeper look at the blockfetcher.

INFO [10-25|18:02:53] Imported blocks count=50 number=79866 hash=cba85ba8... hardfork=chainstart peers=9
finished:  10
INFO [10-25|18:02:53] Executed blocks count=50 first=79865 hash=da065ba6... td=105538793080819005 hardfork=chainstart last=79915 hash=f310c295... with txs=37
INFO [10-25|18:02:53] Executed blocks count=2 first=79915 hash=f310c295... td=105541141406504755 hardfork=chainstart last=79917 hash=de1755a1... with txs=38
WARN [10-25|18:02:53] Error along storing received block or header result: Error: invalid uncle hash (block number=79923 hash=0x818faf022269da8c3f20ba3630a8b90e3b6ba2c1769e74509e40e325114d2112 hf=chainstart baseFeePerGas=none txs=0 uncles=1)
INFO [10-25|18:02:57] New sync target height number=13489962 hash=29820ffe...
WARN [10-25|18:02:57] Error along storing received block or header result: Error: could not find parent header (block header number=80643 hash=0x4fd6cce23237ea6377c86ac084d9ded97bda7cf773b36795fae315953d13d103 hf=chainstart baseFeePerGas=none)

I have some fixes for the snappy uncompress coming in a PR.


Here's another clue: this one was properly executed in order, but when I look up this block on Etherscan it says there is 1 tx, while below it says txs=0. So perhaps the body is returning with no result (or timing out) and still being inserted instead of being invalidated or the information re-requested. Will look into it at blockfetcher.ts:L42.

WARN [10-25|18:18:19] Error along storing received block or header result: Error: invalid transaction trie (block number=92789 hash=0x05d00c31f341c74e3103f7f49930dc5b8ef55dd3a5508d9be14399e09260b2fd hf=chainstart baseFeePerGas=none txs=0 uncles=0)
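
A quick sanity check along those lines (a sketch under the assumption that KECCAK256_RLP from ethereumjs-util is the empty-trie root; the function itself is hypothetical): if the header commits to a non-empty transactions trie, a body with zero transactions must be incomplete and should be re-requested rather than stored:

```typescript
import { KECCAK256_RLP } from 'ethereumjs-util'

// Hedged sketch: KECCAK256_RLP is the root of an empty trie. A header
// whose transactionsTrie differs from it promises at least one tx, so a
// body result with txCount === 0 cannot be complete.
function bodyLooksComplete(transactionsTrie: Buffer, txCount: number): boolean {
  const headerExpectsNoTxs = transactionsTrie.equals(KECCAK256_RLP)
  return headerExpectsNoTxs ? txCount === 0 : txCount > 0
}
```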

@holgerd77
Member Author

Issue 4 is still present, see e.g. the following screenshot from the latest client run on the server (I actually needed to fix the service script for this, which is now working):

[screenshot: invalid trie error on the latest client run]

Should get easier to debug with the Fetcher debug logger from #1544 activated.

@holgerd77
Member Author

Here is some debug output from another failure; the failing block at the end is 672044:

[screenshot: fetcher debug output leading up to block 672044]

The associated job is re-enqueued in the fetcher ("undefined or empty result set returned"). The re-executed request can also be seen in the screenshot above.

Then - somewhat further down the line - this happens:

[screenshot: block reported with 0 transactions]

The respective block is shown as a 0-transaction block - which is wrong, since according to Etherscan the block contains 1 transaction. This leads to the transaction trie being wrong (respectively the root not matching) (I guess a similar observation has already been made by Ryan @ryanio).

Hmm, still a bit hard to debug; it's not yet clear to me what the connection is between the re-request and the block falsely having (or being instantiated with) no transactions. 🤔
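
For illustration, a sketch of the guard these observations suggest (all names hypothetical, not the actual fetcher code): an undefined or empty result set should only ever re-enqueue the job, never feed into block construction:

```typescript
// Hypothetical sketch of the re-enqueue path: never turn an empty body
// result into a block; retry the job instead (ideally with another peer).
interface Job<R> {
  task: { first: number; count: number }
  result?: R[]
}

function handleBodyResult<R>(
  job: Job<R>,
  result: R[] | undefined,
  enqueue: (job: Job<R>) => void,
  store: (result: R[]) => void
) {
  if (result === undefined || result.length === 0) {
    enqueue(job) // "undefined or empty result set returned" -> retry
    return
  }
  store(result)
}
```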

@holgerd77
Member Author

All somewhat addressed, the client is syncing stably again, will close.
