[base node] Chain sync issue - connected but do not sync to better chain #4637

Closed
hansieodendaal opened this issue Sep 7, 2022 · 8 comments

@hansieodendaal
Contributor

hansieodendaal commented Sep 7, 2022

Note: This happened a lot during stress testing.

Something is wrong with syncing: the base node does not recognize that it is on an alternate chain while it has connections to the better chain.

list-connections also reported incorrect metadata about the connected peers' chain tips.

This node

Public Key: 0aec98b66a4a2f9c1e58cbdfabbb8db9c56f2d1bc1e44e97445fbcbb554ae009
Node ID: c1050429d43f8bbb8a88c0bab7
Public Address: /onion3/hcpiitkxou5sy3goft6gbwclzaazxwkkkltxdr5hpo5twlypwt7zfzad:18141

had a better chain at

Height of longest chain : 2921
Total accumulated difficulty: 97361740964247006656844
Best block : ad0e0fe4a48b75c56da012f613173db1d98bd73ac3b4fa3113844acfc922a901

vs

Height of longest chain : 2924
Total accumulated difficulty: 97347400349453555030400
Best block : a9b519d3df1c4034c9f3ed2daa8cecaaecd2f40a00c318f32b6b7a9c6117ce82

but the node on the latter (longer but lower-difficulty) chain did not sync to it.
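
For context, chain selection is by total accumulated difficulty, not height: the 2921-block chain above is the stronger one despite being shorter. A minimal sketch of that comparison (illustrative types only, not the actual tari_core code):

```rust
// Illustrative types only, not the actual tari_core structures.
#[derive(Debug, Clone, Copy)]
struct ChainMetadata {
    height: u64,
    // u128 comfortably holds values like 97361740964247006656844.
    accumulated_difficulty: u128,
}

/// A peer's chain is "better" only if its total accumulated difficulty is
/// strictly greater than ours; height plays no role in the decision.
fn should_sync_to(ours: &ChainMetadata, theirs: &ChainMetadata) -> bool {
    theirs.accumulated_difficulty > ours.accumulated_difficulty
}

fn main() {
    let this_node = ChainMetadata {
        height: 2921,
        accumulated_difficulty: 97_361_740_964_247_006_656_844,
    };
    let peer = ChainMetadata {
        height: 2924,
        accumulated_difficulty: 97_347_400_349_453_555_030_400,
    };
    // The peer's chain is taller but weaker, so syncing should go the other way.
    assert!(!should_sync_to(&this_node, &peer));
    assert!(should_sync_to(&peer, &this_node));
}
```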

@hansieodendaal
Contributor Author

See #4657 submitted by @sdbondi

@hansieodendaal
Contributor Author

hansieodendaal commented Sep 12, 2022

High-volume stress test still outstanding:
The same behavior was seen on 3 out of 4 nodes in a transaction send (make-it-rain) stress test. Some stats from one of them:

  • Base node started at 09-12 17:39:19, tests began soon after
  • p2p::services::liveness last receive at 09-12 18:51:50
  • comms::pipeline::outbound stopped at 09-12 18:51:51: concurrent pipelines reached the maximum (50) and did not recover
  • comms::pipeline::inbound stopped at 09-12 18:51:56: concurrent pipelines reached the maximum (50) and did not recover (see the pipeline sketch after this list)
  • p2p::services::liveness last send at 09-12 18:52:20
  • comms::protocol::messaging::outbound last activity at 09-13 09:47:23
  • comms::protocol::messaging::inbound still alive (receives messages)
  • ping-peer, dial-peer, discover-peer did not work after the pipelines locked up
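
The numbers above read like a bounded-concurrency stage whose permits are taken but never returned. A minimal tokio sketch of that failure mode, assuming a semaphore-style limit (illustrative, not the actual comms::pipeline implementation):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// The logs above report a maximum of 50 concurrent pipeline tasks.
const MAX_CONCURRENT_TASKS: usize = 50;

/// Sketch of a bounded pipeline: each message must acquire a permit before
/// it is handled. If 50 handler tasks hang, every permit stays checked out,
/// `acquire_owned` waits forever, and the loop stops pulling messages --
/// matching the "reached the maximum (50) and did not recover" symptom.
async fn run_pipeline(mut messages: mpsc::Receiver<Vec<u8>>) {
    let permits = Arc::new(Semaphore::new(MAX_CONCURRENT_TASKS));
    while let Some(msg) = messages.recv().await {
        let permit = permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        tokio::spawn(async move {
            handle_message(msg).await;
            drop(permit); // Returned only if the handler actually finishes.
        });
    }
}

async fn handle_message(_msg: Vec<u8>) {
    // Placeholder for the real inbound/outbound message handler.
}
```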

@hansieodendaal hansieodendaal moved this from In Progress to Under observation in Tari Esme Testnet Sep 12, 2022
@SWvheerden
Collaborator

08:04 WARN  Too many (40) connection failures, cooldown is in effect
08:05 WARN  0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.
08:06 WARN  Failed to read wire format byte within timeout of 45s. deadline has elapsed
08:06 WARN  Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

getting lots of these

@stringhandler stringhandler moved this from Under observation to Bugs in Tari Esme Testnet Sep 13, 2022
@stringhandler stringhandler moved this from Bugs to Must Do in Tari Esme Testnet Sep 13, 2022
@stringhandler stringhandler moved this from Must Do to In Progress in Tari Esme Testnet Sep 13, 2022
@hansieodendaal
Contributor Author

08:04 WARN  Too many (40) connection failures, cooldown is in effect
08:05 WARN  0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.
08:06 WARN  Failed to read wire format byte within timeout of 45s. deadline has elapsed
08:06 WARN  Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

getting lots of these

I think we need to open a separate observation issue for this, as I do not think it is related.

@sdbondi
Member

sdbondi commented Sep 14, 2022

@SWvheerden I'll clarify what these are, in case there is any misunderstanding of what they mean.

08:04 WARN Too many (40) connection failures, cooldown is in effect

  • The DHT periodically tries to connect to neighbouring nodes, and these are not always available. If they are unavailable for a long time they are removed. We "cool down" this process once we have 40 connection failures, to ensure that we don't overdo it / churn. Seeing this is not a symptom of unreliable connections; it is actually fairly expected. It does feel like we need to do this less aggressively. (See the sketch below.)
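
In code terms, the behaviour described is roughly a cumulative failure counter that trips a cooldown at a threshold. A hypothetical sketch, not the actual DHT connectivity code:

```rust
use std::time::{Duration, Instant};

// Threshold from the warning; the cooldown length here is an assumption.
const MAX_FAILURES: usize = 40;
const COOLDOWN_PERIOD: Duration = Duration::from_secs(60);

/// Hypothetical tracker: after MAX_FAILURES cumulative connection failures,
/// further dial attempts are suppressed for COOLDOWN_PERIOD, producing the
/// "Too many (40) connection failures, cooldown is in effect" warning.
struct DialCooldown {
    failures: usize,
    cooldown_until: Option<Instant>,
}

impl DialCooldown {
    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= MAX_FAILURES {
            self.cooldown_until = Some(Instant::now() + COOLDOWN_PERIOD);
            self.failures = 0;
        }
    }

    fn can_dial(&self) -> bool {
        self.cooldown_until.map_or(true, |until| Instant::now() >= until)
    }
}
```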

08:05 WARN 0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.

  • We post a warning if the node is not connected to at least x% TCP nodes. You can disregard this message until we actually deploy enough TCP nodes and implement a connection bias towards TCP. This message is completely expected.

08:06 WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed

  • Interesting one, are you really getting a lot of these? This means that something has connected to your p2p socket but isn't sending the first byte within 45s. It could be many things, including internet connection reliability/dropping, Tor reliability, something else connecting to the socket (e.g. telnet), or a Tari p2p node actually not sending within 45s (given this is the first step in the connection, that seems unlikely). Let me know if you really are seeing a lot of these. They are expected in rare cases.

08:06 WARN Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

  • This is a follow-on from the previous message. (See the sketch below.)
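
For reference, the check behind the last two warnings amounts to reading a single leading byte from a new connection under a deadline. A sketch assuming tokio I/O (not the actual tari_comms wire-mode code):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

const WIRE_FORMAT_TIMEOUT: Duration = Duration::from_secs(45);
const NETWORK_BYTE: u8 = 0x26; // "network byte 26" from the warning
const LIVENESS_BYTE: u8 = 0xa6; // "liveness byte a6" from the warning

/// Reads the first byte a new inbound connection sends. The timeout branch
/// corresponds to the first WARN line; rejecting an unexpected byte
/// corresponds to the second.
async fn read_wire_format<S: AsyncRead + Unpin>(socket: &mut S) -> Result<u8, String> {
    let mut buf = [0u8; 1];
    match timeout(WIRE_FORMAT_TIMEOUT, socket.read_exact(&mut buf)).await {
        Err(_elapsed) => Err("failed to read wire format byte within 45s".into()),
        Ok(Err(io_err)) => Err(format!("socket error: {io_err}")),
        Ok(Ok(_)) => match buf[0] {
            NETWORK_BYTE | LIVENESS_BYTE => Ok(buf[0]),
            other => Err(format!("unexpected wire format byte {other:#04x}")),
        },
    }
}
```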

@SWvheerden
Collaborator

08:04 WARN Too many (40) connection failures, cooldown is in effect
This I get roughly every 10-20 minutes. If I understand this correctly, in that 10-20 minutes 40 of my neighboring nodes disappeared. Or am I completely missing what's happening here?

WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed
For this log, all I can tell you is that I have not seen it locally in 4 hours (that's how far back my network logs go). But I do keep seeing it locally every now and then, and in other logs I have been sent to check. We should definitely note this down and keep an eye on it.

@sdbondi
Member

sdbondi commented Sep 14, 2022

08:04 WARN Too many (40) connection failures, cooldown is in effect This I get roughly every 10-20 minutes. If I understand this correctly, in that 10-20 minutes 40 of my neighboring nodes disappeared. Or am I completely missing what's happening here?

WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed For this log, all I can tell you is that I have not seen it locally in 4 hours (that's how far back my network logs go). But I do keep seeing it locally every now and then, and in other logs I have been sent to check. We should definitely note this down and keep an eye on it.

  1. No: we've tried to connect to nodes over time, and of those nodes some have failed (presumably offline). This message indicates that this has happened 40 times, over any time period; we should probably count over a time period instead (see the sliding-window sketch after this list).
  2. Yeah, it's really unlikely to be a bug in your node, but if an external node is doing that to your node, then something is going wrong there. 45 seconds is ages to send the first byte.
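
A sliding-window count, as suggested in point 1, could look something like this (a sketch of the proposal, not existing code):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

const MAX_FAILURES: usize = 40;
// Window length is an assumption; the thread only suggests "a time period".
const WINDOW: Duration = Duration::from_secs(10 * 60);

/// Counts connection failures within the last WINDOW instead of over the
/// node's whole lifetime, so 40 failures spread across many hours would no
/// longer trigger the cooldown.
struct FailureWindow {
    failures: VecDeque<Instant>,
}

impl FailureWindow {
    fn record_failure(&mut self) {
        self.failures.push_back(Instant::now());
    }

    fn should_cooldown(&mut self) -> bool {
        if let Some(cutoff) = Instant::now().checked_sub(WINDOW) {
            // Drop failures that have fallen out of the window.
            while matches!(self.failures.front(), Some(&t) if t < cutoff) {
                self.failures.pop_front();
            }
        }
        self.failures.len() >= MAX_FAILURES
    }
}
```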

@hansieodendaal hansieodendaal moved this from In Progress to Under observation in Tari Esme Testnet Sep 16, 2022
@hansieodendaal
Contributor Author

This issue was tested thoroughly during 3 stress tests. All base nodes and wallets used [p2p] max_concurrent_inbound_tasks = max_concurrent_outbound_tasks = 2. Everything worked fine and the blockchain stayed synchronized throughout. Fixes to the p2p inbound and outbound pipelines by @sdbondi cleared up the issue.
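
For reference, the workaround used in those tests corresponds to a config.toml along these lines (the two keys are quoted from this comment; the exact file layout is an assumption):

```toml
[p2p]
# Limit concurrent pipeline tasks per direction; the log lines above
# suggest the previous limit was 50.
max_concurrent_inbound_tasks = 2
max_concurrent_outbound_tasks = 2
```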

Repository owner moved this from Under observation to Done in Tari Esme Testnet Sep 20, 2022