[base node] Chain sync issue - connected but do not sync to better chain #4637

Closed
hansieodendaal opened this issue Sep 7, 2022 · 8 comments

@hansieodendaal
Contributor

hansieodendaal commented Sep 7, 2022

Note: This happened a lot during stress testing.

Something is wrong with syncing: the base node does not recognize that it is on an alternate chain while it has connections to the better chain.

list-connections also reported incorrect metadata about the connected peers' chain tips.

This node

Public Key: 0aec98b66a4a2f9c1e58cbdfabbb8db9c56f2d1bc1e44e97445fbcbb554ae009
Node ID: c1050429d43f8bbb8a88c0bab7
Public Address: /onion3/hcpiitkxou5sy3goft6gbwclzaazxwkkkltxdr5hpo5twlypwt7zfzad:18141

had a better chain at

Height of longest chain : 2921
Total accumulated difficulty: 97361740964247006656844
Best block : ad0e0fe4a48b75c56da012f613173db1d98bd73ac3b4fa3113844acfc922a901

vs

Height of longest chain : 2924
Total accumulated difficulty: 97347400349453555030400
Best block : a9b519d3df1c4034c9f3ed2daa8cecaaecd2f40a00c318f32b6b7a9c6117ce82

but the node on the latter (longer but lower-difficulty) chain did not sync to it.
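
For context, chain selection is by total accumulated difficulty, not height: the 2921-block chain above is the stronger one despite being shorter. A minimal sketch of that comparison (illustrative types only, not the actual tari_core code):

```rust
// Illustrative types only, not the actual tari_core structures.
#[derive(Debug, Clone, Copy)]
struct ChainMetadata {
    height: u64,
    // u128 comfortably holds values like 97361740964247006656844.
    accumulated_difficulty: u128,
}

/// A peer's chain is "better" only if its total accumulated difficulty is
/// strictly greater than ours; height plays no role in the decision.
fn should_sync_to(ours: &ChainMetadata, theirs: &ChainMetadata) -> bool {
    theirs.accumulated_difficulty > ours.accumulated_difficulty
}

fn main() {
    let this_node = ChainMetadata {
        height: 2921,
        accumulated_difficulty: 97_361_740_964_247_006_656_844,
    };
    let peer = ChainMetadata {
        height: 2924,
        accumulated_difficulty: 97_347_400_349_453_555_030_400,
    };
    // The peer's chain is taller but weaker, so syncing should go the other way.
    assert!(!should_sync_to(&this_node, &peer));
    assert!(should_sync_to(&peer, &this_node));
}
```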

@hansieodendaal
Contributor Author

See #4657 submitted by @sdbondi

@hansieodendaal
Contributor Author

hansieodendaal commented Sep 12, 2022

High-volume stress test still outstanding:
The same behavior was seen on 3 out of 4 nodes in a transaction send (make-it-rain) stress test. Some stats from one of them:

  • Base node started at 09-12 17:39:19, tests began soon after
  • p2p::services::liveness last receive at 09-12 18:51:50
  • comms::pipeline::outbound stopped at 09-12 18:51:51: concurrent pipelines reached the maximum (50) and did not recover
  • comms::pipeline::inbound stopped at 09-12 18:51:56: concurrent pipelines reached the maximum (50) and did not recover (see the pipeline sketch after this list)
  • p2p::services::liveness last send at 09-12 18:52:20
  • comms::protocol::messaging::outbound last activity at 09-13 09:47:23
  • comms::protocol::messaging::inbound still alive (receives messages)
  • ping-peer, dial-peer, discover-peer did not work after the pipelines locked up
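
The numbers above read like a bounded-concurrency stage whose permits are taken but never returned. A minimal tokio sketch of that failure mode, assuming a semaphore-style limit (illustrative, not the actual comms::pipeline implementation):

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// The logs above report a maximum of 50 concurrent pipeline tasks.
const MAX_CONCURRENT_TASKS: usize = 50;

/// Sketch of a bounded pipeline: each message must acquire a permit before
/// it is handled. If 50 handler tasks hang, every permit stays checked out,
/// `acquire_owned` waits forever, and the loop stops pulling messages --
/// matching the "reached the maximum (50) and did not recover" symptom.
async fn run_pipeline(mut messages: mpsc::Receiver<Vec<u8>>) {
    let permits = Arc::new(Semaphore::new(MAX_CONCURRENT_TASKS));
    while let Some(msg) = messages.recv().await {
        let permit = permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        tokio::spawn(async move {
            handle_message(msg).await;
            drop(permit); // Returned only if the handler actually finishes.
        });
    }
}

async fn handle_message(_msg: Vec<u8>) {
    // Placeholder for the real inbound/outbound message handler.
}
```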

@hansieodendaal hansieodendaal moved this from In Progress to Under observation in Tari Esme Testnet Sep 12, 2022
@SWvheerden
Collaborator

08:04 WARN  Too many (40) connection failures, cooldown is in effect
08:05 WARN  0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.
08:06 WARN  Failed to read wire format byte within timeout of 45s. deadline has elapsed
08:06 WARN  Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

getting lots of these

@stringhandler stringhandler moved this from Under observation to Bugs in Tari Esme Testnet Sep 13, 2022
@stringhandler stringhandler moved this from Bugs to Must Do in Tari Esme Testnet Sep 13, 2022
@stringhandler stringhandler moved this from Must Do to In Progress in Tari Esme Testnet Sep 13, 2022
@hansieodendaal
Contributor Author

08:04 WARN  Too many (40) connection failures, cooldown is in effect
08:05 WARN  0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.
08:06 WARN  Failed to read wire format byte within timeout of 45s. deadline has elapsed
08:06 WARN  Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

getting lots of these

I think we need to open a separate observation issue for this, as I do not think it is related.

@sdbondi
Member

sdbondi commented Sep 14, 2022

@SWvheerden I'll clarify what these are, in case there is any misunderstanding of what they mean.

08:04 WARN Too many (40) connection failures, cooldown is in effect

  • The DHT periodically tries to connect to neighbouring nodes, and these are not always available. If they are unavailable for a long time they are removed. We "cool down" this process once we have 40 connection failures, to ensure that we don't overdo it / churn. Seeing this is not a symptom of unreliable connections; it is actually fairly expected. It does feel like we need to do this less aggressively. (See the sketch below.)
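
In code terms, the behaviour described is roughly a cumulative failure counter that trips a cooldown at a threshold. A hypothetical sketch, not the actual DHT connectivity code:

```rust
use std::time::{Duration, Instant};

// Threshold from the warning; the cooldown length here is an assumption.
const MAX_FAILURES: usize = 40;
const COOLDOWN_PERIOD: Duration = Duration::from_secs(60);

/// Hypothetical tracker: after MAX_FAILURES cumulative connection failures,
/// further dial attempts are suppressed for COOLDOWN_PERIOD, producing the
/// "Too many (40) connection failures, cooldown is in effect" warning.
struct DialCooldown {
    failures: usize,
    cooldown_until: Option<Instant>,
}

impl DialCooldown {
    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= MAX_FAILURES {
            self.cooldown_until = Some(Instant::now() + COOLDOWN_PERIOD);
            self.failures = 0;
        }
    }

    fn can_dial(&self) -> bool {
        self.cooldown_until.map_or(true, |until| Instant::now() >= until)
    }
}
```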

08:05 WARN 0.0% of this node's 18 connections are using TCPv4. This node requires at least 10.0% of nodes to be TCP nodes.

  • We post a warning if the node is not connected to at least x% TCP nodes. You can disregard this message until we actually deploy enough TCP nodes and implement a connection bias towards TCP. This message is completely expected.

08:06 WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed

  • Interesting one, are you really getting a lot of these? This means that something has connected to your p2p socket but isn't sending the first byte within 45s. It could be many things, including internet connection reliability/dropping, Tor reliability, something else connecting to the socket (e.g. telnet), or a Tari p2p node actually not sending within 45s (given this is the first step in the connection, that seems unlikely). Let me know if you really are seeing a lot of these. They are expected in rare cases.

08:06 WARN Peer at address '/ip4/127.0.0.1/tcp/65033' failed to send its wire format. Expected network byte 26 or liveness byte a6 not received. Error: timed out

  • This is a follow-on from the previous message. (See the sketch below.)
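
For reference, the check behind the last two warnings amounts to reading a single leading byte from a new connection under a deadline. A sketch assuming tokio I/O (not the actual tari_comms wire-mode code):

```rust
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

const WIRE_FORMAT_TIMEOUT: Duration = Duration::from_secs(45);
const NETWORK_BYTE: u8 = 0x26; // "network byte 26" from the warning
const LIVENESS_BYTE: u8 = 0xa6; // "liveness byte a6" from the warning

/// Reads the first byte a new inbound connection sends. The timeout branch
/// corresponds to the first WARN line; rejecting an unexpected byte
/// corresponds to the second.
async fn read_wire_format<S: AsyncRead + Unpin>(socket: &mut S) -> Result<u8, String> {
    let mut buf = [0u8; 1];
    match timeout(WIRE_FORMAT_TIMEOUT, socket.read_exact(&mut buf)).await {
        Err(_elapsed) => Err("failed to read wire format byte within 45s".into()),
        Ok(Err(io_err)) => Err(format!("socket error: {io_err}")),
        Ok(Ok(_)) => match buf[0] {
            NETWORK_BYTE | LIVENESS_BYTE => Ok(buf[0]),
            other => Err(format!("unexpected wire format byte {other:#04x}")),
        },
    }
}
```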

@SWvheerden
Collaborator

08:04 WARN Too many (40) connection failures, cooldown is in effect
This I get roughly every 10-20 minutes. If I understand this correctly, in that 10-20 minutes 40 of my neighboring nodes disappeared. Or am I completely missing what's happening here?

WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed
For this log, all I can tell you is that I have not seen it locally in 4 hours (that's how far back my network logs go). But I do keep seeing it locally every now and then, and in other logs I have been sent to check. We should definitely note this down and keep an eye on it.

@sdbondi
Member

sdbondi commented Sep 14, 2022

08:04 WARN Too many (40) connection failures, cooldown is in effect This I get roughly every 10-20 minutes. If I understand this correctly, in that 10-20 minutes 40 of my neighboring nodes disappeared. Or am I completely missing what's happening here?

WARN Failed to read wire format byte within timeout of 45s. deadline has elapsed For this log, all I can tell you is that I have not seen it locally in 4 hours (that's how far back my network logs go). But I do keep seeing it locally every now and then, and in other logs I have been sent to check. We should definitely note this down and keep an eye on it.

  1. No: we've tried to connect to nodes over time, and of those nodes some have failed (presumably offline). This message indicates that this has happened 40 times, over any time period; we should probably count over a time period instead (see the sliding-window sketch after this list).
  2. Yeah, it's really unlikely to be a bug in your node, but if an external node is doing that to your node, then something is going wrong there. 45 seconds is ages to send the first byte.
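
A sliding-window count, as suggested in point 1, could look something like this (a sketch of the proposal, not existing code):

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

const MAX_FAILURES: usize = 40;
// Window length is an assumption; the thread only suggests "a time period".
const WINDOW: Duration = Duration::from_secs(10 * 60);

/// Counts connection failures within the last WINDOW instead of over the
/// node's whole lifetime, so 40 failures spread across many hours would no
/// longer trigger the cooldown.
struct FailureWindow {
    failures: VecDeque<Instant>,
}

impl FailureWindow {
    fn record_failure(&mut self) {
        self.failures.push_back(Instant::now());
    }

    fn should_cooldown(&mut self) -> bool {
        if let Some(cutoff) = Instant::now().checked_sub(WINDOW) {
            // Drop failures that have fallen out of the window.
            while matches!(self.failures.front(), Some(&t) if t < cutoff) {
                self.failures.pop_front();
            }
        }
        self.failures.len() >= MAX_FAILURES
    }
}
```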

@hansieodendaal hansieodendaal moved this from In Progress to Under observation in Tari Esme Testnet Sep 16, 2022
@hansieodendaal
Contributor Author

This issue was tested thoroughly during 3 stress tests. All base nodes and wallets used [p2p] max_concurrent_inbound_tasks = max_concurrent_outbound_tasks = 2. Everything worked fine and the blockchain stayed synchronized throughout. Fixes to the p2p inbound and outbound pipelines by @sdbondi cleared up the issue.
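
For reference, the workaround used in those tests corresponds to a config.toml along these lines (the two keys are quoted from this comment; the exact file layout is an assumption):

```toml
[p2p]
# Limit concurrent pipeline tasks per direction; the log lines above
# suggest the previous limit was 50.
max_concurrent_inbound_tasks = 2
max_concurrent_outbound_tasks = 2
```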

Repository owner moved this from Under observation to Done in Tari Esme Testnet Sep 20, 2022