Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base node connectivity starvation - cannot ping active connections or receive metadata #6516

Closed
hansieodendaal opened this issue Aug 31, 2024 · 1 comment
Assignees

Comments

@hansieodendaal
Copy link
Contributor

hansieodendaal commented Aug 31, 2024

Node 1 to node 3

Image

>> get-peer a448be3c78c0aecab52b2e677b2a78928c9c9ddb491f20c533ecbd059e401726
Emoji ID: πŸ‘’πŸŽ’βž•πŸΌπŸšπŸ’°πŸ’‰πŸ“–πŸ’”πŸ£πŸͺπŸ₯🐝πŸ₯πŸšπŸΎπŸ·πŸ‘˜πŸ‘™πŸ”«πŸŽ£πŸŽπŸπŸ’ΌπŸ₯πŸš€πŸ’©πŸ‹πŸ’ƒπŸŽƒπŸ†πŸ—πŸŽ±
Public Key: a448be3c78c0aecab52b2e677b2a78928c9c9ddb491f20c533ecbd059e401726
NodeId: 57f8de99f1371a7e1f5d0110f2
Addresses:
- /onion3/g4uzpwqgf7biyrfp33vbkyfrfsv6vgnylom6xq2lw5plnus7d5uk2fqd:18141 Score: Some(200)  - Source: FromDiscovery Latency: Some(6.222805ms) - Last Seen: 2024-08-31 05:23:23.196727900 - Last Failure:None
- /ip4/192.168.5.100/tcp/9993 Score: Some(200)  - Source: Config Latency: Some(9.376276ms) - Last Seen: 2024-08-30 15:08:21.470117400 - Last Failure:None
User agent: tari/basenode/1.3.0-pre.0
Features: PeerFeatures(MESSAGE_PROPAGATION | DHT_STORE_FORWARD)
Flags: PeerFlags(SEED)
Supported protocols:
- t/mempool/1
- t/mempool-sync/1
- t/dht/1
- t/bnwallet/1
- t/blksync/1
- t/msg/0.1
Last seen: 2024-08-31 05:23:23.196727900
>>
>>
>> dial-peer a448be3c78c0aecab52b2e677b2a78928c9c9ddb491f20c533ecbd059e401726
>> ☎️  Dialing peer...
⚑️ Peer connected in 0ms!
Connection: Id: 159, Node ID: 57f8de99f1371a7e, Direction: Inbound, Peer Address: /ip4/192.168.5.100/tcp/52844, Age: 177751s, #Substreams: 0, #Refs: 2

>>
>> ping-peer a448be3c78c0aecab52b2e677b2a78928c9c9ddb491f20c533ecbd059e401726
πŸ“ Pinging peer...
>>
2024-09-02 08:36:53.118083600 [comms::middleware::message_logging] DEBUG Pre Broadcast [7dcc333714f9b31e]SendMsg(DirectNodeId(57f8de99f1371a7e1f5d0110f2) - <111 bytes>)
2024-09-02 08:36:53.118115400 [comms::dht::outbound::broadcast_middleware] TRACE Processing outbound request SendMsg(DirectNodeId(57f8de99f1371a7e1f5d0110f2) - <111 bytes>)
2024-09-02 08:36:53.118128900 [comms::dht::outbound::broadcast_middleware] TRACE Send params: FinalSendMessageParams { broadcast_strategy: DirectNodeId(NodeId(57f8de99f1371a7e1f5d0110f2)), destination: Unknown, encryption: ClearText, is_discovery_enabled: false, force_origin: false, dht_message_type: None, dht_message_flags: DhtMessageFlags(0x0), dht_header: None, debug_info: Some("Send direct to 57f8de99f1371a7e1f5d0110f2. Source: Start ping round"), tag: Some(MessageTag(11450491070897158372)) }
2024-09-02 08:36:53.118175500 [comms::dht::actor] TRACE DhtActor received request: SelectPeers (Strategy=DirectNodeId(57f8de99f1371a7e1f5d0110f2))
2024-09-02 08:36:53.118530200 [tari_comms::protocol::messaging::outbound] DEBUG comms::messaging::outbound; node_id="57f8de99f1371a7e1f5d0110f2"
2024-09-02 08:36:53.118562300 [comms::protocol::messaging::outbound] DEBUG Attempting to dial peer '57f8de99f1371a7e1f5d0110f2' if required
2024-09-02 08:36:53.118578300 [tari_comms::protocol::messaging::outbound] DEBUG establish_connection; node_id="57f8de99f1371a7e1f5d0110f2"
2024-09-02 08:36:53.118585600 [comms::protocol::messaging::outbound] DEBUG Attempting to establish messaging protocol connection to peer `57f8de99f1371a7e1f5d0110f2`
2024-09-02 08:36:53.118596100 [comms::connectivity::manager] TRACE Request: DialPeer { node_id: NodeId(57f8de99f1371a7e1f5d0110f2), reply_tx: Some(Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) }) }
2024-09-02 08:36:53.118618100 [comms::protocol::messaging::outbound] DEBUG Connection succeeded for peer `57f8de99f1371a7e1f5d0110f2` in 25Β΅s
2024-09-02 08:36:53.118812200 [comms::protocol::messaging::outbound] DEBUG Substream established for peer `57f8de99f1371a7e1f5d0110f2`
2024-09-02 08:36:53.118856500 [tari_co`mms::protocol::messaging::outbound] DEBUG start_forwarding_messages; node_id="57f8de99f1371a7e1f5d0110f2"
2024-09-02 08:36:53.118883500 [comms::protocol::messaging::outbound] DEBUG Starting direct message forwarding for peer `57f8de99f1371a7e1f5d0110f2` (stream: 10270)
2024-09-02 08:36:53.118917900 [comms::protocol::messaging::outbound] TRACE Message for peer '57f8de99f1371a7e1f5d0110f2' sending OutboundMessage (tag: Tag#11450491070897158372, 136 bytes) for peer '57f8de99f1371a7e' on stream 10270
2024-09-02 08:36:53.119003300 [comms::protocol::messaging::outbound] DEBUG Outbound messaging stream 10270 ended for peer 57f8de99f1371a7e1f5d0110f2.
2024-09-02 08:36:53.119023000 [comms::protocol::messaging::outbound] DEBUG Direct message forwarding successfully completed for peer `57f8de99f1371a7e1f5d0110f2` (stream: 10270).
2024-09-02 08:36:53.119041000 [comms::protocol::messaging::outbound] DEBUG Outbound messaging for peer '57f8de99f1371a7e1f5d0110f2' has stopped because the stream was closed
2024-09-02 08:36:53.119057200 [comms::protocol::messaging] TRACE Internal messaging event 'OutboundProtocolExited(57f8de99f1371a7e1f5d0110f2)'

Node 3 to node 1

Image

>> get-peer 3e1186f43d8ef1251fa11816fed69a494ce1392a61f5d7c645bf04a5d0459358
Emoji ID: πŸŽπŸŒ»πŸ°πŸš—πŸŽ€πŸΊπŸš’πŸ•πŸŽπŸ‘ŸπŸ‡πŸ₯‘πŸšΏπŸ”¦πŸ‘–πŸŽ£πŸŽ§πŸ˜‚πŸΈπŸ₯πŸ€πŸšœπŸ”§πŸ“ˆπŸŽ“πŸ’―πŸŽ―πŸ‘£πŸ”‹πŸŽ“πŸ‘€πŸŽ·πŸŽ·
Public Key: 3e1186f43d8ef1251fa11816fed69a494ce1392a61f5d7c645bf04a5d0459358
NodeId: 7dcc333714f9b31ebda93edec1
Addresses:
- /ip4/192.168.5.100/tcp/9991 Score: Some(200)  - Source: Config Latency: Some(5.883554ms) - Last Seen: 2024-08-31 05:23:23.196401800 - Last Failure:None
- /onion3/467lnghqsqdarhnmf7oi5t3roah3iipwdx6apxklhj2k5omztu2frpad:18141 Score: Some(200)  - Source: FromPeerConnection Latency: Some(6.784401ms) - Last Seen: 2024-08-30 15:03:53.937623 - Last Failure:None
User agent: tari/basenode/1.3.0-pre.0
Features: PeerFeatures(MESSAGE_PROPAGATION | DHT_STORE_FORWARD)
Flags: PeerFlags(SEED)
Supported protocols:
- t/mempool/1
- t/blksync/1
- t/dht/1
- t/bnwallet/1
- t/msg/0.1
- t/mempool-sync/1
Last seen: 2024-08-31 05:23:23.196401800
>>
>>
>> dial-peer 3e1186f43d8ef1251fa11816fed69a494ce1392a61f5d7c645bf04a5d0459358
>> ☎️  Dialing peer...
⚑️ Peer connected in 0ms!
Connection: Id: 174, Node ID: 7dcc333714f9b31e, Direction: Outbound, Peer Address: /ip4/192.168.5.100/tcp/9991, Age: 177773s, #Substreams: 0, #Refs: 2

>>
>> ping-peer 3e1186f43d8ef1251fa11816fed69a494ce1392a61f5d7c645bf04a5d0459358
πŸ“ Pinging peer...
>>
2024-09-02 08:37:23.958070800 [comms::middleware::message_logging] DEBUG Pre Broadcast [57f8de99f1371a7e]SendMsg(DirectNodeId(7dcc333714f9b31ebda93edec1) - <109 bytes>)
2024-09-02 08:37:23.958105700 [comms::dht::outbound::broadcast_middleware] TRACE Processing outbound request SendMsg(DirectNodeId(7dcc333714f9b31ebda93edec1) - <109 bytes>)
2024-09-02 08:37:23.958118200 [comms::dht::outbound::broadcast_middleware] TRACE Send params: FinalSendMessageParams { broadcast_strategy: DirectNodeId(NodeId(7dcc333714f9b31ebda93edec1)), destination: Unknown, encryption: ClearText, is_discovery_enabled: false, force_origin: false, dht_message_type: None, dht_message_flags: DhtMessageFlags(0x0), dht_header: None, debug_info: Some("Send direct to 7dcc333714f9b31ebda93edec1. Source: Start ping round"), tag: Some(MessageTag(5447954938505650479)) }
2024-09-02 08:37:23.958132600 [comms::dht::actor] TRACE DhtActor received request: SelectPeers (Strategy=DirectNodeId(7dcc333714f9b31ebda93edec1))
2024-09-02 08:37:23.958308800 [tari_comms::protocol::messaging::outbound] DEBUG comms::messaging::outbound; node_id="7dcc333714f9b31ebda93edec1"
2024-09-02 08:37:23.958341200 [comms::protocol::messaging::outbound] DEBUG Attempting to dial peer '7dcc333714f9b31ebda93edec1' if required
2024-09-02 08:37:23.958373700 [tari_comms::protocol::messaging::outbound] DEBUG establish_connection; node_id="7dcc333714f9b31ebda93edec1"
2024-09-02 08:37:23.958384400 [comms::protocol::messaging::outbound] DEBUG Attempting to establish messaging protocol connection to peer `7dcc333714f9b31ebda93edec1`
2024-09-02 08:37:23.958418100 [comms::connectivity::manager] TRACE Request: DialPeer { node_id: NodeId(7dcc333714f9b31ebda93edec1), reply_tx: Some(Sender { inner: Some(Inner { state: State { is_complete: false, is_closed: false, is_rx_task_set: true, is_tx_task_set: false } }) }) }
2024-09-02 08:37:23.958456500 [comms::protocol::messaging::outbound] DEBUG Connection succeeded for peer `7dcc333714f9b31ebda93edec1` in 53Β΅s
2024-09-02 08:37:23.958511700 [comms::protocol::messaging::outbound] DEBUG Substream established for peer `7dcc333714f9b31ebda93edec1`
2024-09-02 08:37:23.958539200 [tari_comms::protocol::messaging::outbound] DEBUG start_forwarding_messages; node_id="7dcc333714f9b31ebda93edec1"
2024-09-02 08:37:23.958547800 [comms::protocol::messaging::outbound] DEBUG Starting direct message forwarding for peer `7dcc333714f9b31ebda93edec1` (stream: 9487)
2024-09-02 08:37:23.958566600 [comms::protocol::messaging::outbound] TRACE Message for peer '7dcc333714f9b31ebda93edec1' sending OutboundMessage (tag: Tag#5447954938505650479, 133 bytes) for peer '7dcc333714f9b31e' on stream 9487
2024-09-02 08:37:23.958784500 [comms::protocol::messaging::outbound] DEBUG Outbound messaging stream 9487 ended for peer 7dcc333714f9b31ebda93edec1.
2024-09-02 08:37:23.958806400 [comms::protocol::messaging::outbound] DEBUG Direct message forwarding successfully completed for peer `7dcc333714f9b31ebda93edec1` (stream: 9487).
2024-09-02 08:37:23.958815900 [comms::protocol::messaging::outbound] DEBUG Outbound messaging for peer '7dcc333714f9b31ebda93edec1' has stopped because the stream was closed
2024-09-02 08:37:23.958831200 [comms::protocol::messaging] TRACE Internal messaging event 'OutboundProtocolExited(7dcc333714f9b31ebda93edec1)'
SWvheerden pushed a commit that referenced this issue Sep 5, 2024
Description
---
Cleared handshake error form the client side by forcefully disconnecting
form the peer.

Motivation and Context
---
This is to syop the phenomenon whereby a peer can dial another peer
(`dial-eer XXXXX`) but cannot ping that same peer (`ping-peer XXXXX`).
(See #6516)

How Has This Been Tested?
---
Unit tests pass
Long-term system-level testing has to be done to confirm this solved the
issue.

What process can a PR reviewer use to test or verify this change?
---
Review code changes.

<!-- Checklist -->
<!-- 1. Is the title of your PR in the form that would make nice release
notes? The title, excluding the conventional commit
tag, will be included exactly as is in the CHANGELOG, so please think
about it carefully. -->


Breaking Changes
---

- [x] None
- [ ] Requires data directory on base node to be deleted
- [ ] Requires hard fork
- [ ] Other - Please specify

<!-- Does this include a breaking change? If so, include this line as a
footer -->
<!-- BREAKING CHANGE: Description what the user should do, e.g. delete a
database, resync the chain -->
@hansieodendaal
Copy link
Contributor Author

See PR #6655

SWvheerden added a commit that referenced this issue Nov 1, 2024
#6655)

Description
---
Added check connections to the p2p services (`MonitorPeersService`). All
active connections are pinged on a set (slowish) interval (10 times
slower than the _auto ping metadata interval_). The nodes that do not
respond timeously on three consecutive iterations with a corresponding
pong are disconnected. This will help keep the list of active
connections (lazily) up to date.

**Edit:** 
Fixed an error in the liveness service where misbehaving ping peers were
never disconnected. The liveness service and monitor peers service work
hand in hand. Liveness selects 8 randomly connected peers to obtain
metadata from and will disconnect any of those that misbehave after 1
minute (2x ping intervals). The monitor peers service assesses all
connected peers at a much slower pace and disconnects non-responsive
peers after 15 minutes (10 x 3 ping intervals).

Motivation and Context
---
See #6516

How Has This Been Tested?
---
Performed system-level testing. From the log below we can see that 5 of
41 active peer connections did not respond with a ping. Peer
`e19e1454a1e0519866297960ad ` was disconnected because it did not
respond three times in a row,
```
2024-10-29 15:12:07.664466900 [minotari::base_node::monitor_peers] TRACE Found 5 of 41 outbound base node peer 
  connections that did not respond to pings
2024-10-29 15:12:07.664619800 [minotari::base_node::monitor_peers] TRACE Peer e2fa82050c2f7579febafb7e08 
  stats - (iteration, connected, responsive) [(3, true, true), (4, true, true), (5, true, false)]
2024-10-29 15:12:07.664683300 [minotari::base_node::monitor_peers] DEBUG Disconnecting e19e1454a1e0519866297960ad 
  as the peer is no longer responsive - (iteration, connected, responsive) [(2, true, true), (3, true, false), 
  (4, true, false), (5, true, false)]
2024-10-29 15:12:07.665853300 [minotari::base_node::monitor_peers] TRACE Peer 6ea597117476676d5ddcb18153 
  stats - (iteration, connected, responsive) [(1, true, true), (2, true, true), (3, true, true), (4, true, true), 
  (5, true, false)]
2024-10-29 15:12:07.665965500 [minotari::base_node::monitor_peers] TRACE Peer a671f812efe5ab14cbb3c1f9f4 
  stats - (iteration, connected, responsive) [(2, true, true), (3, true, true), (4, true, true), 
  (5, true, false)]
2024-10-29 15:12:07.665997800 [minotari::base_node::monitor_peers] TRACE Peer e336b264e02f611cf4fbf51f22 
  stats - (iteration, connected, responsive) [(2, true, true), (3, true, true), (4, true, true), 
  (5, true, false)]
```

What process can a PR reviewer use to test or verify this change?
---
- Code review
- System-level testing

<!-- Checklist -->
<!-- 1. Is the title of your PR in the form that would make nice release
notes? The title, excluding the conventional commit
tag, will be included exactly as is in the CHANGELOG, so please think
about it carefully. -->


Breaking Changes
---

- [x] None
- [ ] Requires data directory on base node to be deleted
- [ ] Requires hard fork
- [ ] Other - Please specify

<!-- Does this include a breaking change? If so, include this line as a
footer -->
<!-- BREAKING CHANGE: Description what the user should do, e.g. delete a
database, resync the chain -->

---------

Co-authored-by: SW van Heerden <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants