fix: end stale outbound queue immediately on disconnect, auto retry outbound messages #3664

@sdbondi sdbondi commented Dec 16, 2021

Description

  • immediately end the outbound messaging stream on peer connection disconnect
  • if messages remain in the outbound channel, flush them to a retry queue

Motivation and Context

Due to yamux substream internals, the outbound stream does not indicate that it has ended until we attempt to write a message. This causes the outbound stream to remain active (i.e. available for sending) even though it can no longer send a message. That message, and potentially any others that are queued up, will then be dropped. This is rectified by cutting off the outbound queue channel as soon as a connection is disconnected and 'rerouting' the queued messages (if any) to be retried. The retry attempts to reconnect to the peer and, failing that, drops the queued messages, since there is nothing more we can do in this case.
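The flow described above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: `OutboundMessage`, `RetryQueue`, `end_outbound_on_disconnect` and `retry_queued` are hypothetical names, and a blocking `std::sync::mpsc` channel stands in for the async outbound channel used by the comms layer.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical stand-in for a queued outbound message.
#[derive(Debug, Clone, PartialEq)]
pub struct OutboundMessage {
    pub body: Vec<u8>,
}

/// Hypothetical per-peer retry queue (an assumption, not the PR's type).
#[derive(Default)]
pub struct RetryQueue {
    pub pending: VecDeque<OutboundMessage>,
}

/// Called as soon as the connection reports a disconnect. Any messages still
/// buffered in the outbound channel are moved to the retry queue. The receiver
/// is taken by value, so it is dropped on return and later sends fail at once
/// instead of queueing behind a dead stream.
pub fn end_outbound_on_disconnect(
    rx: Receiver<OutboundMessage>,
    retry: &mut RetryQueue,
) -> usize {
    let mut moved = 0;
    // `try_recv` never blocks: it yields buffered messages, then an error.
    while let Ok(msg) = rx.try_recv() {
        retry.pending.push_back(msg);
        moved += 1;
    }
    moved
}

/// Attempts one reconnect. On success the queued messages are re-sent on the
/// new channel; on failure they are dropped. Returns how many were re-sent.
pub fn retry_queued(
    retry: &mut RetryQueue,
    reconnect: impl FnOnce() -> Option<Sender<OutboundMessage>>,
) -> usize {
    match reconnect() {
        Some(tx) => retry
            .pending
            .drain(..)
            .map(|msg| tx.send(msg))
            .filter(Result::is_ok)
            .count(),
        None => {
            // Nothing more can be done for an unreachable peer.
            retry.pending.clear();
            0
        }
    }
}
```

In this sketch the caller would invoke `end_outbound_on_disconnect` from its disconnect handler and then `retry_queued` with a closure that redials the peer; how the real implementation schedules the retry is not shown here.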

This increases messaging reliability.

How Has This Been Tested?

Manually, by sending transaction and ping messages between wallets and base nodes that are banned or shut down

@stringhandler stringhandler merged commit 576a00c into tari-project:weatherwax Jan 3, 2022
@sdbondi sdbondi deleted the comms-messaging-outbound-clear-stale-asap branch January 4, 2022 05:12
sdbondi added a commit to sdbondi/tari that referenced this pull request Jan 7, 2022
* development:
  chore: remove moving lock.mdb (tari-project#3674)
  chore: merge weatherwax
  feat!: provide a compact form of TransactionInput (tari-project#3460)
  v0.22.1.1
  v0.22.1
  ci: add build step (tari-project#3678)
  fix: edge cases causing bans during header/block sync (tari-project#3661)
  fix: end stale outbound queue immediately on disconnect, retry outbound messages (tari-project#3664)
  feat: add search by commitment to explorer (tari-project#3668)
  feat: tari launchpad (tari-project#3671)
  feat: base_node switching for console_wallet when status is offline (tari-project#3639)
  feat: improve wallet recovery and scanning handling of reorgs (tari-project#3655)
  feat: add GRPC call to search for utxo via commitment hex (tari-project#3666)
  feat: custom_base_node in config (tari-project#3651)
  fix: return correct index for include_pruned_utxos = false (tari-project#3663)
sdbondi added a commit to sdbondi/tari that referenced this pull request Jan 10, 2022
* development:
  feat: dibbler new genesis block with faucet utxos (tari-project#3688)
  ci: fix clippy warning on generated proto module (tari-project#3690)
  test: fix metadata signature cucumber (tari-project#3687)
  refactor!: clean up #testnet reset TODOs (tari-project#3682)
  feat(comms)!: add signature to peer identity to allow third party identity updates (tari-project#3629)
  chore: remove moving lock.mdb (tari-project#3674)
  chore: merge weatherwax
  v0.22.1.1
  v0.22.1
  ci: add build step (tari-project#3678)
  fix: edge cases causing bans during header/block sync (tari-project#3661)
  fix: end stale outbound queue immediately on disconnect, retry outbound messages (tari-project#3664)
  feat: add search by commitment to explorer (tari-project#3668)
  feat: tari launchpad (tari-project#3671)
  feat: base_node switching for console_wallet when status is offline (tari-project#3639)
  feat: improve wallet recovery and scanning handling of reorgs (tari-project#3655)
  feat: add GRPC call to search for utxo via commitment hex (tari-project#3666)
  feat: custom_base_node in config (tari-project#3651)
  fix: return correct index for include_pruned_utxos = false (tari-project#3663)
sdbondi added a commit to sdbondi/tari that referenced this pull request Jan 11, 2022
* development: (28 commits)
  feat: covenants implementation (tari-project#3656)
  ci: add tor script to binaries bundle (tari-project#3689)
  chore: remove testnet reset todo in wallet (tari-project#3685)
  feat: dibbler new genesis block with faucet utxos (tari-project#3688)
  ci: fix clippy warning on generated proto module (tari-project#3690)
  test: fix metadata signature cucumber (tari-project#3687)
  refactor!: clean up #testnet reset TODOs (tari-project#3682)
  feat(comms)!: add signature to peer identity to allow third party identity updates (tari-project#3629)
  chore: remove moving lock.mdb (tari-project#3674)
  chore: merge weatherwax
  feat!: provide a compact form of TransactionInput (tari-project#3460)
  fix: allow 0-conf in blockchain db (tari-project#3680)
  v0.22.1.1
  v0.22.1
  ci: add build step (tari-project#3678)
  test: fix cucumber WalletQuery.feature (tari-project#3677)
  test: fix `wait for` step (tari-project#3673)
  fix: edge cases causing bans during header/block sync (tari-project#3661)
  perf(comms)!: optimise connection establishment (tari-project#3658)
  fix: end stale outbound queue immediately on disconnect, retry outbound messages (tari-project#3664)
  ...
aviator-app bot pushed a commit that referenced this pull request Jan 12, 2022
Description
---

- Handle the case where a disconnect caused by a simultaneous dial can remove the new connection from the connectivity state
- Remove messaging protocol timeouts; the messaging protocol will end correctly when disconnected (ref #3664)
- Remove the condition that a connection must have 0 substreams to be reaped. A connection may only be reaped if its age is >= 20 minutes and there are 0 handles held for the connection
- Detect whether the remote outbound stream is closed by reading from it (yamux requires reading to determine if the stream is still alive)
- Ensure the disconnect event can only be emitted once per peer connection
- Remove delayed connection close and "will close" event
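
The reaping rule in the list above can be written as a small predicate. This is a sketch under stated assumptions: `ConnectionInfo`, its fields, `MIN_REAP_AGE` and `can_reap` are hypothetical names chosen for illustration, not identifiers from the comms crate.

```rust
use std::time::Duration;

/// Minimum connection age before reaping, taken from the description above.
pub const MIN_REAP_AGE: Duration = Duration::from_secs(20 * 60);

/// Hypothetical snapshot of a connection's state.
pub struct ConnectionInfo {
    pub age: Duration,
    pub handle_count: usize,
    pub substream_count: usize,
}

/// A connection may be reaped only if it is at least 20 minutes old and no
/// handles are held for it. The substream count is deliberately ignored,
/// mirroring the removal of the "0 substreams" condition.
pub fn can_reap(conn: &ConnectionInfo) -> bool {
    conn.age >= MIN_REAP_AGE && conn.handle_count == 0
}
```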

Motivation and Context
---
Observed a case where two nodes simultaneously dial and end up with no connections to each other due to a mis-timed peer disconnected event. The connection would have to wait for DHT connectivity to eventually redial to recover. Inactivity timeouts for messaging were a patch for the outbound messaging staying open, but this is now properly detected and handled.
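
As a rough illustration of detecting a closed stream by reading from it, the sketch below uses the blocking `std::io::Read` trait as a stand-in for the async yamux substream; `remote_closed` is a hypothetical helper, not part of this change. It relies on the standard `Read` contract that a read into a non-empty buffer returning `Ok(0)` signals end of stream.

```rust
use std::io::{self, Read};

/// Returns `Ok(true)` if the remote side has closed the stream.
/// Assumption: nothing is expected to arrive on this stream, so consuming
/// one byte in the "still open" case is acceptable for the sketch.
pub fn remote_closed<R: Read>(stream: &mut R) -> io::Result<bool> {
    let mut buf = [0u8; 1];
    // A zero-length read into a non-empty buffer means EOF.
    Ok(stream.read(&mut buf)? == 0)
}
```

A real implementation would poll the read side without blocking and would treat an I/O error as a closed stream as well.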

How Has This Been Tested?
---
Existing tests updated
Manually: two base nodes that previously "lost" their connection due to this bug
@sdbondi sdbondi restored the comms-messaging-outbound-clear-stale-asap branch February 3, 2022 05:30