fix(comms): simplify and remove possibility of deadlock from pipelines and substream close #4676

@sdbondi sdbondi commented Sep 14, 2022

Description

  • Simplify outbound pipeline by removing the [pipeline] -> [messaging] channel
  • Pipe outbound messages directly to the messaging protocol instead of through the outbound pipeline
  • Fix rare lockup when calling yamux control close

Motivation and Context

The outbound pipeline needed to poll two channels in order to make progress. Some code branches in the outbound pipeline may need to use the other channel; if that channel is full and the maximum number of concurrent outbound tasks is already running, a deadlock will occur. This case has not been directly observed, but it is technically possible and so should be eliminated.

This PR removes the [pipeline] -> [messaging] channel, making the outbound pipeline only have to poll one channel. It also directly pipes OutboundMessages to the messaging protocol.
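
To illustrate the reasoning above, here is a minimal sketch of the failure condition and of the simplified shape. It is not the code from this PR: it uses blocking `std::sync::mpsc` channels as a stand-in for the async channels in the comms crate, and `OutboundMessage`, `blocked_tasks_two_channel` and `pipe_directly` are hypothetical names chosen for the example.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Hypothetical stand-in for an outbound message (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct OutboundMessage(pub u32);

/// Models the two-channel shape: every outbound task has to push onto a second
/// bounded channel that only the pipeline loop drains. Here the loop is assumed
/// to be busy (it never reads `_rx`), so once the channel is full every further
/// task would block. Returns how many of `max_tasks` tasks would be stuck.
pub fn blocked_tasks_two_channel(max_tasks: usize, capacity: usize) -> usize {
    // `_rx` is kept alive so a failed send reports `Full`, not `Disconnected`.
    let (tx, _rx) = sync_channel::<OutboundMessage>(capacity);

    // Fill the channel to capacity; nothing is draining it.
    for i in 0..capacity {
        tx.try_send(OutboundMessage(i as u32))
            .expect("channel still has free capacity");
    }

    // Each task now attempts its send. `try_send` is used instead of `send`
    // so the example reports the condition rather than actually hanging.
    (0..max_tasks)
        .filter(|_| matches!(tx.try_send(OutboundMessage(0)), Err(TrySendError::Full(_))))
        .count()
}

/// Models the simplified shape: messages are piped over a single bounded channel
/// to a consumer that drains it independently of the producer, so a full channel
/// only applies backpressure and cannot stall both sides.
pub fn pipe_directly(messages: Vec<OutboundMessage>, capacity: usize) -> Vec<OutboundMessage> {
    let (tx, rx) = sync_channel::<OutboundMessage>(capacity);

    // The consumer plays the role of the messaging protocol.
    let consumer = thread::spawn(move || rx.iter().collect::<Vec<_>>());

    for msg in messages {
        // Blocks while the channel is full, but the consumer keeps making progress.
        tx.send(msg).expect("consumer stopped unexpectedly");
    }
    drop(tx); // Closing the sender lets the consumer's iterator finish.

    consumer.join().expect("consumer thread panicked")
}
```

With `max_tasks = 4`, `blocked_tasks_two_channel` reports that all four tasks would block, which is the precondition for the deadlock described above. `pipe_directly` delivers every message in order regardless of the channel capacity, because nothing on the consuming side waits on the producer.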

EDIT: I believe I've found the root cause. The connectivity manager would rarely "lock up" causing the pipelines to lock up (both pipelines require calls to connectivity manager). I traced this in the logs and found that the last thing the connectivity manager does is resolve a tie break before locking up. This involves disconnecting one of the peer connections, and it appeared this future, extremely rarely, did not resolve. Digging deeper from there, I was able to track down a flaw in the substream close procedure, write a test that reproduces it and make a fix.

How Has This Been Tested?

Ran a number of ~1000-2000 tx stress tests and left base nodes running overnight (none of these are conclusive, but no issues were encountered).

@sdbondi sdbondi force-pushed the comms-messaging-simplify-outbound-pipeline branch from 0707845 to f16051b Compare September 14, 2022 04:34
hansieodendaal previously approved these changes Sep 14, 2022

@hansieodendaal hansieodendaal left a comment

LGTM, I like the simplified approach.

(I am going to run a biggish stress test with this code, maybe it will solve #4637)

Review comment on comms/core/src/pipeline/outbound.rs (outdated, resolved)
@sdbondi sdbondi force-pushed the comms-messaging-simplify-outbound-pipeline branch 3 times, most recently from dd26b9d to 29e2e1f Compare September 15, 2022 05:16
@sdbondi sdbondi changed the title fix(comms/pipeline): simplify and remove possibility of deadlock fix(comms): simplify and remove possibility of deadlock from pipelines and substream close Sep 15, 2022
@sdbondi sdbondi force-pushed the comms-messaging-simplify-outbound-pipeline branch from 29e2e1f to 739b881 Compare September 15, 2022 06:28
@sdbondi sdbondi force-pushed the comms-messaging-simplify-outbound-pipeline branch from 739b881 to 91bfcb2 Compare September 15, 2022 06:37
@stringhandler stringhandler merged commit f41bcf9 into tari-project:development Sep 15, 2022
@sdbondi sdbondi deleted the comms-messaging-simplify-outbound-pipeline branch September 15, 2022 09:29
sdbondi added a commit to sdbondi/tari that referenced this pull request Sep 16, 2022
* development: (72 commits)
  fix: reinsert transactions from failed block (tari-project#4675)
  fix: stray clippy error (tari-project#4685)
  fix(wallet): mark mined_height as null when pending outputs are cancelled (tari-project#4686)
  chore: updated dependancies (tari-project#4684)
  fix(p2p): remove DETACH flag usage (tari-project#4682)
  fix(comms): simplify and remove possibility of deadlock from pipelines and substream close (tari-project#4676)
  feat(ci): add default CI and FFI testing with custom dispatch (tari-project#4672)
  chore: remove broken test (tari-project#4678)
  fix: fix potential race condition between add_block and sync (tari-project#4677)
  fix deadlock (tari-project#4674)
  fix: add burn funds command to console wallet (see issue tari-project#4547) (tari-project#4655)
  v0.38.3
  fix: fee estimate (tari-project#4656)
  fix(comms/messaging): fix possible deadlock in outbound pipeline (tari-project#4657)
  fix: replace Luhn checksum with DammSum (tari-project#4639)
  fix(core/sync): handle deadline timeouts by changing peer (tari-project#4649)
  fix(ci): libtor build on Ubuntu (tari-project#4644)
  chore: fix log (tari-project#4634)
  v0.38.2
  fix(comms/rpc): detect early close in all cases (tari-project#4647)
  ...
3 participants