Failsafes to prevent a consensus round from taking too long #5277

Open
wants to merge 8 commits into
base: develop

ximinez
Collaborator

@ximinez ximinez commented Feb 5, 2025

High Level Overview of Change

This PR, if merged, introduces two failsafes into the consensus logic to prevent a consensus round from remaining open indefinitely.

  1. Currently, if a disputed transaction remains disputed for at least 2x the time of the previous consensus round, the percentage of UNL validators required to vote "yes" to keep it in the set rises to 95%. This PR adds two additional cutoffs:
    1. If the transaction remains disputed for 4x the previous round, the percentage rises to 100%.
    2. Further, while it should be impossible, if the dispute remains unresolved for 5x, every node changes its vote to "no".
  2. Additionally, if the round as a whole takes more than 10x the time of the previous round (bounded just in case), then the round is considered "expired", and the node will leave the round, send a "partial validation" (indicating that the node is moving on without validating), and start the next round. When enough nodes leave the round, any remaining nodes will see they've fallen behind, and move on, too, generally before hitting the timeout. Any validations or partial validations sent during this time will help the consensus process bring the nodes back together.
    • The 10x time is bounded below by ledgerMAX_CONSENSUS (15 seconds) and above by ledgerABANDON_CONSENSUS (60 seconds). The lower bound prevents an unusually fast consensus round from being punished into aborting unusually early on the next round, and the upper bound prevents the potential round time from growing without bound. For example, if one round takes 60 seconds, we don't want to let the next round run for 10 minutes.
    • There was discussion of adding a random factor into whether the node decides to leave the round. I decided against that for now because there's already a lot of variation in consensus round times, and magnifying that by 10 seemed good enough. Let me know if you disagree.
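
As a rough illustration of the two failsafes listed above, the timing rules can be modeled as a few small functions. This is a standalone sketch, not the code in this PR. The names DisputeStage, disputeStage, updatedVote, expiryTimeout and roundExpired are hypothetical, earlyThreshold is a placeholder for whatever threshold applies before the 2x mark, and the choice of strict versus non-strict comparisons is an assumption.

```cpp
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical model of the dispute stages described in the PR text.
enum class DisputeStage { Normal, Stuck, Unanimous, ForceNo };

// Classify a dispute by how long it has lasted relative to the previous round.
constexpr DisputeStage
disputeStage(milliseconds disputed, milliseconds previousRound)
{
    if (disputed >= 5 * previousRound)
        return DisputeStage::ForceNo;    // 5x: every node votes "no"
    if (disputed >= 4 * previousRound)
        return DisputeStage::Unanimous;  // 4x: 100% "yes" required
    if (disputed >= 2 * previousRound)
        return DisputeStage::Stuck;      // 2x: 95% "yes" required
    return DisputeStage::Normal;
}

// New vote for a node, given the percentage (0..100) of "yes" votes it sees.
// earlyThreshold is an assumed stand-in for the pre-2x threshold.
constexpr bool
updatedVote(int yesPercent, DisputeStage stage, int earlyThreshold)
{
    switch (stage)
    {
        case DisputeStage::Normal:
            return yesPercent > earlyThreshold;
        case DisputeStage::Stuck:
            return yesPercent > 95;
        case DisputeStage::Unanimous:
            return yesPercent >= 100;
        case DisputeStage::ForceNo:
            return false;
    }
    return false;
}

// Bounds taken from the PR text: 15 seconds and 60 seconds.
constexpr milliseconds ledgerMaxConsensus{15000};
constexpr milliseconds ledgerAbandonConsensus{60000};

// 10x the previous round time, clamped to [15 s, 60 s].
constexpr milliseconds
expiryTimeout(milliseconds previousRound)
{
    const milliseconds scaled = 10 * previousRound;
    if (scaled < ledgerMaxConsensus)
        return ledgerMaxConsensus;
    if (scaled > ledgerAbandonConsensus)
        return ledgerAbandonConsensus;
    return scaled;
}

// The round is "expired" once it has run longer than the bounded timeout.
constexpr bool
roundExpired(milliseconds roundElapsed, milliseconds previousRound)
{
    return roundElapsed > expiryTimeout(previousRound);
}
```

Under this model, with a previous round of 3 seconds, a dispute reaches the 95% stage at 6 seconds, the 100% stage at 12 seconds, and is voted "no" at 15 seconds. The round as a whole would expire after 30 seconds.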

Context of Change

At about 9:54pm UTC on February 4, 2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.

The current evidence indicates that two things happened.

  1. Some disputed transactions had just enough "yes" votes that validators voting "yes" saw the approval as just over 95%, while those voting "no" saw the approval as just under 95%. Thus, every node thought that it was doing the right thing, and no nodes changed their vote. While this is annoying, consensus will normally move on anyway, because at least 80% of the UNL validators will agree on which transaction set to use. However,
  2. The disputed transactions with the close approval rates were distributed such that there were several clumps of validators voting yes for different transactions than other clumps of validators. This led to a situation where no transaction set had 80% approval.

This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
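
To make the second point concrete, here is a minimal sketch of the "no transaction set has 80% approval" condition. It is a simplified model and not rippled code: the function name anySetReachesQuorum is hypothetical, each array entry is assumed to be the number of UNL validators proposing one distinct transaction set, and the counts in the usage note are invented for illustration rather than taken from the incident.

```cpp
#include <array>
#include <cstddef>

// Returns true if any single transaction set is backed by at least
// quorumPercent of all validators counted in `supporters`.
template <std::size_t N>
constexpr bool
anySetReachesQuorum(
    std::array<int, N> const& supporters,
    int quorumPercent = 80)
{
    int total = 0;
    for (std::size_t i = 0; i < N; ++i)
        total += supporters[i];
    if (total == 0)
        return false;

    for (std::size_t i = 0; i < N; ++i)
    {
        // Integer form of supporters[i] / total >= quorumPercent / 100.
        if (supporters[i] * 100 >= quorumPercent * total)
            return true;
    }
    return false;
}
```

For example, if 35 validators split 14 / 11 / 10 across three different sets, the largest clump has only 40% and the function returns false. A 28 / 7 split gives exactly 80% and returns true.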

This decision algorithm has been in place for at least 8 years, and possibly since the first release of rippled. The odds of it happening were thought to be 0, but it turns out they're just very, very small.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

This change is fully backward and forward compatible, and does not require an amendment.


codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 62.50000% with 6 lines in your changes missing coverage. Please review.

Project coverage is 78.1%. Comparing base (02387fd) to head (7c5822b).

Files with missing lines           Patch %   Lines
src/xrpld/consensus/Consensus.h    25.0%     3 Missing ⚠️
src/xrpld/consensus/DisputedTx.h   25.0%     3 Missing ⚠️
Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #5277   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          790     790           
  Lines        67607   67623   +16     
  Branches      8164    8167    +3     
=======================================
+ Hits         52828   52842   +14     
- Misses       14779   14781    +2     
Files with missing lines               Coverage Δ
src/xrpld/consensus/Consensus.cpp      98.5% <100.0%> (+0.1%) ⬆️
src/xrpld/consensus/ConsensusParms.h   100.0% <100.0%> (ø)
src/xrpld/consensus/ConsensusTypes.h   74.4% <ø> (ø)
src/xrpld/consensus/Consensus.h        88.5% <25.0%> (-0.5%) ⬇️
src/xrpld/consensus/DisputedTx.h       92.3% <25.0%> (-3.6%) ⬇️

... and 3 files with indirect coverage changes

@ximinez ximinez changed the title Drop out of consensus if the round takes too long Failsafes to prevent a consensus round from taking too long Feb 5, 2025
@ximinez ximinez requested review from Bronek, JoelKatz and vlntb February 5, 2025 19:01
        newPosition = weight > p.avSTUCK_CONSENSUS_PCT;
    else
        newPosition = false;
Collaborator

This is so simple that it's obviously correct.

Collaborator Author

This has been rewritten a bit

Collaborator

Still, the ending newPosition = false; remained, and I like that.

@@ -181,6 +181,12 @@ checkConsensus(
        return ConsensusState::MovedOn;
    }

    if (currentAgreeTime > parms.ledgerMAX_CONSENSUS + previousAgreeTime)
Collaborator

I think that because of this condition here, we are unable to test the change in DisputedTx.h - can you engineer timings such that we will test the last newPosition = false in DisputedTx::updateVote as well ?

Collaborator Author

> I think that because of this condition here, we are unable to test the change in DisputedTx.h - can you engineer timings such that we will test the last newPosition = false in DisputedTx::updateVote as well ?

This has been revised, too.

… long

- Require 100% agreement if the round takes 4x
- Change all votes to "no" if the round takes 5x
@ximinez ximinez marked this pull request as ready for review February 5, 2025 23:16
@ximinez ximinez requested a review from Bronek February 6, 2025 00:02
* upstream/develop:
  Amendment `fixFrozenLPTokenTransfer` (5227)
  Improve git commit hash lookup (5225)
Collaborator

@Bronek Bronek left a comment

Would be cool to have a unit test for the last part of DisputedTx.h ; not sure how realistic that request is. Approved in any case.

* upstream/develop:
  Updates Conan dependencies (5256)
* upstream/develop:
  fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275)
  fix: issues in `simulate` RPC (5265)
2 participants