Detected Edge Cases

If a leader is not fully synced when taking over after a view change, it may use an outdated blockHash or public key set. We should verify that, upon a view change, the leader's block and view state are fully updated before it starts processing PREPARE messages, and check for race conditions or delays in leader sync, especially after view changes.
Another point: since the leader processes all PREPARE messages from validators, it's possible that only the leader encounters certain edge cases when recvMsg.Payload contains unexpected data. We could add logging to see whether a particular payload is triggering these edge cases.
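As a minimal sketch of the logging idea (the function name, sender argument, and the minimum-length constant below are hypothetical illustrations, not the actual consensus code), the leader could validate incoming PREPARE payloads and log which sender/payload triggered a rejection:

```go
package main

import "fmt"

// minPreparePayloadLen is a hypothetical lower bound: a PREPARE payload is
// assumed to carry at least a 32-byte blockHash before any signature data.
const minPreparePayloadLen = 32

// validatePreparePayload rejects obviously malformed payloads and surfaces
// enough context in the error to log which sender/payload triggered it.
func validatePreparePayload(senderKey string, payload []byte) error {
	if len(payload) < minPreparePayloadLen {
		return fmt.Errorf("prepare payload from %s too short: %d bytes", senderKey, len(payload))
	}
	return nil
}

func main() {
	if err := validatePreparePayload("0xabc", []byte{0x01}); err != nil {
		// In the real leader loop this would be a structured log line;
		// here we just print the error.
		fmt.Println("edge case detected:", err)
	}
}
```

Running such a check before the payload is parsed would narrow down whether a specific validator's messages are the trigger.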
In rotateLeader, we find the committee like this:
If the validator is not fully synced, especially during a view change (even a one-block difference), it detects the wrong epoch, which can cause it to compute the wrong leader.
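A simple guard could make this precondition explicit. The helper below is a hypothetical sketch (not an existing function in the codebase): before rotateLeader derives the epoch and committee, the node should be at or beyond the block height the view change refers to, since even a one-block lag can resolve to the wrong epoch.

```go
package main

import "fmt"

// syncedForViewChange reports whether the node's local chain height has
// reached the height the view change refers to. Picking a leader while even
// one block behind risks resolving the wrong epoch and committee.
func syncedForViewChange(localHeight, viewChangeHeight uint64) bool {
	return localHeight >= viewChangeHeight
}

func main() {
	// One block behind: not safe to derive the committee yet.
	fmt.Println(syncedForViewChange(99, 100))
}
```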
There is a variable called IgnoreViewIDCheck. This is the comment for this variable:
in a very rare case, when an M1 view change happened, the block contains the coinbase for the last leader, but the new leader is the one actually recognized by most of the nodes. At this time, if a node syncs to this exact block and sets its leader from it, it will set the failed leader from the coinbase of the block.
This is a very rare scenario and not likely to cause any issue on mainnet, but we need to think about a solution for this case, because the coinbase of the latest block doesn't really represent the real current leader after an M1 view change.
This indicates that, in rare instances, the coinbase of a block may not correctly represent the actual current leader. As a result, ensuring that the leader rotation mechanism accurately reflects the active leader in all network states is essential.
The leader rotation mechanism includes a timeout for the next leader to produce and propagate a block. However, if the new leader has poor connectivity, it may delay the block’s propagation across the network. In this scenario, some nodes might receive the block and continue with consensus, while others that haven’t yet received it will trigger another leader rotation, waiting for the subsequent leader to produce a block. This can lead to network desynchronization, where part of the network moves forward while other nodes remain waiting, impacting overall consensus stability and efficiency.
Unaddressed Edge Cases in Leader Rotation Calculations
In the code for leader rotation, several corner cases could lead to inefficiencies or even errors, particularly around how the next leader is selected:
Skipping the Validators: When next is greater than zero, there's a possibility of inadvertently skipping some validators in the list if the idx variable, after wrapping around from a multi-BLS-key validator, fails to account for them sequentially. This can lead to unfair rotation, as an eligible validator might never be selected. For example, if next=1, the first validator has 4 BLS keys, and the second validator has only 1 key, the second validator won't be selected because idx would take the values 0, 1, 3, 5, which never lands on the validator's key at index 4 (idx starts from 0).
Potential Crash When pubKey Is Not Found: If the specified pubKey isn’t present in publicKeys, idx will be -1. In cases where next is zero, this immediately results in attempting to access publicKeys[idx], which can crash the program due to an invalid index. This is especially problematic if next is set to zero deliberately, as it doesn’t provide a safe fallback when pubKey is absent in the list.
Repeated Selection of the Same Leader: When no unique validator is found, the original function simply returns the pubKey it was given as the next leader, potentially resulting in multiple consecutive selections of the same validator if conditions don’t allow for finding a new one. This creates a bottleneck, as the same leader might be continuously reselected even if a rotation is intended.
Blocks Produced Check: The function uses numBlocksProducedByLeader to determine if the current leader has produced enough blocks before rotation. If s.Count (the number of blocks produced by the leader) is less than this value, the function keeps the same leader. Any inconsistency in s.Count could prevent rotation from occurring as expected, potentially leading to the current leader holding the role longer than intended.
Block Header Access: The function retrieves headers of recent blocks (bc.GetHeaderByNumber) to populate a mask with the bitmap of signatures from validators. If the function fails to retrieve the correct block headers (for instance, if they are missing or corrupted), it will return defaultKey, which might prevent expected leader rotation or lead to inconsistent leader selection.
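The first two corner cases above (multi-BLS-key skipping and the idx == -1 crash) can be illustrated with a corrected selection routine. This is a hedged sketch under assumed types; the key struct, nextLeader name, and owner field are illustrative, not the actual Harmony implementation. The idea is to step over distinct validators rather than raw key slots, and to return a not-found flag instead of indexing with -1:

```go
package main

import "fmt"

// key pairs a BLS key with the validator (owner) it belongs to; the names
// are illustrative, not the actual Harmony types.
type key struct {
	pub   string
	owner string
}

// nextLeader walks forward over distinct validators rather than raw key
// slots, so a validator holding multiple BLS keys occupies a single step,
// and it returns (-1, false) instead of indexing with -1 when cur is absent.
func nextLeader(keys []key, cur string, next int) (int, bool) {
	idx := -1
	for i, k := range keys {
		if k.pub == cur {
			idx = i
			break
		}
	}
	if idx < 0 {
		return -1, false // safe fallback: caller keeps the current leader
	}
	owner := keys[idx].owner
	for step := 0; step < next; step++ {
		advanced := false
		for hops := 0; hops < len(keys); hops++ {
			idx = (idx + 1) % len(keys)
			if keys[idx].owner != owner {
				owner = keys[idx].owner
				advanced = true
				break
			}
		}
		if !advanced {
			return -1, false // only one distinct validator in the list
		}
	}
	return idx, true
}

func main() {
	// The example from the text: validator A holds 4 BLS keys, validator B one.
	keys := []key{{"k0", "A"}, {"k1", "A"}, {"k2", "A"}, {"k3", "A"}, {"k4", "B"}}
	idx, ok := nextLeader(keys, "k0", 1)
	fmt.Println(idx, ok) // B's key at index 4 is reached instead of being skipped
}
```

With next=1 this lands on validator B's key at index 4, which the raw-slot stepping described above would skip.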
Questions
We should ensure that UpdatePublicKeys accurately updates the public keys immediately before selecting the next leader. It's important to review and double-check the code to verify that the public keys are current and correctly set for leader rotation.
Can we clarify the role of the found variable in the leader rotation code? It’s unclear if it solely indicates the presence of the original public key, or if it’s intended to confirm the successful calculation of the next leader. Also, is there a reason we aren’t checking pubKey != nil? We need to ensure that found specifically reflects whether the original key was found, independent of whether the next leader has been calculated successfully.
LeaderRotation Frequency
The frequency of leader rotation in blockchain networks can have a significant impact on both performance and stability. Rotating leaders, as some blockchains currently do, promotes decentralization and theoretically limits the impact of a poor leader. However, the rapid turnover also introduces substantial performance costs. Each new leader must synchronize and verify its state, ensure connectivity, and handle any network latency issues (all of which add overhead to each rotation). This can slow down the overall block production process, especially if a new leader has weak connectivity or high latency.
In contrast, networks like Solana rotate leaders less frequently: each leader is assigned four consecutive slots. This allows for smoother transitions and greater operational efficiency, as leaders can handle multiple blocks consecutively, reducing setup time and improving throughput. Solana's approach also leverages additional innovations like the Gulf Stream protocol, which pre-forwards transactions to the next leader, mitigating transition overhead and keeping block production fast and uninterrupted. In the Harmony network, the interval is hardcoded to 3 blocks per leader; since we don't have comparable mechanisms to smooth the transition, giving each leader a few more blocks to process would minimize frequent sync processes and allow the network to maintain high transaction throughput.
A reasonable leader rotation interval might be somewhere between 128 and 256 blocks (about 4.2 - 8.5 minutes) which would be enough for 128 to 256 leaders per epoch. This medium frequency would reduce the constant setup costs of changing leaders every block while ensuring that no single leader retains control for too long. Additionally, implementing a performance-based scoring system could allow the network to prioritize high-performance validators as leaders, enhancing stability and responsiveness during high-load periods. Such an approach balances decentralization with network performance, minimizing the disruptions of leader changes while maintaining resilience.
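The time estimates above follow directly from the block time. A quick check, assuming an approximately 2-second block time for Harmony mainnet:

```go
package main

import "fmt"

// rotationMinutes converts a rotation interval in blocks to minutes,
// assuming a fixed block time in seconds (~2s for Harmony mainnet).
func rotationMinutes(blocks, blockTimeSec int) float64 {
	return float64(blocks*blockTimeSec) / 60.0
}

func main() {
	fmt.Printf("%.1f\n", rotationMinutes(128, 2)) // 128 blocks ≈ 4.3 minutes
	fmt.Printf("%.1f\n", rotationMinutes(256, 2)) // 256 blocks ≈ 8.5 minutes
}
```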
Optimal Rotation Frequency Considerations
Given the need for balance, the best approach may be to focus on adaptive and data-driven strategies:
Dynamic Rotation Based on Leader Performance: Instead of fixed intervals, dynamically adjust the rotation interval based on each leader’s performance during their rotation term. For instance, keep a leader in place if they’re performing well, or rotate faster if performance deteriorates.
Network Condition-Based Rotation: In unstable networks, shorter rotations can mitigate the risk of downtime, while more stable networks may benefit from longer intervals.
Scoring Mechanism for Leader Readiness: A scoring system based on network metrics (latency, synchronization success rate, bandwidth, etc.) could determine the most optimal interval. Leaders with high scores could remain longer, while lower scores could trigger quicker rotations.
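The scoring idea above can be sketched as follows. The metric fields, weights, and interval thresholds are illustrative assumptions, not values from the Harmony codebase:

```go
package main

import "fmt"

// Metrics holds normalized (0..1) readiness signals for a candidate leader;
// the fields and their weights are illustrative assumptions.
type Metrics struct {
	LatencyOK float64 // 1.0 = low latency
	SyncRate  float64 // fraction of successful syncs
	Bandwidth float64 // normalized available bandwidth
}

// score combines the metrics with fixed example weights.
func score(m Metrics) float64 {
	return 0.4*m.LatencyOK + 0.4*m.SyncRate + 0.2*m.Bandwidth
}

// rotationInterval maps a readiness score to a term length in blocks:
// high-scoring leaders keep the role longer, low scores rotate quickly.
func rotationInterval(s float64) int {
	switch {
	case s >= 0.8:
		return 256
	case s >= 0.5:
		return 128
	default:
		return 16
	}
}

func main() {
	m := Metrics{LatencyOK: 0.9, SyncRate: 0.95, Bandwidth: 0.8}
	fmt.Println(rotationInterval(score(m))) // a well-performing leader gets a long term
}
```

In practice the weights and thresholds would need tuning against real network data, and the score would have to be computed identically by all nodes to keep leader selection deterministic.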