fix: Properly handling stalls due to 1-height differences #3175
Conversation
Since the diff against Michael's PR is not shown, I have indicated all the places where I deviated from his original code.
msg_type = msg.enum if msg.enum != 'Routed' else msg.Routed.body.enum
logging.info(
    f'NODE {self.ordinal} blocking message {msg_type} from {fr} to {to}')
Compared to Michael's original PR, I changed the logging here to include the message type.
_, cur_height = get_recent_hash(nodes[-1], 30)
if cur_height == last_height and time.time() - last_time_height_updated > 10:
    time.sleep(25)
else:
    last_height = cur_height
    last_time_height_updated = time.time()
    time.sleep(5)
This is also different from the original PR, in which it was a constant 9s sleep.
If the cluster gets split into two halves with a 1-height difference, it takes 8-9 seconds to detect the stall and then some time to recover, so we sleep longer when the cluster appears stalled.
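For context, a minimal sketch of how such a stall-aware polling loop could be structured; get_recent_hash and nodes mirror the snippet above, while monitor_height and stop_event are hypothetical and not part of the actual test:

import time

def monitor_height(nodes, get_recent_hash, stop_event):
    # Hypothetical wrapper around the loop body shown above: track the last
    # observed height and when it last changed, and back off harder once the
    # chain appears stalled (no progress for more than 10 seconds).
    last_height = -1
    last_time_height_updated = time.time()
    while not stop_event.is_set():
        _, cur_height = get_recent_hash(nodes[-1], 30)
        if cur_height == last_height and time.time() - last_time_height_updated > 10:
            # Likely stalled: detection plus recovery takes roughly 8-9s,
            # so give the cluster extra time before polling again.
            time.sleep(25)
        else:
            last_height = cur_height
            last_time_height_updated = time.time()
            time.sleep(5)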
proxy = RejectListProxy(reject_list)
expect_network_issues()
block_timeout += 40
balances_timeout += 20
Also added 20s extra to wait for transactions to apply when the network is being messed with.
I am somewhat confused. You removed NoSyncSeveralBlocksBehind, but there seems to be no replacement mechanism to start syncing when a node is one block behind. I understand that it will receive the broadcasted head from other peers, but that won't initiate syncing.
@@ -52,19 +52,12 @@ pub const MAX_PENDING_PART: u64 = MAX_STATE_PART_REQUEST * 10000;
pub const NS_PER_SECOND: u128 = 1_000_000_000;

/// Get random peer from the hightest height peers.
pub fn highest_height_peer(
    highest_height_peers: &Vec<FullPeerInfo>,
    min_height: BlockHeight,
why is this removed?
It was a workaround to make sure syncing doesn't end prematurely (in the old approach, once we started syncing, the code could choose a peer that is, like us, one height behind and immediately stop syncing).
Now that we don't care about being one height behind, we don't need this logic anymore.
They don't need to sync if they are one block behind. Once the head is broadcast from the nodes that are 1 height ahead, it will be accepted by the nodes that are 1 block behind, making them in sync, so syncing is not needed.
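A hedged sketch of that reasoning; should_start_sync, our_height, and peer_highest_height are illustrative names only, not the actual nearcore API:

def should_start_sync(our_height, peer_highest_height):
    # Illustrative only: a node that is exactly one block behind does not need
    # to start syncing, because accepting the broadcasted head from a peer that
    # is one height ahead already brings it back in sync. Only a larger gap
    # warrants kicking off the sync machinery.
    return peer_highest_height > our_height + 1

assert should_start_sync(100, 101) is False   # one behind: head broadcast is enough
assert should_start_sync(100, 105) is True    # several behind: sync is needed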
Codecov Report

@@           Coverage Diff           @@
##           master    #3175   +/-   ##
=======================================
  Coverage   87.74%   87.74%
=======================================
  Files         212      212
  Lines       42409    42409
=======================================
  Hits        37211    37211
  Misses       5198     5198

Continue to review full report at Codecov.
Force-pushed from a9de901 to e9cd3ed
Reposting my comment for posterity: actually I think we should check the node's sync status, because during syncing we can be in a situation where the head doesn't move for a long time, and you don't want it to broadcast an old head.
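A rough sketch of the resulting rule under the assumption above; all names here, such as HeadRebroadcaster and is_syncing, are hypothetical and only illustrate the idea:

import time

class HeadRebroadcaster:
    # Hypothetical helper: re-broadcast our head if the chain made no progress
    # for `stall_timeout` seconds, but only when the node is not syncing, since
    # during sync the head can legitimately stay still for a long time and we
    # do not want to spam peers with a stale head.
    def __init__(self, stall_timeout=10.0):
        self.stall_timeout = stall_timeout
        self.last_height = None
        self.last_progress = time.time()

    def should_rebroadcast(self, head_height, is_syncing):
        now = time.time()
        if head_height != self.last_height:
            # The head moved, so there is no stall; just record the progress.
            self.last_height = head_height
            self.last_progress = now
            return False
        return not is_syncing and now - self.last_progress > self.stall_timeout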
This change builds on top of Michael's changes that introduce local_network mode to stress.py.

The nodes only learn about their peers' highest height via Block messages, and thus if a Block message is lost and two nodes are exactly one height away from each other, the one behind will never learn that it needs to catch up. If such a 1-height difference now splits the block producers into two roughly even halves, the system will stall.

Fixing it by re-broadcasting head if the network made no progress for some reasonable time.

With this change local_network mode passes in nayduck.

Test plan:
----------
Before this change stress.py local_network fails consistently due to chain stalling.
After: http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104
Out of five failures the first one is new (and needs debugging), but the chain is not stalled. The remaining four are #3169
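A purely illustrative sketch of the failure mode described above (no real nearcore code; heights and lost messages are simulated, and knows_it_is_behind is a made-up helper):

def knows_it_is_behind(own_height, heights_learned_from_block_messages):
    # A node only learns about peers' heights from Block messages it actually
    # received; if the only Block at own_height + 1 was dropped, the node sees
    # no evidence that it is behind and never starts catching up.
    return any(h > own_height for h in heights_learned_from_block_messages)

# Half A is at height 100, half B produced block 101, but the Block message
# carrying 101 was dropped on the way to half A.
assert knows_it_is_behind(100, [98, 99, 100]) is False   # stalled: A waits forever
# Re-broadcasting the head delivers height 101 out of band, unblocking A.
assert knows_it_is_behind(100, [98, 99, 100, 101]) is True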