fix: Properly handling stalls due to 1-height differences #3175

Merged · 3 commits · Aug 15, 2020

Conversation

SkidanovAlex
Collaborator

This change builds on top of Michael's changes that introduce local_network mode to stress.py.

The nodes only learn about their peers' highest height via Block messages, and thus if a Block message is lost and two nodes are exactly one height apart, the one behind will never learn that it needs to catch up.
If such a 1-height difference splits the block producers into two roughly even halves, the system will stall.

The fix is to re-broadcast the head if the network has made no progress for some reasonable amount of time.
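
A minimal sketch of the idea, using illustrative stand-in types and names rather than the actual nearcore client code:

```rust
use std::time::{Duration, Instant};

// Illustrative stand-ins for the client's head and network handle; these are
// not the actual nearcore types.
#[derive(Clone, Copy, PartialEq)]
struct BlockHeight(u64);

struct ChainHead {
    height: BlockHeight,
}

trait Network {
    fn broadcast_head(&self, head: &ChainHead);
}

// Tracks how long the head has been stuck and re-broadcasts it once the stall
// exceeds a threshold, so peers that missed the original Block message still
// learn our height.
struct HeadRebroadcaster {
    last_height: BlockHeight,
    last_progress: Instant,
    stall_threshold: Duration,
}

impl HeadRebroadcaster {
    fn new(head: &ChainHead, stall_threshold: Duration) -> Self {
        Self {
            last_height: head.height,
            last_progress: Instant::now(),
            stall_threshold,
        }
    }

    // Called periodically from the client's timer loop.
    fn check(&mut self, head: &ChainHead, network: &impl Network) {
        if head.height != self.last_height {
            // The chain moved; reset the stall timer.
            self.last_height = head.height;
            self.last_progress = Instant::now();
        } else if self.last_progress.elapsed() > self.stall_threshold {
            // No progress for a while: re-broadcast the head so that a peer
            // that is exactly one height behind learns it needs to catch up.
            network.broadcast_head(head);
            self.last_progress = Instant::now();
        }
    }
}
```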

With this change local_network mode passes in nayduck.

Test plan:

Before this change stress.py local_network fails consistently due to chain stalling.
After:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104

Out of five failures the first one is new (and needs debugging), but the chain is not stalled.
The remaining four are #3169

Collaborator Author

@SkidanovAlex SkidanovAlex left a comment

Since the diff against Michael's PR is not shown, I have indicated all the places where I deviated from his original code.

Comment on lines +13 to +15
msg_type = msg.enum if msg.enum != 'Routed' else msg.Routed.body.enum
logging.info(
    f'NODE {self.ordinal} blocking message {msg_type} from {fr} to {to}')
Collaborator Author

Compared to Michael's original PR, I changed the logging here to include the message type.

Comment on lines +177 to +183
_, cur_height = get_recent_hash(nodes[-1], 30)
if cur_height == last_height and time.time() - last_time_height_updated > 10:
    # No progress for over 10 seconds: likely a stall, so give the nodes
    # extra time to detect it and recover before checking again.
    time.sleep(25)
else:
    last_height = cur_height
    last_time_height_updated = time.time()
    time.sleep(5)
Collaborator Author

@SkidanovAlex SkidanovAlex Aug 15, 2020

This is also different from the original PR, in which it was a constant 9s sleep.
If the cluster gets split into two halves with a 1-height difference, it takes 8-9 seconds to detect it and then some time to recover, so we sleep longer when the cluster stalls.

proxy = RejectListProxy(reject_list)
expect_network_issues()
block_timeout += 40
balances_timeout += 20
Collaborator Author

Also added an extra 20s to wait for transactions to apply when the network is being interfered with.

Collaborator

@bowenwang1996 bowenwang1996 left a comment

I am somewhat confused. You removed NoSyncSeveralBlocksBehind but there seems to be no replacement mechanism to start syncing when the node is one block behind. I understand that they will receive the broadcasted head from other peers, but that won't initiate syncing.

@@ -52,19 +52,12 @@ pub const MAX_PENDING_PART: u64 = MAX_STATE_PART_REQUEST * 10000;
pub const NS_PER_SECOND: u128 = 1_000_000_000;

/// Get random peer from the hightest height peers.
pub fn highest_height_peer(
highest_height_peers: &Vec<FullPeerInfo>,
min_height: BlockHeight,
Collaborator

why is this removed?

Collaborator Author

@SkidanovAlex SkidanovAlex Aug 15, 2020

It was a workaround to make sure syncing doesn't end prematurely (in the old approach, once we started syncing, the code could choose a peer that is, like us, one height behind and immediately stop syncing).
Now that we don't care about being one height behind, we don't need this logic anymore.
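
For illustration only: with the min_height workaround gone, the selection reduces to a random pick from the already-collected highest-height peers. A sketch with simplified types (the real FullPeerInfo carries much more state, and the exact signature here is an assumption):

```rust
use rand::seq::SliceRandom;

// Simplified stand-in; the real FullPeerInfo carries much more state.
#[derive(Clone)]
struct FullPeerInfo {
    peer_id: u64,
    height: u64,
}

// With the min_height workaround removed, selection is just a random pick
// from the already-filtered highest-height peers.
fn highest_height_peer(highest_height_peers: &[FullPeerInfo]) -> Option<FullPeerInfo> {
    highest_height_peers.choose(&mut rand::thread_rng()).cloned()
}
```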

@SkidanovAlex
Collaborator Author

SkidanovAlex commented Aug 15, 2020

They don't need to sync if they are one block behind. Once the head is broadcast from the nodes that are 1 height ahead, it will be accepted by the nodes that are 1 block behind, making them in sync, and thus syncing is not needed.
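
A toy illustration of that argument; the names and threshold are made up for the example, not nearcore's actual sync condition:

```rust
// A node only needs to enter sync if it has fallen further behind than some
// threshold; a block that builds directly on its head is simply accepted.
fn needs_sync(our_head_height: u64, peer_head_height: u64, sync_threshold: u64) -> bool {
    peer_head_height > our_head_height + sync_threshold
}

fn main() {
    // One block behind: the re-broadcast head is accepted directly, no sync.
    assert!(!needs_sync(10, 11, 5));
    // Many blocks behind: the node still falls back to syncing.
    assert!(needs_sync(10, 50, 5));
}
```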

@codecov

codecov bot commented Aug 15, 2020

Codecov Report

Merging #3175 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3175   +/-   ##
=======================================
  Coverage   87.74%   87.74%           
=======================================
  Files         212      212           
  Lines       42409    42409           
=======================================
  Hits        37211    37211           
  Misses       5198     5198           


Collaborator

@bowenwang1996 bowenwang1996 left a comment


Reposting my comment for posterity: actually, I think we should check the node's sync status, because during syncing we can be in a situation where the head doesn't move for a long time, and you don't want it to broadcast an old head.
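
A minimal sketch of the suggested guard, with assumed names: skip the stall re-broadcast while the node itself is still syncing.

```rust
use std::time::{Duration, Instant};

// Sketch of the suggested check: a syncing node's head can legitimately sit
// still for a long time, so only treat a stalled head as a reason to
// re-broadcast when the node is not syncing. Names are illustrative.
fn should_rebroadcast_head(
    is_syncing: bool,
    last_progress: Instant,
    stall_threshold: Duration,
) -> bool {
    !is_syncing && last_progress.elapsed() > stall_threshold
}
```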
