fix: Properly handling stalls due to 1-height differences #3175

Merged · 3 commits · Aug 15, 2020

Conversation

SkidanovAlex
Collaborator

This change builds on top of Michael's changes that introduce local_network mode to stress.py.

The nodes only learn about their peers' highest height via Block messages, and thus if a Block message is lost and two nodes are exactly one height apart, the one behind will never learn that it needs to catch up.
If such a 1-height difference splits the block producers into two roughly even halves, the system will stall.

The fix is to re-broadcast the head if the network has made no progress for some reasonable amount of time.
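
A minimal sketch of the idea, using illustrative stand-in types and names rather than the actual nearcore client code:

```rust
use std::time::{Duration, Instant};

// Illustrative stand-ins for the client's head and network handle; these are
// not the actual nearcore types.
#[derive(Clone, Copy, PartialEq)]
struct BlockHeight(u64);

struct ChainHead {
    height: BlockHeight,
}

trait Network {
    fn broadcast_head(&self, head: &ChainHead);
}

// Tracks how long the head has been stuck and re-broadcasts it once the stall
// exceeds a threshold, so peers that missed the original Block message still
// learn our height.
struct HeadRebroadcaster {
    last_height: BlockHeight,
    last_progress: Instant,
    stall_threshold: Duration,
}

impl HeadRebroadcaster {
    fn new(head: &ChainHead, stall_threshold: Duration) -> Self {
        Self {
            last_height: head.height,
            last_progress: Instant::now(),
            stall_threshold,
        }
    }

    // Called periodically from the client's timer loop.
    fn check(&mut self, head: &ChainHead, network: &impl Network) {
        if head.height != self.last_height {
            // The chain moved; reset the stall timer.
            self.last_height = head.height;
            self.last_progress = Instant::now();
        } else if self.last_progress.elapsed() > self.stall_threshold {
            // No progress for a while: re-broadcast the head so that a peer
            // that is exactly one height behind learns it needs to catch up.
            network.broadcast_head(head);
            self.last_progress = Instant::now();
        }
    }
}
```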

With this change local_network mode passes in nayduck.

Test plan:

Before this change stress.py local_network fails consistently due to chain stalling.
After:
http://nayduck.eastus.cloudapp.azure.com:3000/#/run/104

Out of five failures the first one is new (and needs debugging), but the chain is not stalled.
The remaining four are #3169

Collaborator Author

@SkidanovAlex SkidanovAlex left a comment

Since the diff against Michael's PR is not shown, I have indicated all the places where I deviated from his original code.

Comment on lines +13 to +15
msg_type = msg.enum if msg.enum != 'Routed' else msg.Routed.body.enum
logging.info(
    f'NODE {self.ordinal} blocking message {msg_type} from {fr} to {to}')
Collaborator Author

Compared to Michael's original PR, I changed the logging here to include the message type.

Comment on lines +177 to +183
_, cur_height = get_recent_hash(nodes[-1], 30)
if cur_height == last_height and time.time() - last_time_height_updated > 10:
    # No progress for over 10 seconds: likely a stall, so give the nodes
    # extra time to detect it and recover before checking again.
    time.sleep(25)
else:
    last_height = cur_height
    last_time_height_updated = time.time()
    time.sleep(5)
Collaborator Author

@SkidanovAlex SkidanovAlex Aug 15, 2020

This is also different from the original PR, in which it was a constant 9s sleep.
If the cluster gets split into two halves with a 1-height difference, it takes 8-9 seconds to detect it and then some time to recover, so we sleep longer when the cluster stalls.

proxy = RejectListProxy(reject_list)
expect_network_issues()
block_timeout += 40
balances_timeout += 20
Collaborator Author

Also added an extra 20s to wait for transactions to apply when the network is being interfered with.

Collaborator

@bowenwang1996 bowenwang1996 left a comment

I am somewhat confused. You removed NoSyncSeveralBlocksBehind but there seems to be no replacement mechanism to start syncing when the node is one block behind. I understand that they will receive the broadcasted head from other peers, but that won't initiate syncing.

@@ -52,19 +52,12 @@ pub const MAX_PENDING_PART: u64 = MAX_STATE_PART_REQUEST * 10000;
pub const NS_PER_SECOND: u128 = 1_000_000_000;

/// Get random peer from the hightest height peers.
pub fn highest_height_peer(
highest_height_peers: &Vec<FullPeerInfo>,
min_height: BlockHeight,
Collaborator

why is this removed?

Collaborator Author

@SkidanovAlex SkidanovAlex Aug 15, 2020

It was a workaround to make sure syncing doesn't end prematurely (in the old approach, once we started syncing, the code could choose a peer that is, like us, one height behind and immediately stop syncing).
Now that we don't care about being one height behind, we don't need this logic anymore.
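
For illustration only: with the min_height workaround gone, the selection reduces to a random pick from the already-collected highest-height peers. A sketch with simplified types (the real FullPeerInfo carries much more state, and the exact signature here is an assumption):

```rust
use rand::seq::SliceRandom;

// Simplified stand-in; the real FullPeerInfo carries much more state.
#[derive(Clone)]
struct FullPeerInfo {
    peer_id: u64,
    height: u64,
}

// With the min_height workaround removed, selection is just a random pick
// from the already-filtered highest-height peers.
fn highest_height_peer(highest_height_peers: &[FullPeerInfo]) -> Option<FullPeerInfo> {
    highest_height_peers.choose(&mut rand::thread_rng()).cloned()
}
```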

@SkidanovAlex
Collaborator Author

SkidanovAlex commented Aug 15, 2020

They don't need to sync if they are one block behind. Once the head is broadcast from the nodes that are 1 height ahead, it will be accepted by the nodes that are 1 block behind, making them in sync, and thus syncing is not needed.
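
A toy illustration of that argument; the names and threshold are made up for the example, not nearcore's actual sync condition:

```rust
// A node only needs to enter sync if it has fallen further behind than some
// threshold; a block that builds directly on its head is simply accepted.
fn needs_sync(our_head_height: u64, peer_head_height: u64, sync_threshold: u64) -> bool {
    peer_head_height > our_head_height + sync_threshold
}

fn main() {
    // One block behind: the re-broadcast head is accepted directly, no sync.
    assert!(!needs_sync(10, 11, 5));
    // Many blocks behind: the node still falls back to syncing.
    assert!(needs_sync(10, 50, 5));
}
```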

@codecov

codecov bot commented Aug 15, 2020

Codecov Report

Merging #3175 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3175   +/-   ##
=======================================
  Coverage   87.74%   87.74%           
=======================================
  Files         212      212           
  Lines       42409    42409           
=======================================
  Hits        37211    37211           
  Misses       5198     5198           


Collaborator

@bowenwang1996 bowenwang1996 left a comment


Reposting my comment for posterity: actually, I think we should check the node's sync status, because during syncing we can be in a situation where the head doesn't move for a long time, and you don't want it to broadcast an old head.
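
A minimal sketch of the suggested guard, with assumed names: skip the stall re-broadcast while the node itself is still syncing.

```rust
use std::time::{Duration, Instant};

// Sketch of the suggested check: a syncing node's head can legitimately sit
// still for a long time, so only treat a stalled head as a reason to
// re-broadcast when the node is not syncing. Names are illustrative.
fn should_rebroadcast_head(
    is_syncing: bool,
    last_progress: Instant,
    stall_threshold: Duration,
) -> bool {
    !is_syncing && last_progress.elapsed() > stall_threshold
}
```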
