
fix(state_sync_massive): Epoch length was too short #3208

Merged
merged 11 commits into master on Sep 25, 2020

Conversation

mfornet
Member

@mfornet mfornet commented Aug 18, 2020

The epoch length was too short (20), so when the observer tried to sync
it used an old epoch; but currently, trying to sync from any epoch other
than the current or the previous one is considered malicious behavior.
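The restriction described above can be sketched as a standalone check. This is an illustrative helper, not nearcore's actual code; `is_allowed_sync_epoch` and the epoch-height arguments are hypothetical names:

```python
def is_allowed_sync_epoch(requested_epoch_height: int, current_epoch_height: int) -> bool:
    """A peer may request state sync only for the current or the previous
    epoch; requests for anything older are treated as malicious behavior."""
    return requested_epoch_height in (current_epoch_height, current_epoch_height - 1)

# Previous epoch: allowed.
assert is_allowed_sync_epoch(9, 10)
# Two or more epochs behind: rejected.
assert not is_allowed_sync_epoch(7, 10)
```

With epoch_length = 20, enough blocks are produced before the observer starts that its chosen epoch is already older than the previous one, so its sync requests get rejected.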

Fixes #3130

Test plan

state_sync_massive.py passes in nightly


@mfornet
Member Author

mfornet commented Aug 18, 2020

Waiting for nightly test to pass:
http://nayduck.eastus.cloudapp.azure.com:3000/#/test/33015

@bowenwang1996
Collaborator

@mfornet looks like it failed

@SkidanovAlex
Collaborator

@mfornet I think you also do not trigger state sync right now; I believe the run needs to be at least two epochs in for the observer to actually trigger state sync (so you probably need to change 101 and 201 to 201 and 301 respectively).
I would also assert, using the log tracker, that state sync was indeed triggered, the same way we do in state_sync.py.
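The suggested log-tracker assertion can be sketched in isolation. The real tests use a `LogTracker` helper from the pytest utilities; this standalone stand-in only mimics its substring check against a node's stderr log:

```python
import os
import tempfile

class LogTracker:
    """Tiny stand-in for the pytest LogTracker helper: remembers a node's
    log file and checks whether a given pattern has appeared in it."""
    def __init__(self, path: str):
        self.path = path

    def check(self, pattern: str) -> bool:
        with open(self.path) as f:
            return pattern in f.read()

# Simulate a node log that recorded a transition into state sync.
fd, log_path = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("INFO sync: transition to State Sync\n")

tracker = LogTracker(log_path)
assert tracker.check("transition to State Sync")       # state sync was triggered
assert not tracker.check("transition to Block Sync")   # pattern absent
os.remove(log_path)
```

The point of the assertion is that the test fails if the observer quietly block-synced instead of exercising the state-sync path under test.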

@codecov

codecov bot commented Aug 19, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f85e142).
The diff coverage is n/a.


@@            Coverage Diff            @@
##             master    #3208   +/-   ##
=========================================
  Coverage          ?   87.71%           
=========================================
  Files             ?      217           
  Lines             ?    44113           
  Branches          ?        0           
=========================================
  Hits              ?    38695           
  Misses            ?     5418           
  Partials          ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Collaborator

@SkidanovAlex SkidanovAlex left a comment


Given it's all green to be pushed, requesting changes until it actually passes in nayduck to prevent accidental merge-in.

@mfornet force-pushed the massive_state_sync_increase_epoch branch from d295c80 to 1646112 on September 2, 2020
@mfornet force-pushed the massive_state_sync_increase_epoch branch from 41f31ca to 19b5a79 on September 22, 2020
@mfornet
Member Author

mfornet commented Sep 22, 2020

This test performs really poorly on nayduck while passing locally. I've had to adjust the epoch size a few times, since the protocol relies on the assumption that a node will be able to sync in less than one epoch.

node0 logs
node1 logs

Most notably, on nayduck the nodes produced only 115 blocks in 20 minutes, which is a sign that something is going really wrong. I see in the logs from node0 that a lot of messages are being dropped (which is very weird in this setup with only 2 nodes). However, this should not be an issue anymore after #3088 lands.

What puzzles me is that it works every single time locally while failing on nayduck. The reason is probably a combination of a low-performance machine, a small epoch length, and a large state. Notice that in the run above node2 hadn't even started, so the issue is not with the new node syncing.

@SkidanovAlex @bowenwang1996 are you aware of any protocol mechanism that heavily depends on the epoch length being large enough, and otherwise defaults to a significant slowdown?
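The "sync within one epoch" assumption gives a rough lower bound on the epoch length in blocks: the epoch's wall-clock duration (epoch_length / block rate) must exceed the observer's sync time. As an illustration only, using the block rate observed on nayduck above and an assumed 300-second sync time (both the function name and the sync time are hypothetical):

```python
def min_epoch_length(blocks_per_second: float, sync_seconds: float) -> float:
    """Blocks an epoch must span so that its wall-clock duration
    (epoch_length / blocks_per_second) exceeds the observer's sync time."""
    return blocks_per_second * sync_seconds

# At a healthy ~1 block/s, a 300 s sync needs an epoch of at least 300 blocks,
# so the original epoch_length of 20 is far too short.
assert min_epoch_length(1.0, 300) == 300.0

# At the rate observed on nayduck (115 blocks in 20 minutes, ~10 s per block),
# even a short epoch lasts long in wall time -- but on such a slow machine the
# sync itself is slower too, so the bound has to be read with that in mind.
assert round(min_epoch_length(115 / 1200, 300), 2) == 28.75
```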

@bowenwang1996
Collaborator

@mfornet can you disable the storage check in the test? It looks like it is causing quite a bit of slowdown while the node is running.

@mfornet
Member Author

mfornet commented Sep 22, 2020

@mfornet can you disable storage check in the test? Looks like it is causing quite a bit of slowdown while the node is running

It passed nayduck after removing storage_check 💯
http://nayduck.eastus.cloudapp.azure.com:3000/#/test/33017

@SkidanovAlex we can unblock and merge this PR already

@bowenwang1996
Collaborator

It passed nayduck after removing storage_check 💯

Epic 🚀

@SkidanovAlex
Collaborator

Can we add the assert that the node actually state_synced (not block synced)? Like at the end of state_sync.py:

assert tracker.check("transition to State Sync")

@mfornet
Member Author

mfornet commented Sep 25, 2020

Can we add the assert that the node actually state_synced (not block synced)? Like at the end of state_sync.py:

assert tracker.check("transition to State Sync")

http://nayduck.eastus.cloudapp.azure.com:3000/#/run/401

@nearprotocol-bulldozer nearprotocol-bulldozer bot merged commit 48cc4f6 into master Sep 25, 2020
@nearprotocol-bulldozer nearprotocol-bulldozer bot deleted the massive_state_sync_increase_epoch branch September 25, 2020 22:36
Development

Successfully merging this pull request may close these issues.

state_sync_massive.py fails with timeout