[BUG REPORT] Pocket nodes getting stuck during sync #1478
@oten91 Do you have any insight on:
1. Hey, I will try to summarize what I observed yesterday, along with some anecdotal information.
2. Avoiding this is a complex subject; some of what is happening on the network has more to do with misconfiguration and node runners not fully understanding the shared cost of their individual actions. The current reality is that the existence of LeanPOKT, and the careless use of it, makes it even harder for us to put barriers in place to avoid this kind of scenario. In the past, this type of issue was isolated to a small set of nodes, since each node was its own process. In our new reality, with people stacking nodes that share the same data (p2p/evidence/state), we are far more exposed to this kind of issue. Education is a good measure, but harsher on-chain penalties may work better in the long run. What is worrying is that these events happened in the past, but the scale and frequency at which they are happening now makes the matter more serious. The previous report of this happening was at height 71273.
3. The tested and true "restore from a backup" is the only recommended path. Also, the stuck nodes having a full mempool is expected, as their p2p layer is not fully aware that consensus has stopped due to the panic recovery handling.
@oten91 By "The current reality is that the existence of LeanPOKT and the careless use of it, makes it even harder for us to put barriers to avoid this kind of scenarios," I assume you mean that some node runners are using LeanPOKT to run a lot of validators on the same machine. Is that correct? Also, is it possible to identify the validators that suddenly went offline?
@oten91 Thank you for the detailed answer and explanation. I have a few general questions to increase my personal understanding, as someone who has spent less time working with the Cosmos implementation of Tendermint BFT.
Q1: Does this mean the data dir gets irreversibly corrupted by uncommitted blocks?
Q2: Is this a result of the fact that our Tendermint fork is 2 years old, or is it still the case in the Cosmos SDK today?
I read through the Tendermint BFT consensus algorithm and am trying to understand what happened here. There were sufficient votes being broadcast throughout the network, and +2/3 (enough to maintain liveness) voted in the same way, so why did that not become the "canonical chain"?
Q3: My understanding is that it is this point/reason (see the image below). Is this correct, or am I missing something?
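To make the +2/3 threshold referenced in the question concrete, here is a minimal, self-contained sketch. This is not Pocket or Tendermint code; the `Vote` type, `hasTwoThirdsMajority`, and the voting-power numbers are made up for illustration of how a commit majority is checked:

```go
package main

import "fmt"

// Vote is a simplified stand-in for a Tendermint precommit; the real types
// live in the consensus package and carry far more information.
type Vote struct {
	ValidatorPower int64
	BlockHash      string
}

// hasTwoThirdsMajority reports whether the precommits for a single block hash
// exceed 2/3 of the total voting power, which is the threshold Tendermint
// requires before a block can be committed.
func hasTwoThirdsMajority(votes []Vote, blockHash string, totalPower int64) bool {
	var signed int64
	for _, v := range votes {
		if v.BlockHash == blockHash {
			signed += v.ValidatorPower
		}
	}
	// Strictly greater than 2/3: 3*signed > 2*totalPower avoids float rounding.
	return 3*signed > 2*totalPower
}

func main() {
	votes := []Vote{
		{ValidatorPower: 40, BlockHash: "A"},
		{ValidatorPower: 35, BlockHash: "A"},
		{ValidatorPower: 25, BlockHash: "B"},
	}
	fmt.Println(hasTwoThirdsMajority(votes, "A", 100)) // true: 75 > 66.6
}
```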
+1 to education. However, personally, I believe that slashing is the simple, low-hanging, "crypto" approach to this. It will obviously be up to the DAO to decide. It's worth noting that this problem extends beyond just Pocket and LeanPOKT, across the entire crypto industry. This post summarizes it well IMO:
Yes.
I believe (haven't checked) it should be available via on-chain data if the blockchain continued to make progress while the validator was down, but otherwise it would require inspecting the mempool at the time the error was happening.
Q4: Deferring to @oten91 in case I missed something here? We would probably need to increase
For reference, here's a reference from the slashing module:
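As a rough sketch of the "available via on-chain data" idea: if the chain kept producing blocks, each block's commit records which validators signed it, so scanning a height range for missing signatures surfaces the validators that dropped off. Everything below (`BlockCommit`, `missedCounts`, the sample heights) is hypothetical and only illustrates the approach, not an actual pocket-core or Tendermint API:

```go
package main

import "fmt"

// BlockCommit is a hypothetical, simplified view of the commit data available
// on-chain for each height: the set of validator addresses whose precommit
// signatures made it into the block.
type BlockCommit struct {
	Height  int64
	Signers map[string]bool
}

// missedCounts tallies, per validator, how many blocks in the given range lack
// that validator's signature. Validators that suddenly went offline show up
// with a run of misses starting at the height where they stopped.
func missedCounts(commits []BlockCommit, validators []string) map[string]int {
	missed := make(map[string]int)
	for _, val := range validators {
		missed[val] = 0
	}
	for _, c := range commits {
		for _, val := range validators {
			if !c.Signers[val] {
				missed[val]++
			}
		}
	}
	return missed
}

func main() {
	commits := []BlockCommit{
		{Height: 71480, Signers: map[string]bool{"val1": true, "val2": true}},
		{Height: 71481, Signers: map[string]bool{"val1": true}},
		{Height: 71482, Signers: map[string]bool{"val1": true}},
	}
	fmt.Println(missedCounts(commits, []string{"val1", "val2"})) // map[val1:0 val2:2]
}
```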
Answering here:
Sadly yes. The block is not processed by the node but is stored, modifying the expected AppHash. After that happens, a rollback is needed for the node to "forget" the last-round information, as it was not expecting a block without all the required signatures (+2/3).
In this case, from the perspective of the nodes that got stuck, the majority was wrong. This would have created a fork in other types of consensus; with Tendermint's "fork-resistant" qualities, which should also be present in the latest version, the result would probably be the same. One thing I believe is that this may be handled better in newer versions, as they incorporate a "staging" state that is written to before the actual one in these scenarios, so it may not leave the DB in a bad state even if it halts.
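For readers following along, here is a minimal illustration (assumed names and signature; the real check lives in the Tendermint state machine and uses different types) of the comparison that produces the `wrong Block.Header.AppHash` consensus failure and forces the rollback or restore described above:

```go
package main

import (
	"bytes"
	"fmt"
)

// verifyAppHash illustrates the check behind "wrong Block.Header.AppHash":
// the AppHash carried in a block header must equal the hash the local
// application state produced. The names here are illustrative, not the
// actual pocket-core/tendermint code.
func verifyAppHash(headerAppHash, localAppHash []byte) error {
	if !bytes.Equal(headerAppHash, localAppHash) {
		// In the real node this surfaces as a consensus failure; the node then
		// needs a rollback (or a restore from backup) so its last-round state
		// matches the chain again.
		return fmt.Errorf("wrong Block.Header.AppHash: expected %X, got %X",
			localAppHash, headerAppHash)
	}
	return nil
}

func main() {
	local := []byte{0xAB, 0xCD}
	fromHeader := []byte{0xAB, 0xCE}
	fmt.Println(verifyAppHash(fromHeader, local))
}
```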
For this particular scenario, from the perspective of the nodes that got stuck, the correct one is the last bullet point:
@oten91
This is a fix for a consensus failure with `wrong Block.Header.AppHash`, which is probably the same issue as #1478.

### Issue description:

For a node to be selected for a session, it must be valid not only at the start of the session but also at its end. If a node is edit-staked, jailed, or unjailed in the middle of a session, the servicer set of that session changes. When a node serves relays, on the other hand, it stores session information in the global session cache. If the session's servicer set has changed, the node needs to update the cached session. Otherwise the node accepts a claim that should be rejected, or rejects a claim that should be accepted. As a result, the node ends up with a consensus failure with `wrong Block.Header.AppHash`.

### Root cause and Fix:

The root cause is simple. We have a call to `ClearSessionCache` in `EditStakeValidator`, `LegacyForceValidatorUnstake`, `ForceValidatorUnstake`, and `JailValidator`, but not in `UnjailValidator`. The proposed fix is to add the call in `UnjailValidator` as well.
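A hedged sketch of where the proposed call would go. Only `ClearSessionCache` and `UnjailValidator` come from the description above; the `Keeper` receiver, its fields, and the signatures are assumptions for illustration and will differ from the actual pocket-core source:

```go
package keeper

// Keeper stands in for the module keeper that owns the global session cache;
// the field below is assumed for illustration only.
type Keeper struct {
	sessionCache map[string]interface{}
}

// ClearSessionCache drops every cached session so the next lookup recomputes
// the servicer set from current on-chain state.
func (k *Keeper) ClearSessionCache() {
	k.sessionCache = make(map[string]interface{})
}

// UnjailValidator shows where the missing call goes. EditStakeValidator,
// LegacyForceValidatorUnstake, ForceValidatorUnstake, and JailValidator already
// clear the cache; unjailing changes the servicer set just as much, so it must too.
func (k *Keeper) UnjailValidator(valAddr string) {
	// ... existing unjail logic (flip the jailed flag, restore to staking set) ...

	// Proposed fix: invalidate cached sessions whose servicer set may have changed.
	k.ClearSessionCache()
}
```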
Describe the bug
The node runner community is reporting stuck nodes. Also, on restart the DB is corrupt and the node will not sync.
To Reproduce
Not sure how to reproduce. Node runners are reporting the chain stopped syncing around block 71479, and also 68389.
Expected behavior
Pocket nodes should stay synced.
Screenshots
Snippets from Pocket Network #node-runners chat:
Operating System or Platform:
Please indicate the platform(s) where the bug is happening.
Additional context
The community thread is here, and discussion is sprinkled throughout the #node-runners channel on September 22nd and 23rd.
This task is tracked internally with tickets T-1441, T-14354