-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Istanbul BFT's design cannot successfully tolerate fail-stop failures #305
Comments
cc Istanbul BFT's author, @yutelin |
Any development here? I’m quite curious |
Probably no 😞 |
Hello, do you have any plans to make it work? I find the same issue |
The only way I know is to add some mechanisms to "unlock" the lock, but that will make the Istanbul BFT basically become Tendermint. IMHO, Istanbul BFT is just an incorrect variant of Tendermint. Basically the only difference between Tendermint and Istanbul BFT is that Istanbul BFT do not have "Proof of Lock Change"/"Polka" thus it have the possibility of deadlock. |
Any update on this issue? |
@saltiniroberto published Correctness Analysis of IBFT which also points out this liveness issue. Two possible directions to fix the issue (without proof) are mentioned in section 6.2.1 and 6.2.2. You can take a look if you are interested. |
@D4nd @sifmelcara please would you be able to provide details about your network setup where this issue occurred? Also when you experienced this issue, do you have any log files by any chance from that time indicating that nodes experienced this particular issue. |
If have 4 nodes A B C D is that right? @jbhurat |
@Ywmet yes I misunderstand the meaning of the rule mentioned in ethereum/EIPs#650: So, actually this make things even worse: They don't satisfy both safety and liveness properties. For further details and proof of no safety and no liveness, please see Correctness Analysis of IBFT |
Note that after step 3 when honest nodes timeout and trigger a round change, the node that already is locked with a hash will also receive round change message and will trigger a round change. if it receives 2F + 1 round change messages it will trigger a round change. If it already has a locked hash it will re-propose that only if it is also the new proposer, it will check if something is pending in the current view and will resend that as a pre-prepare message. Note that due to round robin proposer selection , this could be false and it will simply update the round change timer and become part of the new round change process. Also note that If the validator has failed and do not recover / restart then this node is out of the network as if it does not exist. It does not matter if it had anything locked or not, If it turns byzantine and do not send a round change message then it simply means that this validator did not receive a round change from others and rest of the network will continue to operate being valid BFT (3F+1 so far) and this validator will be treated as Byzantine and simply means that we consider this out of the network / byzantine and network will tolerate this one faulty node. Even before step 3, note that we are implying that ALL other validators except one has managed to receive 2F + 1 and other have not, this is an extreme scenario and imagine in a 4 validator network , we are saying that 75% of the network has partitioned, how likely is that scenario? |
@drequinox Yes, what you said looks correct. But I don't understand the reason why you point out these facts. Do these facts prove/disprove safety or liveness?
In fact this is very likely to happen during GST. Imagine you set the time-out parameter as 0.5 second to have good latency/throughput during normal operation, but then someday someone DoS your nodes and make average transmission delay become 1 second. |
Even if we assume this unlikely scenario, after step 3 the 'locked node' will also participate in round change, therefore the network will recover. There are reasonable assumptions that need to be made in order to provide safety and liveness.
Partial synchrony reasonably assumes that the system eventually has adequate time periods during which the algorithms can terminate. If we are assuming a totally unreliable communication (75% of the network is down!) then it is as bad as an asynchronous one and even FLP impossibility could apply in that case. |
If so,this issue can close? |
Sorry I forgot to reply.
'locked node' participate in round change isn't enough to prove safety and liveness property.
I'm not assuming unreliable communication. My point is that if specific sequence of message got lost for some reason, the Blockchain can fork (different nodes decide on different block) and cannot recover. This is totally not ok. Edit: |
Round change provides liveness due to a number of techniques it uses. If a locked node receives the round change message, it will check if there is already a locked hash, if there is, it will repropose that. This is also only going to happen if the validator is also next proposer after locking the previous hash. note that with round-robin this is an unlikely scenario to occur. now imagine the validator has a locked block and it has received the round change message and it is not the new proposer then it will update the round change timer and wait for further messages in the protocol, now even then if it times out at this stage, it will still then request a new round change which will put this validator in the round change process again.
If at any point, more than 75% of the network is partitioned, then the problem is not with the consensus protocol, in that case, the problem is unreliable communication. however, I see your point regarding If 'a specific set of messages' is lost or delayed. In that case, then the timeouts and round change mechanism provide enough resiliency to tolerate such scenarios and provide safety. Also, note that BFT assumes partial synchrony, which indeed is a reasonable practical assumption. |
Yes the nodes will keep sending messages to each other, but does this have anything to do with liveness? By saying "liveness", I'm referring to the property that eventually all nodes finalize a block.
This is the case for PBFT and Tendermint, but not Istanbul BFT. For detailed proof of violation of safety and liveness property, please refer to the paper I mention in previous replies. |
All nodes will eventually finalize a block under partial synchrony model. even if for some reason a node is unable to insert a block the old block will be synchronized later. This synchronization is also subject to IBFT validation checks. Therefore, unless the communication is severely hampered there cannot be a scenario where a node is 'out of synch' with other nodes.
Under correct partial synchrony assumptions, IBFT does not violate safety or liveness properties. The original scenario (#305 (comment)) mentioned in this issue has been analyzed and there is no credible evidence found that this issue can occur. |
Yes that scenario actually cannot happen because I misinterpret (I accidentally fixes safety issue for them...) the rule mentioned in ethereum/EIPs#650:
No, it violates safety and liveness properties.
You don't have to analyse IBFT because there are already counter-example of safety property and counter-example of liveness in the paper I mentioned in previous comments. If the content of that paper looks wrong to you, please point it out so we can discuss. |
We continuously review all aspects of Quorum to enhance them. |
drequinox, You simply cannot refute an issue raised in math by arguing with words. There are only two ways to proceed from here if you wish to refute the safety and liveness concerns for IBFT:
It would be more helpful if you could explain why you don't believe the analysis holds in Quorum's implementation. |
I'd like to remind everyone that Quorum is an evolving platform in that we are constantly working on improving every part of the system including consensus algorithms. To that end, we do listen to comments from our users, public papers, and all other sources of feedback. This certainly includes Istanbul BFT implementation. While we do appreciate all of the feedback, we have prioritized to date reported bugs we need to address in the near term. Based on the multiple papers we referenced and the answers we provided, it’s clear that we have been investing time in reviewing the IBFT implementation. Most likely, based on the review of all the sources, theoretical limitation in this case exist in IBFT but to date such occurrences haven’t been reported. We have been also analyzing the other protocols that inspired the algorithm to determine some of the limitations in IBFT and we will address them as practically as possible. We welcome all pull requests addressing any issues. Further theoretical discussion on the topic is considered unhelpful as the issue and our standing on it has been discussed previously. |
@sifmelcara we have published an updated design for IBFT in Quorum by @hmoniz and we have made the code associated with the updated design available under a Quorum branch. The implementation is still WIP but we would appreciate any feedback on the updated algorithm. |
We faced a similar issue on NEO Blockchain and solved that on dBFT 2.0. Recently we also developed a mixed-integer linear programming model in order to verify liveness on such cases: "A MILP Model for a Byzantine Fault Tolerant Blockchain Consensus" available at https://www.mdpi.com/1999-5903/12/11/185 |
After studying Istanbul BFT's specification and implementation, we strongly suspect that the system fails to have the desirable liveness property.
Basically, in the current Istanbul protocol, if different nodes experience different transmission delays,
then it may lead to the situation where different nodes lock on different blocks.
Under such a situation, since the current Istanbul protocol has no proper unlocking mechanism, there are cases where the system will never have +2/3 nodes reaching consensus on the same height.
In particular, the consensus process can get stuck in a useless loop under certain scenarios in which the faulty nodes all do a fail-stop.
The main reason is as follows.
Explanation of 1.
Step 1: Proposer propose block
A
.Step 2: One node of the honest nodes, called
p
, receives2F+1
"PREPARE", while other honest nodes receives< 2F+1
"PREPARE". (sop
locks onA
)Step 3: All honest nodes timeout and goto
Round change
.Step 4: Proposer of the next round proposes block
B
.Step 5: Another honest node (other than
p
) receives2F+1
"PREPARE" and lock onB
, while other honest nodes timeout.Explanation of 2.
Step 1:
F
faulty nodes simply do a fail-stop.Step 2: In Block locking mechanism (ethereum/EIPs#650) , Istanbul PBFT mentions a rule:
Thus, we will never have
2F+1
nodes sends "PREPARE" vote for a same block.The system enters a cycle of
new round -> pre-preared -> prepared -> round change -> new round
.The text was updated successfully, but these errors were encountered: