Istanbul BFT's design cannot successfully tolerate fail-stop failures #305

sifmelcara · 2018-03-05T10:49:04Z

After studying Istanbul BFT's specification and implementation, we strongly suspect that the system fails to have the desirable liveness property.

Basically, in the current Istanbul protocol, if different nodes experience different transmission delays,
then it may lead to the situation where different nodes lock on different blocks.
Under such a situation, since the current Istanbul protocol has no proper unlocking mechanism, there are cases where the system will never have +2/3 nodes reaching consensus on the same height.

In particular, the consensus process can get stuck in a useless loop under certain scenarios in which the faulty nodes all do a fail-stop.

The main reason is as follows.

Two honest nodes can lock on different blocks
After two honest nodes lock on different blocks, they may never unlock.
Because of (2.), the whole system can get stuck.

Explanation of 1.

Step 1: Proposer propose block A.
Step 2: One node of the honest nodes, called p, receives 2F+1 "PREPARE", while other honest nodes receives < 2F+1 "PREPARE". (so p locks on A)
Step 3: All honest nodes timeout and goto Round change.
Step 4: Proposer of the next round proposes block B.
Step 5: Another honest node (other than p) receives 2F+1 "PREPARE" and lock on B, while other honest nodes timeout.

Explanation of 2.

Step 1: F faulty nodes simply do a fail-stop.
Step 2: In Block locking mechanism (ethereum/EIPs#650) , Istanbul PBFT mentions a rule:

when a validator is locked, it can only vote on B for height H.

Thus, we will never have 2F+1 nodes sends "PREPARE" vote for a same block.
The system enters a cycle of new round -> pre-preared -> prepared -> round change -> new round.

The text was updated successfully, but these errors were encountered:

sifmelcara · 2018-03-05T10:51:55Z

cc Istanbul BFT's author, @yutelin

vietlq · 2018-12-01T20:59:38Z

Any development here? I’m quite curious

sifmelcara · 2018-12-03T14:31:17Z

Probably no 😞
ethereum/EIPs#650 here is the algorithm's spec I followed, and looks like they still have not fix this issue yet.

csfl · 2018-12-13T13:21:56Z

Probably no 😞
ethereum/EIPs#650 here is the algorithm's spec I followed, and looks like they still have not fix this issue yet.

Hello, do you have any plans to make it work? I find the same issue

sifmelcara · 2018-12-15T16:54:00Z

The only way I know is to add some mechanisms to "unlock" the lock, but that will make the Istanbul BFT basically become Tendermint.

IMHO, Istanbul BFT is just an incorrect variant of Tendermint. Basically the only difference between Tendermint and Istanbul BFT is that Istanbul BFT do not have "Proof of Lock Change"/"Polka" thus it have the possibility of deadlock.

osouza-de · 2019-05-15T00:38:37Z

Any update on this issue?

sifmelcara · 2019-05-15T07:03:15Z

@saltiniroberto published Correctness Analysis of IBFT which also points out this liveness issue. Two possible directions to fix the issue (without proof) are mentioned in section 6.2.1 and 6.2.2. You can take a look if you are interested.

drequinox · 2019-05-15T11:10:54Z

@D4nd @sifmelcara please would you be able to provide details about your network setup where this issue occurred? Also when you experienced this issue, do you have any log files by any chance from that time indicating that nodes experienced this particular issue.
we are working on a correctness proof of the algorithm and if any changes are required, we will implement those.

Ywmet · 2019-05-30T06:34:14Z

@sifmelcara

After studying Istanbul BFT's specification and implementation, we strongly suspect that the system fails to have the desirable liveness property.

Basically, in the current Istanbul protocol, if different nodes experience different transmission delays,
then it may lead to the situation where different nodes lock on different blocks.
Under such a situation, since the current Istanbul protocol has no proper unlocking mechanism, there are cases where the system will never have +2/3 nodes reaching consensus on the same height.

In particular, the consensus process can get stuck in a useless loop under certain scenarios in which the faulty nodes all do a fail-stop.

The main reason is as follows.

Two honest nodes can lock on different blocks

After two honest nodes lock on different blocks, they may never unlock.

Because of (2.), the whole system can get stuck.

Explanation of 1.

Step 1: Proposer propose block A.
Step 2: One node of the honest nodes, called p, receives 2F+1 "PREPARE", while other honest nodes receives < 2F+1 "PREPARE". (so p locks on A)
Step 3: All honest nodes timeout and goto Round change.
Step 4: Proposer of the next round proposes block B.
Step 5: Another honest node (other than p) receives 2F+1 "PREPARE" and lock on B, while other honest nodes timeout.

Explanation of 2.

Step 1: F faulty nodes simply do a fail-stop.
Step 2: In Block locking mechanism (ethereum/EIPs#650) , Istanbul PBFT mentions a rule:

when a validator is locked, it can only vote on B for height H.

Thus, we will never have 2F+1 nodes sends "PREPARE" vote for a same block.
The system enters a cycle of new round -> pre-preared -> prepared -> round change -> new round.

If have 4 nodes A B C D
1st round A Lock on block x
2nd round B lock on block y.
3rd round and 4th round node C and node D won't have 2/3 vote for prepare and timer trigger the round change.
But round change in 5th, A will unlock 1 round propose and propose the block x in the same height.
I think will not happy what you say.

is that right? @jbhurat

sifmelcara · 2019-05-31T09:50:05Z

@Ywmet yes I misunderstand the meaning of the rule mentioned in ethereum/EIPs#650: Unlock: A validator is unlocked at height H and block B when it fails to insert block B to blockchain.

So, actually this make things even worse: They don't satisfy both safety and liveness properties.

For further details and proof of no safety and no liveness, please see Correctness Analysis of IBFT

drequinox · 2019-05-31T13:36:22Z

Note that after step 3 when honest nodes timeout and trigger a round change, the node that already is locked with a hash will also receive round change message and will trigger a round change. if it receives 2F + 1 round change messages it will trigger a round change. If it already has a locked hash it will re-propose that only if it is also the new proposer, it will check if something is pending in the current view and will resend that as a pre-prepare message. Note that due to round robin proposer selection , this could be false and it will simply update the round change timer and become part of the new round change process.

Also note that If the validator has failed and do not recover / restart then this node is out of the network as if it does not exist. It does not matter if it had anything locked or not, If it turns byzantine and do not send a round change message then it simply means that this validator did not receive a round change from others and rest of the network will continue to operate being valid BFT (3F+1 so far) and this validator will be treated as Byzantine and simply means that we consider this out of the network / byzantine and network will tolerate this one faulty node.

Even before step 3, note that we are implying that ALL other validators except one has managed to receive 2F + 1 and other have not, this is an extreme scenario and imagine in a 4 validator network , we are saying that 75% of the network has partitioned, how likely is that scenario?

sifmelcara · 2019-05-31T15:06:29Z

@drequinox Yes, what you said looks correct. But I don't understand the reason why you point out these facts. Do these facts prove/disprove safety or liveness?

this is an extreme scenario and imagine in a 4 validator network , we are saying that 75% of the network has partitioned, how likely is that scenario?

In fact this is very likely to happen during GST. Imagine you set the time-out parameter as 0.5 second to have good latency/throughput during normal operation, but then someday someone DoS your nodes and make average transmission delay become 1 second.

drequinox · 2019-05-31T17:53:42Z

@drequinox Yes, what you said looks correct. But I don't understand the reason why you point out these facts. Do these facts prove/disprove safety or liveness?

Even if we assume this unlikely scenario, after step 3 the 'locked node' will also participate in round change, therefore the network will recover. There are reasonable assumptions that need to be made in order to provide safety and liveness.

In fact this is very likely to happen during GST. Imagine you set the time-out parameter as 0.5 second to have good latency/throughput during normal operation, but then someday someone DoS your nodes and make average transmission delay become 1 second.

Partial synchrony reasonably assumes that the system eventually has adequate time periods during which the algorithms can terminate. If we are assuming a totally unreliable communication (75% of the network is down!) then it is as bad as an asynchronous one and even FLP impossibility could apply in that case.
Quorum networks are permissioned, hence a DoS won't be a concern but I do see your concern for public networks. Also, minimum block time is 1 second but most use cases have a higher number based on network topology.

Ywmet · 2019-06-05T07:59:39Z

If so，this issue can close？

sifmelcara · 2019-06-05T08:25:19Z

Sorry I forgot to reply.

Even if we assume this unlikely scenario, after step 3 the 'locked node' will also participate in round change, therefore the network will recover.

'locked node' participate in round change isn't enough to prove safety and liveness property.

If we are assuming a totally unreliable communication (75% of the network is down!) then it is as bad as an asynchronous one and even FLP impossibility could apply in that case.

I'm not assuming unreliable communication. My point is that if specific sequence of message got lost for some reason, the Blockchain can fork (different nodes decide on different block) and cannot recover. This is totally not ok.

Edit:
To be accurate, by saying "message got lost", I means message delay is so big that the node cannot receive it in time.

drequinox · 2019-06-06T08:33:38Z

'locked node' participate in round change isn't enough to prove safety and liveness property.

Round change provides liveness due to a number of techniques it uses. If a locked node receives the round change message, it will check if there is already a locked hash, if there is, it will repropose that. This is also only going to happen if the validator is also next proposer after locking the previous hash. note that with round-robin this is an unlikely scenario to occur. now imagine the validator has a locked block and it has received the round change message and it is not the new proposer then it will update the round change timer and wait for further messages in the protocol, now even then if it times out at this stage, it will still then request a new round change which will put this validator in the round change process again.

I'm not assuming unreliable communication. My point is that if specific sequence of message got lost for some reason, the Blockchain can fork (different nodes decide on different block) and cannot recover. This is totally not ok.

If at any point, more than 75% of the network is partitioned, then the problem is not with the consensus protocol, in that case, the problem is unreliable communication. however, I see your point regarding If 'a specific set of messages' is lost or delayed. In that case, then the timeouts and round change mechanism provide enough resiliency to tolerate such scenarios and provide safety. Also, note that BFT assumes partial synchrony, which indeed is a reasonable practical assumption.

sifmelcara · 2019-06-06T08:55:06Z

If a locked node receives the round change message, it will check if there is already a locked hash, if there is, it will repropose that. This is also only going to happen if the validator is also next proposer after locking the previous hash. note that with round-robin this is an unlikely scenario to occur. now imagine the validator has a locked block and it has received the round change message and it is not the new proposer then it will update the round change timer and wait for further messages in the protocol, now even then if it times out at this stage, it will still then request a new round change which will put this validator in the round change process again.

Yes the nodes will keep sending messages to each other, but does this have anything to do with liveness? By saying "liveness", I'm referring to the property that eventually all nodes finalize a block.

In that case, then the timeouts and round change mechanism provide enough resiliency to tolerate such scenarios and provide safety.

This is the case for PBFT and Tendermint, but not Istanbul BFT. For detailed proof of violation of safety and liveness property, please refer to the paper I mention in previous replies.

drequinox · 2019-06-06T17:01:39Z

Yes the nodes will keep sending messages to each other, but does this have anything to do with liveness? By saying "liveness", I'm referring to the property that eventually all nodes finalize a block.

All nodes will eventually finalize a block under partial synchrony model. even if for some reason a node is unable to insert a block the old block will be synchronized later. This synchronization is also subject to IBFT validation checks. Therefore, unless the communication is severely hampered there cannot be a scenario where a node is 'out of synch' with other nodes.

This is the case for PBFT and Tendermint, but not Istanbul BFT. For detailed proof of violation of safety and liveness property, please refer to the paper I mention in previous replies.

Under correct partial synchrony assumptions, IBFT does not violate safety or liveness properties. The original scenario (#305 (comment)) mentioned in this issue has been analyzed and there is no credible evidence found that this issue can occur.
We are continuously analysing IBFT and trying to improve it further where required. It is also worth noting that in Quorum we target private consortium blockchain networks where the attack model is quite different from public networks.

sifmelcara · 2019-06-06T17:15:16Z

Yes that scenario actually cannot happen because I misinterpret (I accidentally fixes safety issue for them...) the rule mentioned in ethereum/EIPs#650: Unlock: A validator is unlocked at height H and block B when it fails to insert block B to blockchain.. Unlocks when fails to insert block B to blockchain sounds so absurd because normally consensus algorithm only unlock when there exists some evidence that no one have decided on a block.

Under correct partial synchrony assumptions, IBFT does not violate safety or liveness properties.

No, it violates safety and liveness properties.

We are continuously analysing IBFT

You don't have to analyse IBFT because there are already counter-example of safety property and counter-example of liveness in the paper I mentioned in previous comments. If the content of that paper looks wrong to you, please point it out so we can discuss.

drequinox · 2019-06-06T18:43:27Z

We continuously review all aspects of Quorum to enhance them.
In our ongoing analysis, we saw a similar analysis done on the Tendermint consensus algorithm. Given that some features in IBFT were inspired by the same design as Tendermint, beside Clique and PBFT, we are working on analyzing the impact of the below papers findings as they are more practical and have some overlap with the findings of the IBFT analysis paper.
https://hal.archives-ouvertes.fr/hal-01881212/document
https://arxiv.org/pdf/1807.04938.pdf

prototypo · 2019-06-07T00:23:29Z

drequinox,

You simply cannot refute an issue raised in math by arguing with words. There are only two ways to proceed from here if you wish to refute the safety and liveness concerns for IBFT:

Challenge the assumptions made in the paper above, or
Challenge the analysis.

It would be more helpful if you could explain why you don't believe the analysis holds in Quorum's implementation.

jpmsam · 2019-06-07T02:05:10Z

I'd like to remind everyone that Quorum is an evolving platform in that we are constantly working on improving every part of the system including consensus algorithms. To that end, we do listen to comments from our users, public papers, and all other sources of feedback. This certainly includes Istanbul BFT implementation.

While we do appreciate all of the feedback, we have prioritized to date reported bugs we need to address in the near term. Based on the multiple papers we referenced and the answers we provided, it’s clear that we have been investing time in reviewing the IBFT implementation. Most likely, based on the review of all the sources, theoretical limitation in this case exist in IBFT but to date such occurrences haven’t been reported. We have been also analyzing the other protocols that inspired the algorithm to determine some of the limitations in IBFT and we will address them as practically as possible.

We welcome all pull requests addressing any issues. Further theoretical discussion on the topic is considered unhelpful as the issue and our standing on it has been discussed previously.

jpmsam · 2020-06-10T20:27:08Z

@sifmelcara we have published an updated design for IBFT in Quorum by @hmoniz and we have made the code associated with the updated design available under a Quorum branch. The implementation is still WIP but we would appreciate any feedback on the updated algorithm.

vncoelho · 2020-10-27T21:24:41Z

We faced a similar issue on NEO Blockchain and solved that on dBFT 2.0.
You may take a look at our recent paper: "Challenges of PBFT-Inspired Consensus for Blockchain and Enhancements over Neo dBFT " available at https://www.mdpi.com/1999-5903/12/8/129

Recently we also developed a mixed-integer linear programming model in order to verify liveness on such cases: "A MILP Model for a Byzantine Fault Tolerant Blockchain Consensus" available at https://www.mdpi.com/1999-5903/12/11/185

fixanoid assigned jbhurat Dec 4, 2018

fixanoid mentioned this issue Dec 4, 2018

Istanbul: bigger blockperiod value cause blockchain stuck #583

Closed

Krish1979 added the in progress label Feb 18, 2019

nambrot mentioned this issue Mar 5, 2019

Devs SBAT be sure of Istanbul liveness celo-org/celo-blockchain#116

Closed

Krish1979 added this to the QE - April Deliverable milestone Mar 15, 2019

fixanoid removed this from the QE - April Deliverable milestone Dec 4, 2019

jpmsam unassigned jbhurat Aug 25, 2020

jackma-cloud mentioned this issue Dec 28, 2020

lock&unlock：Some doubts tendermint/tendermint#5834

Closed

baptiste-b-pegasys removed the in progress label Sep 13, 2022

Mameta29 mentioned this issue Oct 4, 2023

The Round Change continues indefinitely, BUT no new block is being generated. Consensys/quorum-kubernetes#207

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Istanbul BFT's design cannot successfully tolerate fail-stop failures #305

Istanbul BFT's design cannot successfully tolerate fail-stop failures #305

sifmelcara commented Mar 5, 2018 •

edited by Krish1979

Loading

sifmelcara commented Mar 5, 2018

vietlq commented Dec 1, 2018

sifmelcara commented Dec 3, 2018

csfl commented Dec 13, 2018

sifmelcara commented Dec 15, 2018

osouza-de commented May 15, 2019

sifmelcara commented May 15, 2019

drequinox commented May 15, 2019 •

edited

Loading

Ywmet commented May 30, 2019

Explanation of 1.

Explanation of 2.

sifmelcara commented May 31, 2019

drequinox commented May 31, 2019 •

edited

Loading

sifmelcara commented May 31, 2019

drequinox commented May 31, 2019

Ywmet commented Jun 5, 2019

sifmelcara commented Jun 5, 2019 •

edited

Loading

drequinox commented Jun 6, 2019

sifmelcara commented Jun 6, 2019

drequinox commented Jun 6, 2019

sifmelcara commented Jun 6, 2019

drequinox commented Jun 6, 2019

prototypo commented Jun 7, 2019 •

edited

Loading

jpmsam commented Jun 7, 2019

jpmsam commented Jun 10, 2020 •

edited

Loading

vncoelho commented Oct 27, 2020 •

edited

Loading

Istanbul BFT's design cannot successfully tolerate fail-stop failures #305

Istanbul BFT's design cannot successfully tolerate fail-stop failures #305

Comments

sifmelcara commented Mar 5, 2018 • edited by Krish1979 Loading

Explanation of 1.

Explanation of 2.

sifmelcara commented Mar 5, 2018

vietlq commented Dec 1, 2018

sifmelcara commented Dec 3, 2018

csfl commented Dec 13, 2018

sifmelcara commented Dec 15, 2018

osouza-de commented May 15, 2019

sifmelcara commented May 15, 2019

drequinox commented May 15, 2019 • edited Loading

Ywmet commented May 30, 2019

Explanation of 1.

Explanation of 2.

sifmelcara commented May 31, 2019

drequinox commented May 31, 2019 • edited Loading

sifmelcara commented May 31, 2019

drequinox commented May 31, 2019

Ywmet commented Jun 5, 2019

sifmelcara commented Jun 5, 2019 • edited Loading

drequinox commented Jun 6, 2019

sifmelcara commented Jun 6, 2019

drequinox commented Jun 6, 2019

sifmelcara commented Jun 6, 2019

drequinox commented Jun 6, 2019

prototypo commented Jun 7, 2019 • edited Loading

jpmsam commented Jun 7, 2019

jpmsam commented Jun 10, 2020 • edited Loading

vncoelho commented Oct 27, 2020 • edited Loading

sifmelcara commented Mar 5, 2018 •

edited by Krish1979

Loading

drequinox commented May 15, 2019 •

edited

Loading

drequinox commented May 31, 2019 •

edited

Loading

sifmelcara commented Jun 5, 2019 •

edited

Loading

prototypo commented Jun 7, 2019 •

edited

Loading

jpmsam commented Jun 10, 2020 •

edited

Loading

vncoelho commented Oct 27, 2020 •

edited

Loading