Devs SBAT be sure of Istanbul liveness #116

timmoreton · 2019-02-26T20:14:42Z

Istanbul has a liveness issue described in:
https://arxiv.org/pdf/1901.07160.pdf
https://drive.google.com/open?id=1v5kOvu5eppGHXdHqHoCIUMVmV7eAsDwpLljDtvVBR_U

and also described here:
ethereum/EIPs#650 (comment)

Istanbul learns of new blocks via miner/worker.go line 428, which results in a new FinalCommittedEvent being posted to the engine, handled at backend/handler.go line 123. Currently this just starts a new round.

timmoreton · 2019-02-26T20:18:57Z

I believe (but haven't checked) that the Pantheon Istanbul implementation fixes this -- see here for more info: https://github.com/PegaSysEng/pantheon/tree/master/consensus/ibft

nambrot · 2019-03-05T20:01:19Z

The liveness issue seems to be as following.

Assume n validators, with Quorum(n) as the byzantine fault threshold.
Assume n-1 honest validators, and one byzantine validator F (doesn't even have to be malicious, just
Set W contains Quorum(n)-1 honest validators and the byzantine validator
Set V contains all the other honest validators, the size of V is by definition strictly smaller than Quorum(n)

The scenario is as following:

A valid proposer proposes Block B1 for height h, sends the PREPREPARE and gets everyone to send a PREPARE based upon that
Only the nodes in V receive QUORUM(n) PREPARE, and will thus lock onto B1 for height h, the nodes in W never receive enough PREPAREs for the lock
Everyone eventually will issue a ROUND-CHANGE after the timeout
Byzantine node F will propose block B2 (it never locked on B1)
All nodes in W will PREPARE on B2 and with F will achieve QUORUM(n) and lock onto B2
All nodes in V will issue ROUND-CHANGE as they have already locked on B1

No progress can be made since no one will unlock

nambrot · 2019-03-05T21:13:37Z

Also interesting comment openethereum/parity-ethereum#9298 (comment) and Consensys/quorum#305

nambrot · 2019-03-05T21:55:55Z

My understanding of the original locking mechanism in IBFT is to ensure the safety during round changes. Otherwise byzantine nodes can send multiple PREPARE and COMMITs to different subsets of honest nodes which will then cause honest nodes to agree on different blocks at the same height, although I'm not entirely sure how, if honest nodes only ever send out one PREPARE

It seems to me that the locking mechanism is inherently impossible to remedy with liveness, an example being Consensys/quorum#305.

The solution above as well as the second on in the correctness analysis paper is to basically implement what tendermint has, a proof that it is safe for the nodes to unlock.

The first solution in the paper is get rid of locking completely apparently, but not sure I completely understand it. To quote:

The key change to IBFT is to remove the locking mechanism and ensure safety between round changes via a round change sub-protocol similar to the one in the PBFT protocol

nambrot · 2019-03-06T23:42:19Z

I believe this is the equivalent liveness PR in pantheon PegaSysEng/pantheon#872

trianglesphere · 2019-03-09T00:53:01Z

As noted in the arxiv paper (section 6.2), there are two solutions to the liveness problem proposed in the paper.

1. Remove the locking and use a PBFT like round change.

Per this article IBFT 2.0 from PegaSys is going to use the PBFT style mechanism.

The round change modification addresses the liveness issues in IBFT 1.0. In IBFT 1.0, a round change always results in new block proposals. In IBFT 2.0, an enhanced round change protocol ensures that if any validator commits in a round, then the block proposed in any successive round matches the committed block. This mechanism, inspired by the PBFT view change protocol, removes the need for the block locking mechanism used in IBFT 1.0 which caused liveness issues.

Removing the locking mechanism solves the problem of getting a split locking situation, but it removes the safety guarantee.

IBFT 2.0 reintroduces the safety guarantee by introducing the following requirements on round changes. (PBFT view changes are more general, see section 4.4. of PBFT paper).

Each round change has a prepared certificate which includes proof that a block has 2n/3 valid prepare messages for it.
When the next proposer proposes a block, it must include a pre-prepare message with the new round message.
The pre-prepare message from the new proposer must propose the highest block that has a valid prepare certificate. If no such certificate exists, the new proposer can propose any block that it would like to.

In a PBFT context with multiple actions in a single view (analogous to a round), the new proposer needs to use prepare certificates rather than commit certificates (proof that a node has commited an action) because it appears possible for the new proposer to not get a commit certificate and thus safety is lost if a node has locally committed, but the proposer won't. Here is a good explanation of the view change in PBFT and why it needs the prepare certificates.

Looking into PegaSys's source code. When a message is broadcast, it can either be in round 0, or a later round. If my understanding is correct that at each block height a new instance of IBFT block finalization is started, then round starts off at 0 for each new block. Here is PegaSys's block validating for ibft2.0. This thing to pay attention to is validateProposalAndRoundChangeAreConsistent which requires no round change certificate in the case of the first round or having a valid round change cert in the case of actually changing a round. IBFT 2.0 may have relaxed prepare certificate to commit certificate, but I'm not quite sure yet.

2. Allow relocking on newer round like tendermint does.

The lock causes problems because some nodes end up locked forever because of network partitioning.

The reason that tendermint has locking and proof of lock change (polka) is explained here (thesis on tendermint) on page 23. I've just quoted relevant sections that explain why the lock and proof of lock change is required.

Explanation of locking

These rules can be understood more intuitively by way of examples. Consider four validators, A, B, C, D, and suppose there is a proposal for blockX at round R. Suppose there is a polka for blockX, but A doesn’t see it, and pre-commits nil, while the others pre-commit for blockX. Now suppose the only one to see all pre-commits is D, while the others, say, don’t see D’s pre-commit (they only see their two pre-commits and A’s pre-commit nil). D will now commit the block, while the others go to round R + 1. Since any of the validators might be the new proposer, if they can propose and vote for any new block, say blockY , then they might commit it and compromise safety, since D already committed blockX. Note that there isn’t even any Byzantine behaviour here, just asynchrony! Locking solves the problem by forcing validators to stick with the block they pre-committed, since other validators might have committed based on those pre-commits (as D did in this example). In essence, once more than two-thirds pre-commit a block in a round, the network is locked on that block, which is to say it must be impossible to produce a valid polka for a different block at a higher round. This is direct motivation for Prevote-the-Lock.

Explanation of why polka needs to exist

Prevote-the-Lock is not sufficient, however. There must be a way to unlock, lest we sacrifice liveness. Consider a round where A and B precommitted blockX while C and D pre-committed nil - a split vote. They all move to the next round, and blockY is proposed, which C and D prevote for. Suppose A is Byzantine, and prevotes for blockY as well (despite being locked on blockX), resulting in a polka. Suppose B does not see the polka and pre-commits nil, while A goes off-line and C and D pre-commit blockY . They move to the next round, but B is still locked on blockX, while C and D are now locked on blockY , and since A is offline, they can never get a polka. Hence, we’ve compromised liveness with less than a third (here, only one) Byzantine validators. The obvious justification for unlocking is a polka. Once B sees the polka for blockY (which C and D used to jusitfy their pre-commits for blockY ), it ought to be able to unlock, and hence pre-commit blockY . This is the motivation for Unlock-on-Polka, which allows validators to unlock (and precommit a new block), if they have seen a polka in a round greater than that in which they locked.

trianglesphere · 2019-03-21T17:57:41Z

I've started a small test to show how istanbul can lock up. The test doesn't run, but it has a comment explaining how to set up the test. The branch is trianglesphere/istanbul-liveness-failure

nambrot · 2019-04-23T22:08:01Z

Fun fact, quorum put Consensys/quorum#305 on their april deliverable

nambrot · 2019-07-19T15:27:58Z

There was some additional conversation on Consensys/quorum#305 with the author of the correctness analysis

timmoreton added the geth label Feb 26, 2019

cmcewen assigned nambrot Mar 6, 2019

trianglesphere self-assigned this Mar 13, 2019

timmoreton assigned timmoreton and unassigned trianglesphere Apr 24, 2019

timmoreton added the protocol label May 8, 2019

timmoreton assigned asaj and unassigned timmoreton and nambrot May 9, 2019

timmoreton added the istanbul label May 9, 2019

asaj mentioned this issue May 20, 2019

Add certificates to Istanbul BFT messages for liveness #200

Closed

asaj added the audit label Jun 18, 2019

asaj removed their assignment Jun 19, 2019

timmoreton assigned trianglesphere Jul 24, 2019

trianglesphere mentioned this issue Jul 30, 2019

Adds Prepared Certificates to ensure Istanbul liveness #366

Merged

asaj unassigned trianglesphere Sep 11, 2019

asaj assigned kevjue Sep 11, 2019

timmoreton added jumbo labels Sep 13, 2019

asaj unassigned kevjue Sep 18, 2019

asaj self-assigned this Sep 18, 2019

asaj closed this as completed in #366 Sep 24, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Devs SBAT be sure of Istanbul liveness #116

Devs SBAT be sure of Istanbul liveness #116

timmoreton commented Feb 26, 2019

timmoreton commented Feb 26, 2019

nambrot commented Mar 5, 2019

nambrot commented Mar 5, 2019 •

edited

Loading

nambrot commented Mar 5, 2019 •

edited

Loading

nambrot commented Mar 6, 2019 •

edited

Loading

trianglesphere commented Mar 9, 2019 •

edited

Loading

trianglesphere commented Mar 21, 2019

nambrot commented Apr 23, 2019

nambrot commented Jul 19, 2019 •

edited

Loading

Devs SBAT be sure of Istanbul liveness #116

Devs SBAT be sure of Istanbul liveness #116

Comments

timmoreton commented Feb 26, 2019

timmoreton commented Feb 26, 2019

nambrot commented Mar 5, 2019

nambrot commented Mar 5, 2019 • edited Loading

nambrot commented Mar 5, 2019 • edited Loading

nambrot commented Mar 6, 2019 • edited Loading

trianglesphere commented Mar 9, 2019 • edited Loading

1. Remove the locking and use a PBFT like round change.

2. Allow relocking on newer round like tendermint does.

Explanation of locking

Explanation of why polka needs to exist

trianglesphere commented Mar 21, 2019

nambrot commented Apr 23, 2019

nambrot commented Jul 19, 2019 • edited Loading

nambrot commented Mar 5, 2019 •

edited

Loading

nambrot commented Mar 5, 2019 •

edited

Loading

nambrot commented Mar 6, 2019 •

edited

Loading

trianglesphere commented Mar 9, 2019 •

edited

Loading

nambrot commented Jul 19, 2019 •

edited

Loading