For much improved transaction volume, latency, and finality, many blockchain systems utilize BFT consensus protocols. In particular, permissioned blockchains enjoy the full benefits of BFT consensus among a small (e.g., a few tens or hundreds) group of distributed nodes. One remaining problem though is the limited scalability of BFT consensus protocols. In particular, the many existing protocols, such as PBFT, Istanbul BFT, have quadratic message complexity to the number of replicas in the system, and this makes the blockchains hard to scale beyond a few tens of nodes participating in consensus. Some newer proposals have attempted to reduce the message complexity via committee selections, threshold signatures, and trusted hardware.
In this proposal, we ask whether the current network infrastructure is properly used for the BFT consensus protocols in permissioned blockchains and investigate if emerging network primitives can further improve the throughput performance. First, we plan to study whether the current Internet infrastructure is properly used for scalable operation of the Istanbul BFT consensus protocol in real large-scale blockchain networks like Klaytn. In particular, we will perform an in-depth evaluation on the effect of the current underlying Internet architecture of the commercial blockchains on their throughput performance. Furthermore, we aim to utilize emerging hardware-based networking primitives, such as programmable switches, to further speed up the message propagation in the Istanbul BFT consensus protocol.
For our KIR milestone #2, we submit this technical report on the liveness property of the IBFT-based Klaytn consensus. We focus on the robustness of general BFT-based blockchain consensus algorithms against a strong network adversary that can cause arbitrary message delays. From the review of theoretical studies of BFT consensus algorithms (e.g., PBFT), we aim to study the known liveness guarantees, or a lack thereof, in multiple variants of BFT based consensus algorithms. We then focus on the IBFT implemented for current Klaytn and provide some qualitative analysis on attack feasibility.
We first emphasize that, in general, liveness property is not guaranteed in most BFT-based consensus protocols. Existing BFT consensus protocols, such as PBFT and IBFT, prioritize the safety property as their primary goal. This is because any discrepancy in the agreed total ordering between the replicas in the system is considered strictly unacceptable. Particularly, this is true when these BFT consensus protocols are used for blockchain applications where the consistent views of a blockchain in the system is considered the most important security property.
That said, the liveness property of BFT consensus protocols is also important for blockchain applications because otherwise blockchains may become unavailable beyond the promised maximum waiting time. This is particularly important when a blockchain system is used for real-time payment applications, where even a few second delays can cause significant damage to the entire business.
PBFT guarantees liveness when a weak synchronous network is assumed. In particular, liveness is ensured unless message delays grow faster than the timeout period
indefinitely, which is unlikely in a real system . In the original PBFT design, the timeout period doubles at every view change, and this makes the system eventually live when network delays do not catch up with the doubling timeouts.
Let us focus on the IBFT protocol , the central consensus protocol of the Klaytn system. Since IBFT is designed based on PBFT, its safety and liveness properties and conditions for them are similar to those in PBFT; however, there exists some differences in their details.
One difference we want to highlight is the block locking mechanism in IBFT. The block locking mechanism has been introduced to enhance the safety property of IBFT. However, there have been discussions about potential risks of deadlock due to the block locking mechanism in IBFT.
In 2018, a few developers discussed a potential deadlock problem in IBFT . It was argued that the locking mechanism of IBFT consensus protocol can potentially create a lock situation and in some worst cases even deadlocks. To be specific, it was raised that two non-faulty, honest nodes can lock on different blocks in some certain delayed networks. Then, the nodes do not unlock the block so the system stops processing any blocks for a while. This potential liveness problem in IBFT has been also analyzed in an arXived report .
To address the above-mentioned potential deadlock problem in IBFT, Saltini et al. in  proposes two possible approaches:
- PBFT-like approach. IBFT can remove the block locking mechanism from its consensus protocol to avoid the potential deadlock problem. Instead, it can add prepared certificates in round-change messages. If at least one valid prepared certiﬁcate exists in the set of round-change messages, then the proposer must send preprepare message including the block that has a valid prepared certiﬁcate with the highest round number.
- Tendermint-like approach. Alternatively, IBFT may take an approach used in Tendermint . It can add a locked round value to pre-prepare message, which is the latest round number of non-faulty validators locked on the block. It performs relocking when a validator receives⎡2n/3⎤prepare messages for round r and 1 pre-prepare message with locked round is r, when r is higher than current round.
In fact, a more recent variant of IBFT, called IBFT 2.0 , has adopted the first approach above and removed the block locking mechanism, thus avoiding the potential deadlock problem mentioned in Section 3.1. This 2.0 protocol was adopted and implemented as one of the consensus protocols in Pantheon , which has become Hyperledger Besu, one of the Ethereum client implements EEA (Enterprise Ethereum Alliance) specification .
We finally discuss the Klaytn’s liveness property. Klaytn is based on the original IBFT implementation  and thus many of the discussions in Section 3 can be applied to Klaytn. In this project, we first begin with the Klaytn’s current consensus implementation and evaluate its liveness property. We plan to evaluate whether the above-mentioned IBFT deadlock concerns are also applicable to Klaytn later.
Our initial analysis of the current Klaytn codebase is inspired by the actual Klaytn disruption incident that happened in March 2020 . During the incident, Klaytn stopped working for about 9 hours. Klaytn has increased the default delay from 3 seconds to 10 seconds, which seemingly has been effective so far. Since we have no access to any additional data about the incident (other than the publicly available information in ), we do not aim to reproduce or debug the incident. Instead, we attempt to investigate the Klaytn codebase and find any potential issues in the code.
Let us sketch the high-level description of one conjecture we pose regarding the liveness property of Klaytn. Note that this is only a conjecture and it has not been thoroughly evaluated with the Klaytn codebase yet. Thus, any hasty conclusion about the liveness property of Klaytn should not be drawn unless much more rigorous experiments are conducted and confirmed.
In our investigation, we focus on the function handlePreprepare. This function is called when a validator gets a pre-prepare message from the proposer. It verifies the block proposed by the proposer and, if authenticated, goes through some processes of accepting it. At this point, there are two cases in which the validator believes that it will not reach an agreement in this round and wants to move on to the next round: (1) proposal verification is failed, or (2) the current locked hash is different from the received hash. See Figure 1 below.
Figure 1. The state of c.waitingForRoundChange sets to true.
In any case, a validator sends a round change message and changes its c.waitingForRoundChange to true. Figure 2 below explains specific steps on how this really happens.
Figure 2. handleRoundChange function
When a validator receives a round-change message, the handleRoundChange function is called. In this function, when the current validator is waiting for round change (c.waitingForRoundChange == true, and it is set true when it sent round change message) and receives (f+1) round change message with same numbers larger than the current round number(num == f+1), it sends round change message.
Now, let us assume that 8 of Validators’ blocks are locked in state S and the others are not. The round change timer ends and moves on to the next round. Suppose that these 8 CNs are included in the validators in the new round, and the proposer is not among these 8. Then, these 8 CNs would send round change messages and the other 14 CNs would receive these round change messages. Even though 14 CNs got (f+1) round change messages, they are not waiting for round change(c.waitingForRoundChange != true), and they will just gossip about the message. It cannot move on to the new round and is stuck in idle situations until the round change timer expires.
To sum up, through our investigation of the current Klaytn codebase, we pose a conjecture that if n (f+1 n < 2f+1) CNs are locked and the proposer is not locked, and this condition holds for subsequent rounds, the round change time grows exponentially, potentially creating extended deadlock incidents.
For our next step, we plan to design an experiment setup with a network adversary that can arbitrarily delay consensus messages to test our conjecture regarding the liveness property of Klaytn.
 M.Castro and B.Liskov. “Practical Byzantine Fault Tolerant”. Operating Systems Design and Implementation, 1999.
 Saltini, R., & Hyland-Wood, D. (2019). Correctness analysis of IBFT. arXiv preprint arXiv:1901.07160.
 Saltini, Roberto. “IBFT liveness analysis.” 2019 IEEE International Conference on Blockchain (Blockchain). IEEE, 2019.
 Buchman, E. (2016). Tendermint: Byzantine fault tolerance in the age of blockchains (Doctoral dissertation).
 Saltini, Roberto, and David Hyland-Wood. “IBFT 2.0: A safe and live variation of the IBFT blockchain consensus protocol for eventually synchronous networks.” arXiv preprint arXiv:1909.10194 (2019).