The 8th KIR: NetS&P Lab (KAIST)_Securing and Improving BFT Consensus Protocols with Advanced Networking_Progress Report(4)


For much improved transaction volume, latency, and finality, many blockchain systems utilize BFT consensus protocols. In particular, permissioned blockchains enjoy the full benefits of BFT consensus among a small (e.g., a few tens or hundreds) group of distributed nodes. One remaining problem though is the limited scalability of BFT consensus protocols. In particular, the many existing protocols, such as PBFT, Istanbul BFT, have quadratic message complexity to the number of replicas in the system, and this makes the blockchains hard to scale beyond a few tens of nodes participating in consensus. Some newer proposals have attempted to reduce the message complexity via committee selections, threshold signatures, and trusted hardware.

In this proposal, we ask whether the current network infrastructure is properly used for the BFT consensus protocols in permissioned blockchains and investigate if emerging network primitives can further improve the throughput performance. First, we plan to study whether the current Internet infrastructure is properly used for scalable operation of the Istanbul BFT consensus protocol in real large-scale blockchain networks like Klaytn. In particular, we will perform an in-depth evaluation on the effect of the current underlying Internet architecture of the commercial blockchains on their throughput performance. Furthermore, we aim to utilize emerging hardware-based networking primitives, such as programmable switches, to further speed up the message propagation in the Istanbul BFT consensus protocol.

Key Deliverables - Technical report

For our KIR milestone #4, we submit this technical report on the progress of our on-going effort for testing the network degradation attacks against the IBFT based consensus algorithm, and exploring ways to reduce the message complexity of the IBFT implementation. First, we have improved our network degradation attacks in our SDN-based testing framework. We summarize our findings and present a new machine-learning based approach for the attacks. Second, we have explored several existing work to review the known ways to offload the consensus burden from the application layer to the network layer. We discuss our approach and summarize the feasibility with programmable switches.

1 Emulating Network Degradation for Testing Consensus Liveness

1.1 Network degradation in a test environment

On the SDN-based testing framework of network degradation (as we reported in the previous milestone), we have tested the network degradation capability for the IBFT consensus algorithm. The first task we have performed is to identify the message types based on the IP packet length of packets our network adversary observes in between CN nodes.

From our observation of IP packets between CNs and the source code analysis of the Klaytn’s IBFT implementation, we learn that the four IBFT message types have quite different packet lengths in general.

Non-trivial behaviors with RLPx. As we test our network degradation capability in the test environment, we have observed one interesting behavior in the peer-to-peer network in Klaytn’s IBFT. The IBFT implementation shares its lower-layer networking stack with Ethereum since Klaytn had forked Ethereum at its early stage. RLPx is a transport protocol, which is based on TCP, and it is used to communicate between peer nodes in IBFT. RLPx handles the basic encryption and authentication of messages between peers and the keys are shared between peers during connection establishment.

Understanding the RLPx protocol is crucial for network degradation attacks since some message exchange patterns are hard to understand without the knowledge of RLPx. Let us highlight two notable behaviors that are particularly specific to the RLPx protocol. First, each consensus message type is framed and sent to the network with multiple IP packets. This is due to the framing in RLPx that enables multiplexing of multiple capabilities over a single connection. RLPx defines p2p capabilities and assigns one or more of them to a single p2p connection. Without understanding the framing and capabilities in RLPx, network adversaries cannot exactly identify consensus messages between CNs. Second, when a single consensus message from CN1 to CN2 is dropped due to attacks, the same message type is often relayed through other CN nodes and eventually arrives at CN2. We notice this via multiple experiments in the test platform and found the reason is the additional gossiping mechanism (via GossipSubPeer) in the current IBFT implementation. This requires a network degradation adversary to drop not only a single message from a target CN-to-CN link but also disrupt other links to ensure that the targeted message type is completely removed in the round.

1.2 Consensus state inference problem for longitudinal network degradation attacks

As we have designed the network degradation targeting BFT based consensus algorithms, we realize an interesting research problem that any practical network adversary should overcome for longitudinal network degradation attacks. We describe the problem in this section and overview our approach to solving the problem.

Attack goals. Let us first clarify the goals of network degradation attackers. Network adversaries would wish to degrade the network message exchanges between consensus nodes or CNs so that the liveness of the consensus algorithm is significantly affected. There are two desired sub-goals of the attacks. First, the network degradation wishes to remain minimal (e.g., delaying or dropping only a few consensus messages) so that the attacks become hard to detect. Second, the network degradation wishes to remain effective for a longer period of time (e.g., few tens of minutes or hours) to make the target blockchain consensus algorithm unavailable for an extended period of time.

Basic attack strategy. Network degradation attacks can be easily implemented by first identifying message types of a target BFT consensus algorithm. For example, in IBFT, an adversary observes the lengths and timings of packets she conveys and infers the type of messages. This is crucial for our network degradation adversaries because only after this they can selectively degrade some message types depending on the attack strategies. In this discussion, we consider that adversaries would utilize the timing of observed packets between CNs but not their packet length. This is because packet lengths can be easily altered for obfuscation by the CNs (e.g., padding messages individually) and become less useful to the network adversaries. Unlike packet length, adversaries can still access the timing information of the consensus messages because BFT algorithms have very well structured message sequences. Altering the order of the message types in BFT consensus algorithms (e.g., sending a commit message before a prepare message) would break the reliable operation of the BFT consensus algorithms. In other words, BFT consensus protocols are typically not designed to tolerate out-of-order messages. Thus, just by monitoring message exchanges between CNs, an attacker can accurately infer the message types only with their timing information.

Once a network adversary identifies the message types between CN nodes, she can target a certain set of messages between targeted CN nodes and degrade their delivery. For example, to trigger repeated round changes, a network adversary may wish to drop prepare messages of a few carefully chosen CN pairs in each round. One round after another, the adversary changes the target CN pairs for dropping prepare messages. When successfully executed, this attack can trigger many round changes in a row, resulting in a longitudinal denial-of-service attack.

A non-trivial challenge of longitudinal attacks. The problem is that when some consensus messages are delayed or dropped in one round due to our attacks, some CN nodes lag behind and their states do not progress. This is what our network adversary has intended (thus, the attack is successfully executed in that round); however, this affects the operation of CN nodes in the next round. To be more specific, when some CN nodes are behind and remain in the previous round and some already have progressed to the next round, their behaviors differ significantly and their message timings also change. What this means to our network adversary is that she can no longer rely on the simple timing-based message type inference to learn the types of consensus messages between CN nodes. This renders our basic attack strategy for network degradation significantly less effective after one round of consensus degradation.

Potential solution: real-time learning of node state transitions. Since the simple message inference quickly becomes ineffective when mounting a longitudinal attack, our network degradation attacks require a more sophisticated attack strategy that accurately infers not only the message types but also the state transitions of each CN node in the system. For example, in IBFT [1], a CN node has at least six states: new round, pre-prepared, prepared, committed, final committed, and round change. In fact, reading the source code of IBFT in Klaytn, we learn that there exist a few more states that are created to specify the locks in each node.

A CN node experiences several state changes in a single round, and what we wish to achieve is to learn how each CN node changes its state as time progresses within a round. With this learning, our network adversary can accurately identify the message types of packets sent by each CN node and make a clear decision whether she wants to delay or drop any particular packet she observes. The desired learning process should be fast enough so that the adversary can infer the node’s state before it delivers another message and changes to another state. We envision that a machine learning based approach would be promising for this task, especially considering the complexity of the state transition of the current IBFT-based Klaytn implementation.

1.3 Remaining tasks

  • Complete the message delay/drop capability of network degradation attacks
  • Sketch the learning approach for node state transition inference

2 Investigating the Idea of Consensus Offloading

We continue to investigate the idea of consensus offloading and share our recent progress in this section. We first discuss our main motivation for the idea of consensus offloading by reviewing NOPaxos [2]. We then review one particularly related structure of consensus algorithms found in Hyperledger Fabric. Last, we review and summarize existing literature on cryptographic operations utilizing P4 switching fabrics, and discuss how P4 switches can potentially be used to offload the burdens of consensus algorithms. We close this section by summarizing a few tasks for the next milestone.

2.1 Motivation for consensus offloading

Server failures due to hardware/software errors are common in data centers. Typically, in data centers, a distributed server platform is designed and implemented to operate a state-machine replication algorithm. Paxos [3] is one of the earliest state machine replication algorithms and now widely used in data centers for their reliable operation in the face of failures. Two main disadvantages are throughput bottleneck at leader nodes and increased latency. Paxos has two phases, i.e., prepare and commit phases, which increases the end-to-end latency for each consensus round. And each phase, a server that takes the role of a leader node experiences significant throughput bottleneck since it handles all the messages for other replicas.

A recent proposal, called NOPaxos [2], aims to greatly simplify the consensus complexity of Paxos with the help of simple processing at the network fabric that interconnects multiple servers that participate in the Paxos consensus. In NOPaxos, the network layer provides a new network primitive called ordered un-reliable multicast (OUM). The OUM primitive is, in fact, a very simple network operation; that is, a single node, called sequencer, stamps all the messages it sees with the increasing index and sends them to the servers. This simple indexing operation is easy to implement in the existing and emerging network equipment, such as software-defined networking and programmable switches, and this simplifies the Paxos algorithm.

Paxos, as a state-machine replication algorithm, offers two consensus properties: (1) ordering–all replicas process requests in the same order; and (2) reliable delivery–every request from a client is either processed by all replicas or none. NOPaxos, with the OUM network primitive, aims to move the burden for the first property of Paxos from the application layer to the network layer. To that end, NOPaxos implements a new application layer for its consensus algorithm that achieves consensus with a single phase message exchange, which is the theoretical minimum for consensus, in normal cases. This is made possible because the new network entity, sequencer, ensures the ordering of messages and thus the participating servers only need to guarantee whether all messages are processed by all replicas or none. Ensuring all-or-nothing in a distributed environment with crash failures is a trivial task. Following NOPaxos, several similar ideas have been proposed, including NetPaxos [4], p4xos [5].

Designing a similar consensus offloading for BFT algorithms? One interesting question remains whether we can expect any benefits from the idea of consensus offloading (like NOPaxos enjoys with the new ordering primitive) for consensus algorithms that are designed to tolerate Byzantine faults. A naive approach we take first is to make few assumptions for our consensus algorithm: (1) a sequencer node is at the center of all transactions (i.e., all transactions must traverse the sequencer); (2) a sequencer node is trustworthy (i.e., adversaries cannot compromise this node); and (3) all encryption and authentication schemes are modified so that a sequencer node can read and modify transactions as needed. With these assumptions, we plan to design a simplified consensus algorithm and later remove these assumptions to further develop a more practical BFT consensus algorithm.

2.2 Orderers in Hyperledger Fabric

We investigate one interesting structure among existing BFT consensus systems; that is, the ordering service in Hyperledger Fabric. Hyperledger Fabric is designed for permissioned blockchains. One special set of nodes are ordering nodes or orderers and they are responsible for the ordering of transactions to the system.

In Hyperledger Fabric, all transactions must first be sent to one or more of the endorsing peers that checks the validity of the transactions. Then, the transactions are sent to one or more of the orderers. The orderers execute a consensus algorithm, such as Raft, to construct a block that contains a sequence of transactions that have been validated. The new block then is sent to a leader node, which distributes the block to other peers via a gossip protocol.

The Hyperledger Fabric documentation [6] states that there are no clear constraints for the number of orderers. Even a single orderer can operate a blockchain. The consensus between orderers is achieved with consensus algorithms that only tolerate crash failures, like Raft, but not necessarily Byzantine failures. For implementing BFT consensus with Hyperledger Fabric, we believe that the consensus among the orderers should be guaranteed by a BFT consensus algorithm. And this architecture does not, in fact, reduces the complexity of BFT algorithms but only to hand over the consensus complexity to the set of orderer nodes, which are also susceptible to Byzantine attacks.

Therefore, in summary, we conclude that the ordering service in Hyperledger Fabric is merely another form of traditional BFT algorithms that focus on a small group of special nodes. This structure is indeed far from the idea of consensus offloading to the network layer we have envisioned.

2.3 Cryptographic operations in P4 switching fabrics

One important requirement for designing and implementing ordering services in the network layer is to implement necessary cryptographic operations inside the switching fabric. Programmable switches, such as P4 switches, do have strong programming capability that process packets at line rates. However, it is still an active area of research on what cryptographic capabilities can be reliably implemented in P4 switches and applied to packets at line rates.

We survey a few recent academic literature to learn that it is indeed feasible to implement encryption and authentication schemes with P4 switches after some modifications to existing schemes. The key is in the design of cryptographic hash function in the limited P4 hardware. Datta et al. [7] show that a specific hash function called SipHash can be implemented with P4 switches.

SipHash is an XOR-based efficient cryptographic hashing algorithm [8]. SipHash has several benefits. First, SipHash is designed and evaluated to be a cryptographically strong PRF (pseudorandom function), i.e., indistinguishable from a uniform random function. This implies that it can be used as a safe replacement of block ciphers. SipHash is also much faster for short inputs than existing strong MACs (and PRFs), and is even comparable in speed to popular non-cryptographic hash functions. SipHash iterates a simple round function consisting of four additions, four xors, and six rotations, interleaved with xors of message blocks, thus guaranteeing fast operations. SipHash states consist of four 64-bit variables. This small state size allows SipHash to perform well on a wide range of CPUs and to fit into small hardware.

In a project called SPINE [7], SipHash is used to encrypt outgoing packets at one switch and decrypt incoming packets at another switch. To be more specific, a one-time pad generated using SipHash is used to encrypt IPv4 packet headers. Since SipHash is a PRF, the generated pads are random (uniformly distributed in the set of all possible keys and independent of the plaintext), satisfying the requirements for secure encryption. Keys are shared in advance, and periodically rotated by a trusted centralized controller. This work clearly shows that it is feasible to implement symmetric encryption/decryption with SipHash in P4, and still process packets at switch hardware rate. In a similar way, message authentication codes (MAC) may be implementable with SipHash in P4 switches.

That being said, there are some potential obstacles for implementing general encryption and MAC functions with SipHash. First, SipHash processes the message in 64-bit words, but the P4 specification does not allow loop statements. However, for messages with variable length, we may need to apply XOR with the SipHash outputs repeatedly to the plaintext. This remains an open challenge. More on current support of hashing algorithms on p4 targets are found in [9].

2.4 Remaining Tasks

  • Prototype a consensus offloading with a hypothetical sequencer node
  • General encryption and authentication operations with P4 switches


[1] Istanbul Byzantine Fault Tolerance · Issue #650 · ethereum/EIPs · GitHub

[2] Li, Jialin, et al. “Just say NO to paxos overhead: Replacing consensus with network ordering.” 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016.

[3] Lamport, Leslie. “Paxos made simple.” ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) (2001): 51-58.

[4] Dang, Huynh Tu, et al. “Netpaxos: Consensus at network speed.” Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research. 2015.

[5] Dang, Huynh Tu, et al. “P4xos: Consensus as a network service.” IEEE/ACM Transactions on Networking 28.4 (2020): 1726-1738.

[6] A Blockchain Platform for the Enterprise — hyperledger-fabricdocs main documentation

[7] Datta, Trisha, et al. “SPINE: Surveillance protection in the network elements.” 9th USENIX Workshop on Free and Open Communications on the Internet (FOCI 19). 2019.


[9] Scholz, Dominik, et al. “Cryptographic hashing in p4 data planes.” 2019 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS). IEEE, 2019.

1 Like