Overview

A consensus protocol ensures that every transaction is replicated and recorded in all the machines in the network in the same order.

In this section we will look at a special category of consensus protocols, proof-of-stake protocols, and describe Kowala’s approach to the problem and the main properties of the project elements related to the consensus protocol. This section is heavily based on the work of Ethan Buchman as well as on other resources provided by the Tendermint/Cosmos team and resources provided by the Ethereum project.

The Kowala Protocol has its own implementation of the Tendermint protocol. Tendermint is a weakly synchronous, Byzantine fault tolerant, state machine replication protocol, with optimal Byzantine fault tolerance and additional accountability guarantees in the event the BFT assumptions are violated. There are varying ways to implement Proof-of-Stake algorithms, but the two major tenets in Proof-of-Stake design are chain-based PoS and Byzantine Fault Tolerant-based PoS. The Kowala Protocol implements a hybrid of both - strictly choosing consistency over availability. Some of the main properties of the The Kowala Protocol project are:

  • The codebase is based on Ethereum’s go client
  • Fast-finality (1 second confirmations) - We believe that fast confirmations will be essential for mass adoption.
  • On-chain dynamic validator set management (registry/in-protocol penalties) via genesis smart contracts.


Pitfalls

Nothing-at-stake problem

Validators can effectively break safety by voting for multiple conflicting blocks at a given block height without incurring cost for doing so. Native PoS implementations are vulnerable to these attacks. Catastrophically, since there’s no incentive to ever converge on a unique chain and every incentive to sign duplicitously on multiple conflicting chains at once, the economically optimal strategy becomes voting on as many forks as you can find in order to reap more block rewards. Below is a diagram that demonstrates this:

Expected value for voting on competing chains is greater than expected value for voting on a single chain in naive PoS design.

In Proof-of-Work, the “penalty” for mining on multiple chains is that a miner must split up their physical hashing power (scarce resource) to do this. This attack vector is mitigated by making sure that a validator candidate pays a deposit up-front and with with in-protocol penalties. Even with penalties there’s the possibility of having consensus agents that tries to break safety - byzantine validators.

Long Range Attacks

The long range attack draws from the right that users have to withdraw their security deposits. A fundamental problem arises with this because it means an attacker can build up a fork from an arbitrarily long range without fear of being slashed. Once security deposits are unbonded, the incentive not to vote on a long-range fork from some block height ago is removed. In other words, when more than ⅔ of the validators have unbonded, they could maliciously create a second chain which included the past validator set, which could result in arbitrary alternative transactions.

Long range attacks in PoS are rectified under the weak subjectivity model, which requires the following of new nodes which come onto the network:

  • Must be currently bonded.
  • Only trust validating nodes that currently have security deposits.
  • Unbonded deposits must go through a ‘thawing’ period. After unbonding, tokens need time to ‘thaw’ for weeks to months to account for synchrony assumptions (i.e. delayed messages).
  • Forbid reverting back before N blocks, where N is the length of the security deposit. This rule invalidates any long range fork.
  • Optionally store the validator set on a PoW chain.

Tendermint adopt a simple locking mechanism (colloquially called ‘freezing’ in Tendermint) which “locks” stake for a certain period of time (weeks to months of ‘thawing’) in order to prevent any malicious coalition of validators from violating safety.

Cartel formation

A third, final hurdle facing any economic paradigm that’s worth any value faces the very real problem of oligopolies; decentralized protocols with native cryptocurrencies are no exception.

Tendermint relies on extra-protocol governance measures to combat oligopolistic validators. While there are no in-protocol measures for censorship-resistance, the rationale behind relying on out-of-band social information to tackle cartel formation is that the users would eventually and inevitably notice cartels forming, socially gossip about it, and then either abandon or vote to reorganize the blockchain that’s under attack.


Algorithm

  1. A new block is proposed by the correct proposer each round (Tendermint has a new leader for each round) and broadcasted to the other validators. The block includes a batch of transactions from the local transaction pool. If a proposer is byzantine it might broadcast different proposals to different validators.

  2. Two phases of voting called pre-vote and pre-commit. A majority (+23) of pre-commits for the same block at the same round results in a commit. If a validator does not receive a correct proposal within ProposalTimeout, it pre-votes for nil instead

In asynchronous environments with Byzantine validators, a single stage of voting, where each validator casts only one vote, is not sufficient to ensure safety. In essence, because validators can act fraudulently, and because there are no guarantees on message delivery time, a rogue validator can co-ordinate some validators to commit a value while others, having not seen the commit, go to a new round, within which they commit a different value. A single stage of voting allows validators to tell each other what they know about the proposal. But to tolerate Byzantine faults (which amounts, essentially to lies, fraud, deceit, etc.), they must also tell each other what they know about what other validators have professed to know about the proposal. In other words, a second stage ensures that enough validators witnessed the result of the first stage.

When a validator receives a more than two-thirds pre-votes for a single block, it has received a signal that the network is prepared to commit the block, and serves as justification for the validator to sign and broadcast a pre-commit vote for that block. Sometimes, due to network asynchrony, a validator may not receive a polka, or there may not have been one. In that case, the validator is not justified in signing a pre-commit for that block, and must therefore sign and publish a pre-commit vote for nil. That is, it is considered malicious behaviour to sign a pre-commit without justification from a polka.

A pre-commit is a vote to actually commit a block. A pre-commit for nil is a vote to actually move to the next round. If a validator receives more than two-thirds pre-commits for a single block, it commits that block, computes the resulting state, and moves on to round 0 at the next height. If a validator receives more than two-thirds pre-commits for nil, it moves on to the next round. A pre-vote for a block is thus a vote to prepare the network to commit the block. A pre-vote for nil is a vote to prepare the network to move to the next round.

The outcome of a round is either a commit, or a decision to move to the next round. With a new round comes the next proposer.


Network Safety

An accountable Byzantine Fault Tolerant algorithm is one that can identify all Byzantine validators when there is a violation of safety.

Committing and finalizing blocks depends on a supermajority — a >⅔ quorum — of all validators signing off on the proposed block. Note that accountability can only apply when between one-third of two-thirds of validators are Byzantine(malicious). If more than two-thirds are Byzantine, they can completely dominate the protocol, and we have no guarantee that a correct validator will receive any evidence of their misdeeds.

BFT systems can only tolerate up to a ⅓ of failures, where failures can include arbitrary or malicious behaviour:

  • crash fault - if a third or more of validators crash, the network halts, as no validator is able to make progress without hearing from more than two-thirds of the validator set. The network remains available for reads, but no new commits can be made - transactions are not confirmed. As soon as validators come back on-line, they can carry on from where they left in a round. Kowala consensus state-machine employs a write-ahead log, such that a recovered validator can quickly return to the step it was in when it crashed, ensuring it doesn’t accidentally violate a rule. For situations such as this one, the system coming to a halt, a traditional consensus system provides little or no guarantees, and typically requires manual intervention. For instance, increase the number of max validators and add nodes on-the-fly in order to be able to have a >⅔ quorum.

  • In a byzantine fault, it can behave arbitrarily (ex: send different and contradictory messages to different peers). Crash faults are easier to handle, as no process can lie to another process. Byzantine failures are more complicated. In a system of 2f + 1 processes, if f are Byzantine, they can co-ordinate to say arbitrary things to the other f + 1 processes. For instance, suppose we are trying to agree on the value of a single bit, and f = 1, so we have N = 3 processes, A, B, and C, where C is Byzantine. C can tell A that the value is 0 and tell B that it’s 1. If A agrees that its 0, and B agrees that its 1, then they will both think they have a majority and commit, thereby violating the safety condition.

A Byzantine process, C, tells A one thing and B another, causing them to come to different conclusions about the network. Here, simple majority vote results in a violation of safety due to only a single Byzantine process.

Hence, the upper bound on faults tolerated by a Byzantine system is strictly lower than a non-Byzantine one. In fact, it can be shown that the upper limit on f for Byzantine faults is f < N/3. Thus, to tolerate a single Byzantine process, we require at least N = 4. Then the faulty process can’t split the vote the way it was able to when N = 3. Systems which only tolerate crash faults can operate via simple majority rule, and therefore typically tolerate simultaneous failure of up to half of the system. If the number of failures the system can tolerate is f, such systems must have at least 2f + 1 processes.

Given that <⅔ of the validators are byzantine, there are only two ways for violation of safety to occur, and both are accountable:

1. Double Signing

Byzantine proposer makes two conflicting proposals within a round, and byzantine validators vote for both of them. In the event that they are found to double-sign proposals or votes, validators publish evidence of the transgression in the form of a transaction, which the application state can use to change the validator set by removing the transgressor, burning its deposit. This has the effect of associating an explicit economic cost with Byzantine behaviour, and enables one to estimate the cost of violating safety by bribing a third or more of the validators to be Byzantine.

2. Locking rules violation

Byzantine validators violate locking rules after some validators have already committed, causing other validators to commit a different block on a later round.

Other forms of attacks

Note that a consensus protocol may specify more behaviours to be punished than just double signing. In particular, we are interested in punishing any strong signalling behaviour which is unjustified - typically, any reported change in state that is not based on the reported state of others. For instance, in a version of Tendermint where all pre-commits must come with the polka that justifies them, validators may be punished for broadcasting unjustified pre-commits. Note, however, that we cannot just punish for any unexpected behaviour - for instance, a validator proposing when it is not their round to propose may be a basis for optimizations which pre-empt asynchrony or crashed nodes. A detailed plan for such situations will be shared soon.

Following a violation of safety, the delayed delivery of critical messages may make it impossible to determine which validators were Byzantine until some time after the safety violation is detected. In fact, if correct processes can receive evidence of Byzantine behaviour, but fail irreversibly before they are able to gossip it, there may be cases where accountability is permanently compromised, though in practice such situations should be surmountable with advanced backup solutions like a write ahead log/transaction journal.

Crisis recovery

In the event of a crisis, such as a fork in the transaction log, a traditional consensus system provides little or no guarantees, and typically requires manual intervention. Tendermint assures that those responsible for violating safety can be identified, such that any client who can access at least one honest validator can discern with cryptographic certainty who the dishonest validators are, and thereby chose to follow the honest validators onto a new chain with a validator set excluding those who were Byzantine. For instance, suppose a third or more validators violate locking rules, causing two blocks to be committed at height H. The honest validators can determine who double-signed by gossipping all the votes. At this point, they cannot use the consensus protocol, because the basic fault assumptions have been violated. Note that being able to at this point accumulate all votes for H implies strong assumptions about network connectivity and availability during the crisis, which, if it cannot be provided by the p2p network, may require validators use alternative means, such as social media and high availability services, to communicate evidence. A new blockchain can be started by the full set of remaining honest nodes, once at least two-thirds of them have gathered all the evidence. Alternatively, modifying the Tendermint protocol so that pre-commits require polka would ensure that those responsible for the fork could be punished immediately, and would not require an additional publishing period.

There are more scenarios that will be addressed such as permanent crash failures and the compromise of private keys.

Regardless of how crisis recovery proceeds, its success depends on integration with clients. If clients do not accept the new blockchain, the service is effectively offline. Thus, clients must be aware of the rules used by the particular blockchain to recover. In the cases of safety violation described above, they must also gather the evidence, determine which validators to remove, and compute the new state with the remaining validators.