Designing Data Intensive Applications - Chapter 9 Consistency and Consensus

Last chapter covered all things that can go wrong in distributed systems, this chapter will go over algorithms that can fix those problems

Best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.

Consensus, getting all nodes to agree on something, is one of the most important abstractions

First we’ll explore the range of guarantees and abstractions that can be provided in a distributed system.

Need to understand what can be done and what can’t be done

Researchers in this field have been studying these topics for decades, so this is really just the surface. There are literature references if you’d like to read more

Consistency Guarantees

Two nodes will usually always have different states, because writes requests take time to travel to them.

Most replicated databases provide at least eventual consistency, which means that if you stop writing to the database and wait for some (unspecified) length of time, eventually all read requests will return the same value

The inconsistency is temporary

Kleppman says a better name for this should be convergence, as we expect all replicas to eventually converge to the same value

Very weak guarantee, doesn’t say anything as to when they’ll converge. This is very hard for application developers, it’s not consistent on reads of values

This chapter will go into stronger consistency models that data systems use


Also called, strong consistency or atomic consistency

Gives the illusion that there’s only one state of data

Need to make just the most recent write doesn’t come from a stale cache or replica

What makes a System Linearizable

After a write occurs, all reads will show the same value after some time (And before)

Lines should always move forward in time.

![[Screenshot 2024-05-07 at 9.45.44 AM.png]]

Linearizability vs. Serializability

Both seem like they say “Can be arrange in sequential order” but there’s differences

Serializability is isolation property of transactions.

Linearizability is a guarantee on reads and writes of a register / individual objects

Database may provide both serializability and linearizability. This is known as strict serializability or strong one-copy serializability (strong 1SR)

Serializable snapshot isolation is not linearizable

Relying on Linearizability

Where is this useful?

Locking and Leader Selection

Single Leader replication needs to ensure there’s only one leader

Coordination services like ZooKeeper and etcd are often used to implement distributed locks and leader election. They use consensus algorithms to implemement linearizable operations in fault tolerant way.

Many subtle details to implementing locks and leader election, and libraries that help providing higher level recipes on top of ZooKeeper

Distributed Locking is also used at much more granular level in some distributed databases like Oracle Real Application Clusters (RAC)

Constraints and uniqueness guarantees

Username and email must be unique

We need to enforce this constraint amount distributed systems Situation is similar to a lock

Cross-Channel Timing Dependencies

In the inconsistent football score example, it was only a problem because there was another communication channel (Two people sitting in the same room talking about the game)

Similar things can happen for computer systems, like uploading a photo and having a process that resizes the photo

Without linearizability, there can be a race condition where the image resizer is ready to start the image but it doesn’t exist.

Problem arises because there’s two different communication channels, the file storage upload and the message queue sending the request to the resizer.

Linearizability is not the only way to avoid it, but it’s the simplest to understand

Additional communication channels make it more difficult (unless you control them), or use alternative approaches

Implementing Linearizable Systems

Single-Leader Replication (Potentially Linearizable)

Consensus algorithms (linearizable)

Multi-Leader Replication (not linearizable)

Leaderless Replication (probably not)

Linearizability and quorums

They seem to have linearizability in a Dynamo-style model, but when there are variable network delays, it’s possible to have race conditions

It’s possible to just wait even longer for all reads / writes to complete. This works, but costs performance

Cost of Linearizability

In multi-leader database, if a network interruption happens between datacenters, things still work perfectly fine

In single-leader, then the leader must be in the active datacenter. Any writes and any reads must be sent to leader.

Cap Theorem

This is explained in CAP theorem

You sacrifice availability for stronger consistency

The unhelpful CAP theorem

CAP theorem doesn’t mention anything about network delays, dead nodes, or other trade-offs

Linearizability and Network Delays

Very few systems are actually linearizable in practice. RAM on a modern multi-core CPU isn’t.

CPU cores have their own memory and store buffer. It’s faster to just read from that

Most non-linearizable distributed databases do this for performance, not availability

Ordering Guarantees

Ordering has been a recurring theme in this book, which suggests it might be an important fundamental idea.

There is a deep connection between ordering, linearizability, and consensus. Although this idea is more theoretical, Kleppmann says it’s worth understanding what systems can and can’t do

Ordering and Causality

Ordering is important to preserve causality.

We don’t want to be given an answer, when no question was asked

Like in real life, one thing leads to another

If a system obeys ordering imposed by causality, It is called causally consistent

The Causal Order is not Total Order

Total order allows any two elements to be compared. Unlike a mathmatical set where there are just numbers.

Linearizability -> Has total order, system only has one copy of data, and every operation is atomic. This means that any two operations will know what happened first

Causality -> Partial Order

Linearizability is stronger than casual consistency

Linearizability implies casualty any system that is linearizable will preserve causality correctly.

This is great and simple, but of course, comes at the cost of performance and availability. Some distributed systems abandon it for this reason.

Good news is that a middle ground is possible

Casual Consistency is the strongest possible consistency model that does not slow down due to network delays and remains available in the face of network failures. It also doesn’t reduce perfomance as much as strong consistency

In many cases, systems that seem to require strong consistency, really only need casual consistency

Based on this observation, researchers are exploring new kinds of databases that preserve casuality, with performance and availability characteristics similar to those of eventually consistent systems

Research is still recent. Not much there

Capturing casual dependencies

Won’t talk about how non-linearizable systems can maintain causal consistency, but just talk about the key ideas

To maintain causality, you need to know which operation happened before which other operation. This is a partial order.

concurrent operations may be processed in any order, but if one happens before another, then they must be processed in that order (in every replica)

replica must be ensured that its processing in the proper order

Need a way of describing knowledge of a node in the system. If node 1 has already seen value X when it issued the write Y, then X and Y may be casually related.

Similar to detecting concurrent writes section, where in a leaderless datastore, we need to detect concurrent writes to same key in order to prevent lost updates.

Database needs to know which version of the data was read by application. The version number from prior operation is passed back to the database on a write.

Sequence Number Ordering

Causality is an important theoretical concept, but keeping track of all casual dependencies can impractical.

There is a better way

Sequence numbers or timestamps to order events

In single-leader replication, it can just give each operation a counter number. The follower will always know the state of the main node.

Non-causal Sequence Number Generators

it’s less clear how to generate sequence numbers for operations for leaderless or multi-leader databases.

Various methods are

These three options perform better and are more scalable than pushing all operations through a single leader that increments a counter. They all have a problem though.

The numbers they generate are not consistent with causality

Causality problems occur when number generators do not correct capture ordering of operations across nodes.

Lamport Timestamps

Solution is Lamport timestamp. One of the most cited papers in field of distributed systems

Each node is given unique identifier. Each node keeps counter of the number of operations processed.

Lamport timestamp is the pair (counter, NodeID)

Timestamp order is not sufficient

If we want uniqueness of usernames and have two con-concurrent users, problem arises

Need Total Order Broadcast

Total Order Broadcast

Kleppmann talked about how we can use sequence numbers, but those have uniqueness constraint problem.

Single-Leader replication works, but it doesn’t scale. How can we scale for larger throughput? This idea is called total order broadcast or atomic broadcast in Distributed System literature

Usually described as protocol for exchanging messages between nodes.

Using total order broadcast

Coordination services like ZooKeeper and etcd actually implement this. There is a strong correlation between total order broadcast and consensus

Can do a lot with total order broadcast

Total Order Broadcast creates a log and delivering message is like appending to the log. All nodes are able to read the log

Also useful in implementing a lock service that requires fence tokens

Implementing Linearizable storage using total order broadcast

If you have total order broadcast you can implement linearizable storage, let’s say usernames for example

Use an append-only log

To implement this

Distributed Transactions and Consensus

Consensus is one of the most important and fundamental problems in distributed computing

One surface seems simple -> Just get all nodes to agree on something. Unfortunately many broken systems have been built in the mistaken belief that this problem is easy to solve.

Although consensus is important, Kleppmann put it at the end because the topic is very subtle and requires prerequisite knowledge. Now that he’s gone over replication, transactions, and system models, it’s time to go over the consensus problem.

Situations where it’s important for nodes to agree

Impossibility of Consensus

FLP result -> Proves their no algorithm that is always able to reach consensus if there is a risk that a node may crash

Distributed systems can usually achieve consensus in practice. FLP is a good theoretical problem though

Back to chapter

In the following chapters, Kleppmann will go over atomic commit problem in more detail. We will discuss the two-phase-commit(2PC). The most common way of solving atomic commit problem.

By learning from 2PC we will work our way toward better consensus algorithms used in ZooKeeper(Zab) and etcd(Raft)

Atomic Commit and Two Phase Commit (2PC)

In Chapter 7, we learned that the purpose of transaction atomicity is to provide simple semantics in the case where something goes wrong in the middle of making several writes. Outcome is either successful commit or an abort to roll back.

Atomicity prevents failed transactions filling the DB with half-finished results / state

From Single Node to Distributed Atomic Commit

Single Database node

What if we try this on multiple nodes

Introduction to Two-Phase Commit

Algorithm for achieving atomic transaction commit across multiple nodes

2PC and 2PL are completely different things

2PC uses a coordinator (Transaction manager). Coordinator is implemented as a library with same application process that is requesting the transaction (e.g embedded in a Java EE container)

2PC transaction beings with application reading and writing data on multiple nodes as normal. Database nodes are called participants in transaction.

When application is ready to commit, Coordinator begins phase 1: it sends a prepare request to each of the nodes asking them whether they are able to commit.

Coordinator tracks responses from participants.

Process is similar to a traditional marriage ceremony in Western cultures.

A System of Promises

Going more into detail

When application wants to being distributed transaction, it requests transaction ID. It’s globally unique.

Application begins single node transaction on each of the participants and attaches the unique ID to the transaction. All reads and writes are done, if anything goes wrong the coordinator will abort

When application is ready to commit, coordinator sends out prepare request, if someone can’t, it aborts all of them

When participant receives request, it makes sure that it can definitely commit the transaction under all circumstances. Replying yes means it promises it will happen.

When coordinator receives all responses, it will write the decision on disk so it has a log in case of a crash

Once coordinator’s decision has been written to disk, commit or abort request is sent to all participants. If the request fails or times out, the coordinator must retry forever until it succeeds. There is no more going back if the decision was to commit.

Protocol contains two crucial points of no return.

Coordinator Failure

What happens if the coordinator crashes?

If coordinator fails before sending the prepare requests, a participant can safely abort the transaction. Once participant has received a prepare request and voted “yes”, it can no longer abort. It must wait to hear back from coordination

Timeouts don’t work because node1 and abort but it can be inconsistent with node2 (???)

Only way is to wait for coordinator to recover

Three Phase Commit

Two Phase commit is called a blocking atomic commit protocol due to the fact that 2PC can become stuck waiting for coordinator recovery.

An alternative to two phase commit has been proposed, Three Phase Commit (3PC)

Guarantees atomicity with bounded delay and nodes with bounded response times

Nonblocking atomic commit requires a perfect failure detector

2PC continues to be used for this reason

Distributed Transactions in Practice

Distributed transactions, especially 2PC ones, have a mixed reputation

Distributed transactions in MySQL are reported to be 10x slower than single-node transactions

Instead of auto dismissing distributed transactions, let’s examine them in more detail

There’s important lessons to be learned in them

Two Quite different types of Distributed Transactions are often conflated:

Exactly-Once Message Processing

Heterogenoeous distributed transactions allow diverse systems to be integrated in powerful ways. Imagine a message queue that can only send a message if a DB transaction for processing the message was successful.

We can guarantee exactly once message processing because it will either commit or fail

Systems must use the same atomic commit protocol. Probably would need to support 2PC

Kleppmann will return to this in Chapter 11

XA Transactions

Standard way to implement 2PC across different technologies

Not a network protocol, a C API for interfacing with a transaction coordinator

Check if drivers support XA. Load library into same process

If app crashes, coordinator goes with it Database drivers will ask for the coordinator

Holding locks while in doubt

Being in doubt is bad because a participant can hold onto a lock and block everyone else

If Coordinators log is lost, those rows will be locked and held forever

Recovering from Coordinator Failure

Rebooting the DB won’t fix this problem

Need to have an admin manually do it

XA also has a heursitic decisions tool but use it sparingly to not violate atomicity

Limitations of Distributed Transactions

Coordinator needs to not be run on a single machine (single point of failure for entire system)

Coordinator’s logs are part of server state. Servers are no longer stateless

XA needs compatibility

Fault Tolerant Consensus

Improving this model

VSR, Paxos, Raft, and Zab implement a different way to avoid these problems above

Decide on sequence of values


Limitations of Consensus

Rely on timeouts to detect failed nodes Require strict majority to operate Assume a fixed set of nodes that participate in voting Sometimes particularly sensitive to network problems

Membership and Coordination Services

Hadoop, Nova, and Kafka all rely on ZooKeeper running in the background. What is that these projects get from it?

Strongly consistency Total Ordering of Operations Failure detection Change notifications

Allocating work to nodes

Service Discovery

Finding out what IP address to connect to in order to reach a service

Probably doesn’t require consensus. Leader election does though

Membership SErvices

All of this comes from research into Membership services in the 1980s which was needed for air traffic control