Designing Data Intensive Applications - Chapter 8 Trouble With Distributed Systems

“This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”); clocks and timing issues (“Unreliable Clocks”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”).”

Faults and Partial Failures

Distributed systems are unique in that they can have partial failures. Some parts of system work great, others are down

These failures are typically nondeterministic, they happen at random times

This is what makes distributed systems difficult

Cloud Computing and Super computing

There are several philosophies on how to build large-scale computing systems:

Each of these have their own problems with handling faults.

This book focuses on internet services -> They need to always be online with low latency

Unreliable Networks

The Distributed systems in this book are shared-nothing systems, a bunch of systems connected by a network. The network is the only way those machines can communicate (They don’t share disks)

Networks are not reliable

The sender can’t even tell whether the packet was delievered. Sender can send a response that gets lost

Usual way of handling this is through timeouts, but you still don’t know if the receiver got it.

Network Faults in Practice

One study found a medium sized datacenter found about 12 network faults per month Human error is a major source of outages (misconfigured switches) so redundancy doesn’t help

Public services like EC2 are notorious for having frequent transient network glitches Sharks can bite undersea cables and damage them

It’s important that the software can handle these because they can always occur

Detecting faults

Many systems need to automatically detect faulty nodes for clean up

But network uncertainty means it’s hard to tell if a node is dead

Timeouts and Unbounded Delays

There’s no strong number on when to set timeout. They have pros and cons

Network Congestion and queueing

Variability of packet delays comes down to traffic and queuing , similar to driving on a highway in your car

Data packets must wait if several nodes are sending to the same destination

If the receiving node has a full CPU, it’ll also queue the task

Virtual Machines can sporadically pause too TCP performs flow control in which a node will limit it’s own packets being sent to avoid overloading the network link

Systems can have variable timeout setting instead of a constant

Synchronous Versus Asynchronous Networks

Why can’t we just solve this at the hardware level and make network reliable so that software doesn’t need to worry about it?

Imagine if it could be like phone calls on a fixed telephone network (no VoIP, non-cellular)

Phones use circuits that set a fixed amount of data to be sent. This is synchronous

Can we not make network delays predictable

Not gonna write this but TLDR -> Not really

Unreliable Clocks

On distributed systems, each node can have their own clock time

Time can be important

Time is hard because communication is not instant Delays due to network. Each machine on the network has it’s on clock, usually a quartz crystal oscillator.

Monotonic Versus Time-of-Day Clocks

Computers have two types of clocks NTP -> Network Time Protocol

Time of Day

Monotonic Clocks

Clock Synchronization and Accuracy

Clock drift is a thing in quartz clocks.

MiFID II draft European Regulation for high frequency trading requires insane levels of sync’d clocks (100 ms) to help detect market manipulation and flash crash

Relying on Synchronized Clocks

Problem is that time is insanely complicated

Another problem is that incorrect clocks easily go unnoticed. Most things work fine if a clock is slightly off, but it’ll continue to drift and then cause a problem

If you have software that requires synchronized clocks, it’s crucial to monitior clock offsets between all machines.

If it’s too far drifted, need to declare machine as dead and remove it.

Timestamps for ordering events

Let’s say we have the ticket master problem and several people buy tickets. Who gets to go first in the distributed system?

Time of day is terrible here

Last Write Wins is a widely used conflict resolution strategy

Is it possible to make NTP synch accurate enough to prevent incorrect orderings? Probably not

For correct ordering, you need a more accurate clock source.

Clock Readings have confidence interval

A clock reading has a ton of digits up to the nanoseconds, but it can be way off

It’s more like a range of possible times. It can be 95% accurate that it’s within 10.3 and 10.5 seconds

Synchronized clocks for global snapshots

Snapshot isolation is done by incrementing transaction ID, but what do we do in a distributed environment

It’s possible to use timestamps from the clocks as transaction IDs

Spanner uses clock’s confidence interval based on TrueTime API

This is an area of active research but haven’t been implemented outside of Google

Process Pauses

Let’s consider another dangerous example of clocks in distributed systems

In leader follower, how does a Leader know that it’s still a leader? That it hasn’t been declared dead by the others

One option is for leader to get lease from other nodes. Only one node can hold the lease

If the node unexpectedly pauses for 10 seconds, then it loses it’s lease

Node must assume that execution can be paused for a significant length of time at any point even in the middle of a function call

Response time guarantees

You don’t want to have a pause occur during a car crash to drop out the air bags

There are ways to guarantee real-time for hardware. It’s a lot more work though

You can also limit the impact of Garbage Collection

Knowledge, Truth, Lies

What do we even know about Distributed Systems then?

We can just make assumptions about a system and state the behavior that will happen

Truth Is Defined by the Majority

A node cannot trust it’s own judgement Rely on quorum, voting to make decisions

Leader and the lock


Fencing Tokens


Byzantine Faults