“This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”); clocks and timing issues (“Unreliable Clocks”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”).”
Faults and Partial Failures
Distributed systems are unique in that they can have partial failures: some parts of the system work fine while others are broken
These failures are typically nondeterministic; they happen at unpredictable times
This is what makes distributed systems difficult
Cloud Computing and Supercomputing
There are several philosophies on how to build large-scale computing systems:
- High performance computing (HPC). Supercomputers with thousands of CPUs are typically used in the scientific community.
- Cloud computing
- Not well defined, but often associated with multi-tenant datacenters with elastic on demand resource allocation
- Traditional enterprise datacenters lie somewhere in between
Each of these philosophies handles faults differently.
- A supercomputer is more like a single-node computer: a partial failure escalates into a total failure, and the job is restarted from a checkpoint
This book focuses on internet services -> they need to be online at all times, with low latency
Unreliable Networks
The distributed systems in this book are shared-nothing systems: a bunch of machines connected by a network. The network is the only way those machines can communicate (they don't share disks)
Networks are not reliable
- Requests can get lost (someone unplugs a cable)
- Request may wait in a queue and be delivered later
- Remote node can fail (Crash or powered down)
- Remote node stops responding (e.g. a very long garbage collection pause)
- Remote node may have processed the request, but the response has been delayed or lost
The sender can't even tell whether the request was delivered. The receiver may also send a response that gets lost
The usual way of handling this is a timeout, but even after a timeout you still don't know whether the receiver got the request.
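A minimal sketch of that ambiguity (just an illustration, not from the book): a timed-out request may have been lost, still be queued, or have been processed with its response lost, so retrying can execute the operation twice.

```python
import socket

def call_with_retries(host, port, payload, timeout_s=1.0, retries=3):
    # Hypothetical request/response over a raw socket, purely for illustration.
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s) as conn:
                conn.settimeout(timeout_s)
                conn.sendall(payload)
                return conn.recv(4096)   # success: we got a response back
        except socket.timeout:
            # Ambiguous outcome: the request may be lost, still queued,
            # or processed with the response lost. Retrying may run it twice.
            continue
    raise TimeoutError("no response after %d attempts" % retries)
```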
Network Faults in Practice
One study found that a medium-sized datacenter had about 12 network faults per month. Human error (e.g. misconfigured switches) is a major cause of outages, so adding redundancy doesn't help as much as you'd hope
Public cloud services like EC2 are notorious for frequent transient network glitches. Sharks can even bite and damage undersea cables
It's important that the software can handle these faults, because they can always occur
Detecting faults
Many systems need to automatically detect faulty nodes so they can be cleaned up
- Load balancer needs to stop sending requests to a dead node
- In single-leader replication, if the leader dies, a follower needs to be promoted
But network uncertainty means it’s hard to tell if a node is dead
- If the process crashed but the node's OS is still running, a script can notify the other nodes about the crash so another node can take over quickly
- Management interface of network switches can be queried to detect link failures at hardware level
- Router can say ICMP Destination Unreachable
- Kleppmann also mentions that you can retry a few times and eventually declare the node dead
- Xu always just did heartbeats
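As a rough illustration of the heartbeat idea (the timeout value and helper names below are made up, not from either book): the monitor only records when it last heard from each node, and a silent node is merely *suspected* dead, since it may just be slow or cut off.

```python
import time

HEARTBEAT_TIMEOUT_S = 10.0   # example threshold, not a recommended value

last_heartbeat = {}           # node_id -> monotonic time of last heartbeat

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def suspected_dead(node_id):
    # A node that hasn't reported within the timeout is only *suspected* dead:
    # it may be paused, overloaded, or unreachable from us but otherwise fine.
    last = last_heartbeat.get(node_id)
    return last is None or (time.monotonic() - last) > HEARTBEAT_TIMEOUT_S
```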
Timeouts and Unbounded Delays
There's no single right answer for how long a timeout should be; long and short timeouts each have trade-offs
- A long timeout means a long wait until a node is declared dead; users may have to wait or see error messages in the meantime
- Short timeout means we can incorrectly declare a node as dead when it only had a temporary slowdown
- This is especially bad if the node was in the middle of some kind of work
- Or other nodes are already busy with heavy load; prematurely declaring nodes dead shifts their work onto others and can make the overload worse
Asynchronous networks have unbounded delays: there is no upper limit on how long a packet may take to arrive
Network Congestion and queueing
The variability of packet delays mostly comes down to traffic and queueing, similar to driving on a congested highway
Data packets must wait if several nodes are sending to the same destination
If all CPU cores on the receiving node are busy, the operating system queues the incoming request until the application is ready to handle it
Virtual machines can sporadically pause too (while the hypervisor runs another VM)
TCP performs flow control, in which a node limits its own rate of sending to avoid overloading the network link or the receiver
Systems can use a variable timeout instead of a constant one
- They can measure their response times and automatically adjust timeouts based on the data
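A sketch of that idea (roughly the mechanism behind TCP's retransmission timeout and the Phi Accrual failure detector; the smoothing weights below are conventional TCP values, not something this chapter prescribes):

```python
class AdaptiveTimeout:
    """Derive the timeout from observed response times instead of a constant."""

    def __init__(self, initial_s=1.0):
        self.srtt = initial_s        # smoothed response time
        self.rttvar = initial_s / 2  # smoothed variability

    def observe(self, sample_s):
        # Exponentially weighted moving averages of the response time and its
        # deviation (0.875/0.125 and 0.75/0.25 are the classic TCP weights).
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample_s)
        self.srtt = 0.875 * self.srtt + 0.125 * sample_s

    def timeout(self):
        # Allow headroom proportional to how jittery responses have been.
        return self.srtt + 4 * self.rttvar
```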
Synchronous Versus Asynchronous Networks
Why can’t we just solve this at the hardware level and make network reliable so that software doesn’t need to worry about it?
Imagine if it could be like phone calls on a fixed telephone network (no VoIP, non-cellular)
A phone call establishes a circuit: a fixed, guaranteed amount of bandwidth is allocated along the entire route for the duration of the call. This is a synchronous network
Can we not make network delays predictable?
Not gonna write this out, but TL;DR -> not really: datacenter networks use packet switching because it handles bursty traffic efficiently, and the cost is variable queueing delays
Unreliable Clocks
In a distributed system, each node has its own clock and its own notion of the current time
Time can be important
- When has request timed out
- What is the 99th percentile response time of this service
- How long did the user spend on the site
- Ticketmaster problem
- etc.
Time is hard because communication is not instantaneous: network delays mean you don't know how much time passed between when something happened and when you observe it. Each machine on the network also has its own clock, usually a quartz crystal oscillator, and these are not perfectly accurate.
Monotonic Versus Time-of-Day Clocks
Computers have two types of clocks. NTP -> Network Time Protocol, used to synchronize a machine's clock against a group of time servers
Time of Day
- Synced to NTP and based on the Gregorian calendar; may jump backwards in time if the local clock is ahead of NTP
Monotonic Clocks
- Good for measuring duration
- The difference between two readings tells you how much time elapsed
- The absolute value is meaningless and can't be compared across machines (or even across CPUs with separate timers)
- Usually fine to use for measuring timeouts in distributed systems
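In Python terms (a small sketch, not from the book): time.time() is the time-of-day clock and time.monotonic() is the monotonic clock, and only the latter is safe for measuring elapsed time.

```python
import time

def elapsed_with_wall_clock(fn):
    # Time-of-day clock: can jump if NTP steps the clock, so the result
    # may be negative or wildly wrong.
    start = time.time()
    fn()
    return time.time() - start

def elapsed_with_monotonic(fn):
    # Monotonic clock: never goes backwards, safe for durations and timeouts.
    start = time.monotonic()
    fn()
    return time.monotonic() - start
```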
Clock Synchronization and Accuracy
Clock drift is a thing in quartz clocks.
- Google assumes about 200 ppm of clock drift for its servers, i.e. roughly 6 ms of drift if the clock is resynchronized every 30 seconds (200 ppm × 30 s ≈ 6 ms)
- About 17 seconds of drift for a clock resynchronized only once a day (200 ppm × 86,400 s ≈ 17 s)
- Leap Seconds are a thing and can crash entire systems
- node can be firewalled off from NTP by accident
- Variable network delay limits how accurately NTP can synchronize clocks
The MiFID II draft European regulation for high-frequency trading requires clocks synchronized to within 100 microseconds of UTC, to help detect market manipulation and investigate flash crashes
- They achieve this using GPS receivers, Precision Time Protocol, and careful deployment and monitoring
- Requires a lot of effort and expertise, plenty of ways to go wrong
Relying on Synchronized Clocks
Problem is that time is insanely complicated
Another problem is that incorrect clocks easily go unnoticed: most things work fine if a clock is slightly off, but it keeps drifting until it quietly causes problems
If you have software that requires synchronized clocks, it's crucial to monitor the clock offsets between all machines.
If a node's clock drifts too far from the others, it should be declared dead and removed from the cluster.
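A tiny sketch of that monitoring step (the offset threshold and the idea of a pre-measured offset map are assumptions here, not from the book): collect each node's measured offset from a reference time source and flag the ones that have drifted too far.

```python
MAX_OFFSET_S = 0.05   # example threshold; pick based on what correctness requires

def nodes_to_evict(offsets_by_node):
    """offsets_by_node: node_id -> measured offset from a reference clock, in seconds.

    Returns the nodes whose clocks have drifted beyond the allowed offset and
    should be treated as dead until their clock is fixed.
    """
    return [node for node, offset in offsets_by_node.items()
            if abs(offset) > MAX_OFFSET_S]
```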
Timestamps for ordering events
Let's say we have the Ticketmaster problem and several people buy tickets at roughly the same time. Which write happened first in the distributed system?
Time of day is terrible here
Last Write Wins is a widely used conflict resolution strategy
- Problems still occur
- Database writes can mysteriously disappear (a node with a lagging clock can't overwrite values previously written by a node with a fast clock)
- LWW can’t distinguish between writes that occur in quick succession sequentially
- Two nodes can generate the same timestamp, additional tie breaker value is required
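A minimal sketch of LWW with such a tie-breaker (the tuple layout is just an illustration): compare (timestamp, node_id) pairs so identical timestamps still order deterministically, though the underlying clock-skew problem remains.

```python
def lww_merge(write_a, write_b):
    """Each write is (timestamp, node_id, value); the 'later' one wins.

    The node_id breaks ties between identical timestamps. Note this is still
    LWW: a node with a fast clock can silently clobber causally later writes.
    """
    return max(write_a, write_b, key=lambda w: (w[0], w[1]))

# Example: same timestamp, so the node id decides
print(lww_merge((1700000000.0, "node-b", "x=1"),
                (1700000000.0, "node-a", "x=2")))   # node-b's write wins
```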
Is it possible to make NTP sync accurate enough to prevent incorrect orderings? Probably not; NTP's accuracy is itself limited by network delay
For correct ordering from timestamps, the clock would need to be considerably more accurate than the thing being measured (the network delay)
- Logical clocks are based on incrementing counters rather than an oscillating quartz crystal
- Safer for ordering events
- Don’t measure time of day, just order of events
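For illustration, a minimal Lamport clock (one common kind of logical clock; a standard textbook construction, not something specific to this chapter):

```python
class LamportClock:
    """An incrementing counter that orders events consistently with causality."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Every local event bumps the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # Attach the current counter value to outgoing messages.
        return self.local_event()

    def receive(self, remote_counter):
        # On receipt, take the max of local and remote, then increment,
        # so the receive event is ordered after the send event.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```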
Clock Readings Have a Confidence Interval
A clock reading shows a ton of digits, down to the nanoseconds, but it can be way off
It's better to think of a reading as a range of possible times: e.g. a system may be 95% confident that the current time is between 10.3 and 10.5 seconds past the minute
Synchronized clocks for global snapshots
Snapshot isolation is usually implemented with a monotonically increasing transaction ID, but how do we generate that in a distributed database?
- Pretty sure Xu just gives each node/datacenter its own ID (used in Twitter's Snowflake IDs, I think)
- And then each one has its own incrementing sequence
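A rough sketch of that Snowflake-style scheme (the bit layout below is the commonly described one: 41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence; it's an assumption here, not taken from this chapter):

```python
import time
import threading

class SnowflakeIds:
    EPOCH_MS = 1288834974657   # custom epoch (Twitter's, for illustration only)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024      # 10-bit worker/datacenter id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        # Sketch only: handling of sequence overflow within one millisecond and
        # of the clock jumping backwards is deliberately omitted.
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
            else:
                self.sequence = 0
                self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence
```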
It’s possible to use timestamps from the clocks as transaction IDs
- Need to handle uncertainty though
Spanner uses the clock's confidence interval, exposed by Google's TrueTime API
- Spanner deliberately waits out the length of the confidence interval before committing a write (commit wait), so its timestamp is guaranteed to be in the past by the time the write is visible (see the sketch below)
Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms
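A toy version of that commit-wait idea (truetime_now below is a hypothetical stand-in for the TrueTime API, which returns an earliest/latest interval; this is a paraphrase, not Google's code):

```python
import time

def commit_wait(truetime_now):
    # truetime_now() returns (earliest, latest): bounds on the true current time.
    earliest, latest = truetime_now()
    commit_ts = latest                 # pick the upper bound as the commit timestamp
    time.sleep(latest - earliest)      # wait out the uncertainty interval
    # After the wait, the true time is at least `latest`, so commit_ts is
    # definitely in the past before the write becomes visible.
    return commit_ts
```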
This is an area of active research, but these ideas haven't been implemented in mainstream databases outside of Google
Process Pauses
Let’s consider another dangerous example of clocks in distributed systems
In single-leader replication, how does a leader know that it's still the leader and that it hasn't been declared dead by the others?
One option is for the leader to hold a lease granted by the other nodes. Only one node can hold the lease at a time
- Must periodically renew its lease
- Can renew lease every 10 seconds (But this relies on the clock)
If the node unexpectedly pauses past the lease expiry, the lease runs out without it noticing, and it may keep acting as leader
- This happens often with garbage collection pauses in many programming languages
- Not to mention virtual environments
- User closes laptop and suspends
- OS does context switch to another thread
A node must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function call
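A sketch of the risky lease pattern, modelled loosely on the book's Java example (the lease object's methods below are hypothetical): the lease check and the work it authorizes are separated in time, and a long pause in between breaks the assumption.

```python
import time

def handle_requests(lease, get_incoming_request, process):
    # lease.expiry_time(), lease.renew(), lease.is_valid() are assumed helpers.
    while True:
        request = get_incoming_request()
        # Renew if fewer than 10 seconds remain on the lease.
        if lease.expiry_time() - time.time() < 10.0:
            lease = lease.renew()
        if lease.is_valid():
            # DANGER: if the thread pauses here (GC, VM suspend, etc.) for
            # longer than the remaining lease, another node may already have
            # taken over leadership while we still process the request.
            process(request)
```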
Response time guarantees
You don't want a pause (e.g. garbage collection) to delay the airbag release system during a car crash
There are ways to guarantee hard real-time response (real-time operating systems, careful worst-case timing analysis), but it's a lot more work
You can also limit the impact of Garbage Collection
Knowledge, Truth, and Lies
What do we even know about Distributed Systems then?
Instead, we can state our assumptions about the system (a system model) and design software that behaves correctly under those assumptions
Truth Is Defined by the Majority
A node cannot trust its own judgment. Instead, decisions rely on a quorum: the nodes vote, and the majority's verdict wins
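A trivially small sketch of the majority rule (just an illustration): a decision such as "node X is dead" only takes effect once more than half of the cluster agrees, so no single node's possibly-faulty view decides on its own.

```python
def quorum_reached(votes_in_favor, cluster_size):
    # A strict majority: more than half of all nodes must agree.
    return votes_in_favor > cluster_size // 2

print(quorum_reached(3, 5))   # True: 3 of 5 is a majority
print(quorum_reached(2, 4))   # False: need more than half
```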
Leader and the lock
skipping…
Fencing Tokens
skipping…
Byzantine Faults
skipping…