“This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“Unreliable Networks”); clocks and timing issues (“Unreliable Clocks”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a distributed system and how to reason about things that have happened (“Knowledge, Truth, and Lies”).”
Faults and Partial Failures
Distributed systems are unique in that they can have partial failures: some parts of the system work fine while others are broken
These failures are typically nondeterministic; they happen at unpredictable times
This is what makes distributed systems difficult
Cloud Computing and Supercomputing
There are several philosophies on how to build large-scale computing systems:
- High performance computing (HPC). Supercomputers with thousands of CPUs are typically used in the scientific community.
- Cloud computing
- Not well defined, but often associated with multi-tenant datacenters with elastic on demand resource allocation
- Traditional enterprise datacenters lie somewhere in between
Each of these philosophies handles faults differently.
- A supercomputer is more like a single-node computer: a partial failure escalates into a total failure, and the job is restarted from a checkpoint
This book focuses on internet services -> they need to be online at all times, with low latency
Unreliable Networks
The distributed systems in this book are shared-nothing systems: a bunch of machines connected by a network. The network is the only way those machines can communicate (they don't share disks)
Networks are not reliable
- Requests can get lost (someone unplugs a cable)
- Request may wait in a queue and be delivered later
- Remote node can fail (Crash or powered down)
- Remote node stops responding (e.g. a very long garbage collection pause)
- Remote node may have processed the request, but the response has been delayed or lost
The sender can't even tell whether the request was delivered. The receiver may also send a response that gets lost
The usual way of handling this is a timeout, but even after a timeout you still don't know whether the receiver got the request.
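A minimal sketch of that ambiguity (just an illustration, not from the book): a timed-out request may have been lost, still be queued, or have been processed with its response lost, so retrying can execute the operation twice.

```python
import socket

def call_with_retries(host, port, payload, timeout_s=1.0, retries=3):
    # Hypothetical request/response over a raw socket, purely for illustration.
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout_s) as conn:
                conn.settimeout(timeout_s)
                conn.sendall(payload)
                return conn.recv(4096)   # success: we got a response back
        except socket.timeout:
            # Ambiguous outcome: the request may be lost, still queued,
            # or processed with the response lost. Retrying may run it twice.
            continue
    raise TimeoutError("no response after %d attempts" % retries)
```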
Network Faults in Practice
One study found that a medium-sized datacenter had about 12 network faults per month. Human error (e.g. misconfigured switches) is a major cause of outages, so adding redundancy doesn't help as much as you'd hope
Public cloud services like EC2 are notorious for frequent transient network glitches. Sharks can even bite and damage undersea cables
It's important that the software can handle these faults, because they can always occur
Detecting faults
Many systems need to automatically detect faulty nodes so they can be cleaned up
- Load balancer needs to stop sending requests to a dead node
- In single-leader replication, if the leader dies, a follower needs to be promoted
But network uncertainty means it’s hard to tell if a node is dead
- If the process crashed but the node's OS is still running, a script can notify the other nodes about the crash so another node can take over quickly
- Management interface of network switches can be queried to detect link failures at hardware level
- Router can say ICMP Destination Unreachable
- Kleppmann also mentions that you can retry a few times and eventually declare the node dead
- Xu always just did heartbeats
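As a rough illustration of the heartbeat idea (the timeout value and helper names below are made up, not from either book): the monitor only records when it last heard from each node, and a silent node is merely *suspected* dead, since it may just be slow or cut off.

```python
import time

HEARTBEAT_TIMEOUT_S = 10.0   # example threshold, not a recommended value

last_heartbeat = {}           # node_id -> monotonic time of last heartbeat

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def suspected_dead(node_id):
    # A node that hasn't reported within the timeout is only *suspected* dead:
    # it may be paused, overloaded, or unreachable from us but otherwise fine.
    last = last_heartbeat.get(node_id)
    return last is None or (time.monotonic() - last) > HEARTBEAT_TIMEOUT_S
```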
Timeouts and Unbounded Delays
There's no single right answer for how long a timeout should be; long and short timeouts each have trade-offs
- A long timeout means a long wait until a node is declared dead; users may have to wait or see error messages in the meantime
- Short timeout means we can incorrectly declare a node as dead when it only had a temporary slowdown
- This is especially bad if the node was in the middle of some kind of work
- Or other nodes are already busy with heavy load; prematurely declaring nodes dead shifts their work onto others and can make the overload worse
Asynchronous networks have unbounded delays: there is no upper limit on how long a packet may take to arrive
Network Congestion and queueing
The variability of packet delays mostly comes down to traffic and queueing, similar to driving on a congested highway
Data packets must wait if several nodes are sending to the same destination
If all CPU cores on the receiving node are busy, the operating system queues the incoming request until the application is ready to handle it
Virtual machines can sporadically pause too (while the hypervisor runs another VM)
TCP performs flow control, in which a node limits its own rate of sending to avoid overloading the network link or the receiver
Systems can use a variable timeout instead of a constant one
- They can measure their response times and automatically adjust timeouts based on the data
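A sketch of that idea (roughly the mechanism behind TCP's retransmission timeout and the Phi Accrual failure detector; the smoothing weights below are conventional TCP values, not something this chapter prescribes):

```python
class AdaptiveTimeout:
    """Derive the timeout from observed response times instead of a constant."""

    def __init__(self, initial_s=1.0):
        self.srtt = initial_s        # smoothed response time
        self.rttvar = initial_s / 2  # smoothed variability

    def observe(self, sample_s):
        # Exponentially weighted moving averages of the response time and its
        # deviation (0.875/0.125 and 0.75/0.25 are the classic TCP weights).
        self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - sample_s)
        self.srtt = 0.875 * self.srtt + 0.125 * sample_s

    def timeout(self):
        # Allow headroom proportional to how jittery responses have been.
        return self.srtt + 4 * self.rttvar
```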
Synchronous Versus Asynchronous Networks
Why can’t we just solve this at the hardware level and make network reliable so that software doesn’t need to worry about it?
Imagine if it could be like phone calls on a fixed telephone network (no VoIP, non-cellular)
A phone call establishes a circuit: a fixed, guaranteed amount of bandwidth is allocated along the entire route for the duration of the call. This is a synchronous network
Can we not make network delays predictable?
Not gonna write this out, but TL;DR -> not really: datacenter networks use packet switching because it handles bursty traffic efficiently, and the cost is variable queueing delays
Unreliable Clocks
In a distributed system, each node has its own clock and its own notion of the current time
Time can be important
- When has request timed out
- What is the 99th percentile response time of this service
- How long did the user spend on the site
- Ticketmaster problem
- etc.
Time is hard because communication is not instantaneous: network delays mean you don't know how much time passed between when something happened and when you observe it. Each machine on the network also has its own clock, usually a quartz crystal oscillator, and these are not perfectly accurate.
Monotonic Versus Time-of-Day Clocks
Computers have two types of clocks. NTP -> Network Time Protocol, used to synchronize a machine's clock against a group of time servers
Time of Day
- Synced to NTP and based on the Gregorian calendar; may jump backwards in time if the local clock is ahead of NTP
Monotonic Clocks
- Good for measuring duration
- The difference between two readings tells you how much time elapsed
- The absolute value is meaningless and can't be compared across machines (or even across CPUs with separate timers)
- Usually fine to use for measuring timeouts in distributed systems
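In Python terms (a small sketch, not from the book): time.time() is the time-of-day clock and time.monotonic() is the monotonic clock, and only the latter is safe for measuring elapsed time.

```python
import time

def elapsed_with_wall_clock(fn):
    # Time-of-day clock: can jump if NTP steps the clock, so the result
    # may be negative or wildly wrong.
    start = time.time()
    fn()
    return time.time() - start

def elapsed_with_monotonic(fn):
    # Monotonic clock: never goes backwards, safe for durations and timeouts.
    start = time.monotonic()
    fn()
    return time.monotonic() - start
```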
Clock Synchronization and Accuracy
Clock drift is a thing in quartz clocks.
- Google assumes about 200 ppm of clock drift for its servers, i.e. roughly 6 ms of drift if the clock is resynchronized every 30 seconds (200 ppm × 30 s ≈ 6 ms)
- About 17 seconds of drift for a clock resynchronized only once a day (200 ppm × 86,400 s ≈ 17 s)
- Leap Seconds are a thing and can crash entire systems
- node can be firewalled off from NTP by accident
- Variable network delay limits how accurately NTP can synchronize clocks
The MiFID II draft European regulation for high-frequency trading requires clocks synchronized to within 100 microseconds of UTC, to help detect market manipulation and investigate flash crashes
- They achieve this using GPS receivers, Precision Time Protocol, and careful deployment and monitoring
- Requires a lot of effort and expertise, plenty of ways to go wrong
Relying on Synchronized Clocks
Problem is that time is insanely complicated
Another problem is that incorrect clocks easily go unnoticed: most things work fine if a clock is slightly off, but it keeps drifting until it quietly causes problems
If you have software that requires synchronized clocks, it's crucial to monitor the clock offsets between all machines.
If a node's clock drifts too far from the others, it should be declared dead and removed from the cluster.
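A tiny sketch of that monitoring step (the offset threshold and the idea of a pre-measured offset map are assumptions here, not from the book): collect each node's measured offset from a reference time source and flag the ones that have drifted too far.

```python
MAX_OFFSET_S = 0.05   # example threshold; pick based on what correctness requires

def nodes_to_evict(offsets_by_node):
    """offsets_by_node: node_id -> measured offset from a reference clock, in seconds.

    Returns the nodes whose clocks have drifted beyond the allowed offset and
    should be treated as dead until their clock is fixed.
    """
    return [node for node, offset in offsets_by_node.items()
            if abs(offset) > MAX_OFFSET_S]
```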
Timestamps for ordering events
Let's say we have the Ticketmaster problem and several people buy tickets at roughly the same time. Which write happened first in the distributed system?
Time of day is terrible here
Last Write Wins is a widely used conflict resolution strategy
- Problems still occur
- Database writes can mysteriously disappear (a node with a lagging clock can't overwrite values previously written by a node with a fast clock)
- LWW can’t distinguish between writes that occur in quick succession sequentially
- Two nodes can generate the same timestamp, additional tie breaker value is required
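A minimal sketch of LWW with such a tie-breaker (the tuple layout is just an illustration): compare (timestamp, node_id) pairs so identical timestamps still order deterministically, though the underlying clock-skew problem remains.

```python
def lww_merge(write_a, write_b):
    """Each write is (timestamp, node_id, value); the 'later' one wins.

    The node_id breaks ties between identical timestamps. Note this is still
    LWW: a node with a fast clock can silently clobber causally later writes.
    """
    return max(write_a, write_b, key=lambda w: (w[0], w[1]))

# Example: same timestamp, so the node id decides
print(lww_merge((1700000000.0, "node-b", "x=1"),
                (1700000000.0, "node-a", "x=2")))   # node-b's write wins
```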
Is it possible to make NTP sync accurate enough to prevent incorrect orderings? Probably not; NTP's accuracy is itself limited by network delay
For correct ordering from timestamps, the clock would need to be considerably more accurate than the thing being measured (the network delay)
- Logical clocks are based on incrementing counters rather than an oscillating quartz crystal
- Safer for ordering events
- Don’t measure time of day, just order of events
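For illustration, a minimal Lamport clock (one common kind of logical clock; a standard textbook construction, not something specific to this chapter):

```python
class LamportClock:
    """An incrementing counter that orders events consistently with causality."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Every local event bumps the counter.
        self.counter += 1
        return self.counter

    def send(self):
        # Attach the current counter value to outgoing messages.
        return self.local_event()

    def receive(self, remote_counter):
        # On receipt, take the max of local and remote, then increment,
        # so the receive event is ordered after the send event.
        self.counter = max(self.counter, remote_counter) + 1
        return self.counter
```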
Clock Readings Have a Confidence Interval
A clock reading shows a ton of digits, down to the nanoseconds, but it can be way off
It's better to think of a reading as a range of possible times: e.g. a system may be 95% confident that the current time is between 10.3 and 10.5 seconds past the minute
Synchronized clocks for global snapshots
Snapshot isolation is usually implemented with a monotonically increasing transaction ID, but how do we generate that in a distributed database?
- Pretty sure Xu just gives each node/datacenter its own ID (used in Twitter's Snowflake IDs, I think)
- And then each one has its own incrementing sequence
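A rough sketch of that Snowflake-style scheme (the bit layout below is the commonly described one: 41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence; it's an assumption here, not taken from this chapter):

```python
import time
import threading

class SnowflakeIds:
    EPOCH_MS = 1288834974657   # custom epoch (Twitter's, for illustration only)

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024      # 10-bit worker/datacenter id
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        # Sketch only: handling of sequence overflow within one millisecond and
        # of the clock jumping backwards is deliberately omitted.
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # 12-bit sequence
            else:
                self.sequence = 0
                self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence
```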
It’s possible to use timestamps from the clocks as transaction IDs
- Need to handle uncertainty though
Spanner uses the clock's confidence interval, exposed by Google's TrueTime API
- Spanner deliberately waits out the length of the confidence interval before committing a write (commit wait), so its timestamp is guaranteed to be in the past by the time the write is visible (see the sketch below)
Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms
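A toy version of that commit-wait idea (truetime_now below is a hypothetical stand-in for the TrueTime API, which returns an earliest/latest interval; this is a paraphrase, not Google's code):

```python
import time

def commit_wait(truetime_now):
    # truetime_now() returns (earliest, latest): bounds on the true current time.
    earliest, latest = truetime_now()
    commit_ts = latest                 # pick the upper bound as the commit timestamp
    time.sleep(latest - earliest)      # wait out the uncertainty interval
    # After the wait, the true time is at least `latest`, so commit_ts is
    # definitely in the past before the write becomes visible.
    return commit_ts
```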
This is an area of active research, but these ideas haven't been implemented in mainstream databases outside of Google
Process Pauses
Let’s consider another dangerous example of clocks in distributed systems
In single-leader replication, how does a leader know that it's still the leader and that it hasn't been declared dead by the others?
One option is for the leader to hold a lease granted by the other nodes. Only one node can hold the lease at a time
- Must periodically renew its lease
- Can renew lease every 10 seconds (But this relies on the clock)
If the node unexpectedly pauses past the lease expiry, the lease runs out without it noticing, and it may keep acting as leader
- This happens often with garbage collection pauses in many programming languages
- Not to mention virtual environments
- User closes laptop and suspends
- OS does context switch to another thread
A node must assume that its execution can be paused for a significant length of time at any point, even in the middle of a function call
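A sketch of the risky lease pattern, modelled loosely on the book's Java example (the lease object's methods below are hypothetical): the lease check and the work it authorizes are separated in time, and a long pause in between breaks the assumption.

```python
import time

def handle_requests(lease, get_incoming_request, process):
    # lease.expiry_time(), lease.renew(), lease.is_valid() are assumed helpers.
    while True:
        request = get_incoming_request()
        # Renew if fewer than 10 seconds remain on the lease.
        if lease.expiry_time() - time.time() < 10.0:
            lease = lease.renew()
        if lease.is_valid():
            # DANGER: if the thread pauses here (GC, VM suspend, etc.) for
            # longer than the remaining lease, another node may already have
            # taken over leadership while we still process the request.
            process(request)
```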
Response time guarantees
You don't want a pause (e.g. garbage collection) to delay the airbag release system during a car crash
There are ways to guarantee hard real-time response (real-time operating systems, careful worst-case timing analysis), but it's a lot more work
You can also limit the impact of Garbage Collection
Knowledge, Truth, and Lies
What do we even know about Distributed Systems then?
Instead, we can state our assumptions about the system (a system model) and design software that behaves correctly under those assumptions
Truth Is Defined by the Majority
A node cannot trust its own judgment. Instead, decisions rely on a quorum: the nodes vote, and the majority's verdict wins
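A trivially small sketch of the majority rule (just an illustration): a decision such as "node X is dead" only takes effect once more than half of the cluster agrees, so no single node's possibly-faulty view decides on its own.

```python
def quorum_reached(votes_in_favor, cluster_size):
    # A strict majority: more than half of all nodes must agree.
    return votes_in_favor > cluster_size // 2

print(quorum_reached(3, 5))   # True: 3 of 5 is a majority
print(quorum_reached(2, 4))   # False: need more than half
```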
Leader and the lock
skipping…
Fencing Tokens
skipping…
Byzantine Faults
skipping…