This final part of the book examines the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture.
Systems of Record and Derived Data
At a high level, systems that store and process data can be grouped into two broad categories
Systems of Record
- Source of Truth
- Authoritative version of your data
- Data is written here first, and each fact is represented exactly once (typically normalized)
Derived Data Systems
- Take existing data from another system and transform or process it in some way
- Can be recreated by going back to the source of truth if lost
Overview of Chapters
- Chapter 10 will go into Batch dataflow systems such as MapReduce
- Chapter 11 will take those ideas and apply them to data streams
- Chapter 12 concludes the book by exploring ideas on how to use these tools to build reliable, scalable, and maintainable applications in the future
Chapter 10 Batch Processing
The first two parts of the book talked about requests, queries, and responses / results
- A lot of web servers work this way
- This is an online system
- Usually there’s a web browser requesting a page or a service calling a remote API; it’s generally triggered by a human user, and the user waits for a response
This isn’t the only way of building a system. Alex Xu mentions these three types as well
Services (Online Systems)
- A service waits for a request or instruction from a client and tries to handle it as quickly as possible once it arrives. Response time is the main measure of performance; availability also matters
Batch Processing Systems (Offline Systems)
- Takes a large amount of input data, runs a job to process it, and produces some output
- Often take a while to run (a few minutes to several days)
- Usually scheduled to run periodically
- Primary performance measure is throughput
- Examples include the Metrics Monitoring / Aggregation system and the Ad Click Aggregator / Top K problems from Xu
Stream Processing Systems (Near-Real-time Systems)
- Somewhere between online and offline systems
The MapReduce paper was published in 2004, and the model has since been implemented in many data systems
Batch Processing with Unix Tools
Start with a simple example
You have a web server that appends a line to a log file every time it serves a request
Simple Log Analysis
Various tools can take these log files and produce nice reports. But let’s make our own.
What if you want to find the 5 most popular pages? (Wow, Top K from Xu)
Kleppmann recommends checking out Unix tools as they can be incredibly powerful
- awk / sed / grep / sort / uniq / xargs perform surprisingly well
Anyways, we can run those commands and get the top 5 most popular pages from our log file
Chain of Commands versus custom program
Could also write a Ruby or Python script to do the same
There’s a big difference in execution flow between the two approaches when we run them over a large file
Sorting versus in-memory aggregation
The Ruby/Python script would keep an in-memory hash table of URLs, like a Counter(), mapping each URL to the number of times it was seen
The Unix pipeline doesn’t keep such a table; it relies on sorting, so repeated occurrences of the same URL end up next to each other
Which approach is better depends on the data; it’s a space vs. time trade-off
- 1GB of memory can store quite a lot of distinct URLs, so the hash table often works fine
The sort-based Unix approach scales to a massive number of unique URLs, because sort spills to disk when the data doesn’t fit in memory and parallelizes across CPU cores
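A rough sketch of both approaches in Python (not from the book; the log path and the field index are assumptions about an nginx-style access log, and the real Unix version would be a pipeline along the lines of `awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5`):

```python
from collections import Counter
from itertools import groupby

LOG_FILE = "access.log"   # hypothetical path, just for illustration
URL_FIELD = 6             # assume the URL is the 7th whitespace-separated field

def extract_urls(path):
    """Yield the requested URL from each log line."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) > URL_FIELD:
                yield fields[URL_FIELD]

# Approach 1: in-memory aggregation (what the Ruby/Python script would do).
# A hash table of URL -> count; fine as long as the distinct URLs fit in memory.
def top_urls_in_memory(path, k=5):
    return Counter(extract_urls(path)).most_common(k)

# Approach 2: sort-based aggregation (the idea behind sort | uniq -c | sort -rn | head).
# Sorting puts repeated URLs next to each other, so counting needs no hash table;
# GNU sort would additionally spill to disk and use multiple cores for huge inputs.
def top_urls_by_sorting(path, k=5):
    urls = sorted(extract_urls(path))
    counts = ((sum(1 for _ in group), url) for url, group in groupby(urls))
    return sorted(counts, reverse=True)[:k]
```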
The Unix Philosophy
This was a key design feature of Unix and it remains astonishingly relevant today.
The Unix philosophy, as described back in 1978, went like this
- Make each program do one thing well. To do a new job, build a new program rather than adding new features to old programs and complicating them
- Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information
- Design and Build software to be tried early, ideally within weeks. Throw away clumsy parts and rebuild them
- Use tools in preference to unskilled help to lighten a programming task, even if you have to build tools and expect to throw some of them away after you finish
Automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks sound a lot like the Agile and DevOps movements of today. Surprisingly little has changed in four decades
A Uniform Interface
The power of Unix comes from a uniform interface: everything is a file, i.e. an ordered sequence of bytes. Actual files, TCP connections, device drivers, and stdin/stdout are all accessible through that same interface, which is what lets programs be composed
Separation of logic and Wiring
Pipes allow you to redirect input and output
Now you can have decoupled programs that can do even more
The main limitation is that it’s hard for a program to have multiple inputs or outputs
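To make the “logic vs. wiring” point concrete, here’s a toy pipe-friendly filter (my own sketch, not from the book). It only knows about stdin and stdout; whether the input comes from a file, another program, or a TCP stream is decided entirely by whoever wires it up in the shell.

```python
import sys

# Usage (wiring chosen by the caller, not the program):
#   cat access.log | python only_errors.py | less
def main():
    for line in sys.stdin:
        if " 500 " in line:        # arbitrary logic: keep only lines mentioning HTTP 500s
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```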
Transparency and experimentation
Unix makes it very easy to see what’s happening
- Input files are treated as immutable, so you can run the commands as often as you want without damaging them
- Can end the pipeline at any point, pipe the output into less, and check whether it has the expected form (great for debugging)
- Write the output of one pipeline stage to a file and use that file as input to next stage
- Restart stage without rerunning entire pipeline
MapReduce and Distributed Filesystems
Like Unix tools, but distributed across potentially thousands of machines
Running a MapReduce job normally does not modify its input.
MapReduce jobs read and write files on a distributed filesystem.
In Hadoop’s implementation of MapReduce, that filesystem is HDFS (the Hadoop Distributed File System). There are alternatives like QFS, AWS S3, and Azure Blob Storage, but we’ll focus on HDFS
HDFS is based on the shared-nothing principle
- No special hardware needed, only computers connected by conventional datacenter network
HDFS has a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine
- Central server is called NameNode
- Keeps track of which file blocks are stored on which machine
- Replication is also done to handle failures
- Can use an erasure coding scheme (such as Reed-Solomon codes) too (I remember this)
MapReduce Job Execution
MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS
- Step 1 -> break the input files into records
- Step 2 -> call the mapper on each record to extract a key and value
- Step 3 -> sort all of the key-value pairs by key
- Step 4 -> call the reducer to iterate over the sorted key-value pairs (e.g. counting them)
To create a MapReduce job, implement two callback functions
- Mapper -> called once for every input record
- Extracts a key and value from the record
- Reducer -> takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and iterates over them
This is pretty confusing honestly; it doesn’t feel like what I thought MapReduce was
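A tiny single-machine simulation (my own sketch, not Hadoop’s API) made the contract clearer to me: the framework’s real work is everything between map and reduce, which collapses here into a plain sort. Continuing the page-view example:

```python
from itertools import groupby

def mapper(record):
    """Called once per input record; emits (key, value) pairs."""
    fields = record.split()
    if len(fields) > 6:
        yield (fields[6], 1)            # key = URL, value = 1

def reducer(key, values):
    """Called once per distinct key, with every value that shares that key."""
    yield (key, sum(values))

def run_job(records):
    # Map phase: apply the mapper to every record.
    pairs = [kv for record in records for kv in mapper(record)]
    # The framework would sort/shuffle by key across machines; locally we just sort.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: each run of identical keys becomes one reducer call.
    output = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        output.extend(reducer(key, (value for _, value in group)))
    return output
```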
Distributed Execution of MapReduce
The mapper and reducer only operate on one record at a time; they don’t know where the input comes from or where the output goes, so the framework can spread the work across many machines
The key-value pairs must be sorted, but the dataset is usually too large to sort with a conventional sorting algorithm on a single machine
Each map task partitions its output by reducer, based on a hash of the key, and writes each partition to a sorted file on the mapper’s local disk
Then the “shuffle” occurs: each reducer downloads its partition files from every mapper and merges them, preserving the sort order (basically a big distributed sort, imo)
When a mapper finishes and its sorted output files are ready, the MapReduce scheduler tells the reducers they can start fetching them
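The partitioning step itself is tiny; something like this (a sketch with a made-up reducer count; real frameworks ship their own partitioner, e.g. Hadoop’s HashPartitioner):

```python
import hashlib

NUM_REDUCERS = 4   # assumption for illustration

def partition_for(key: str) -> int:
    """Deterministically assign a key to a reducer. Because every mapper uses the
    same function, all pairs for a given key land in the same reducer's partition,
    no matter which mapper produced them."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCERS
```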
MapReduce Workflows
MapReduce jobs are commonly chained together into workflows; a single job isn’t enough. For example, one job only gives the number of page views per URL, and a second job is needed to sort and find the most popular ones.
Hadoop doesn’t have built-in support for these workflows, so the chaining must be done explicitly: the first job writes its output to a directory in HDFS, and the second job reads that directory as its input.
Workflows of 50 to 100 jobs are common when building recommendation systems
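A rough local sketch of what the explicit chaining looks like (directory names, file formats, and the field index are all made up): job 1 counts page views per URL and writes them to an output directory; job 2 treats that directory as its input and pulls out the most popular pages.

```python
import glob
import heapq
import os
from collections import Counter

JOB1_OUT = "out/pageviews-per-url"   # stands in for an HDFS output directory

def job1_count_views(log_files):
    """First job: write 'url \t count' lines into the output directory."""
    counts = Counter()
    for path in log_files:
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) > 6:
                    counts[fields[6]] += 1
    os.makedirs(JOB1_OUT, exist_ok=True)
    with open(os.path.join(JOB1_OUT, "part-00000"), "w") as out:
        for url, n in counts.items():
            out.write(f"{url}\t{n}\n")

def job2_top_k(k=5):
    """Second job: read job 1's output directory as its input."""
    rows = []
    for path in glob.glob(os.path.join(JOB1_OUT, "part-*")):
        with open(path) as f:
            for line in f:
                url, n = line.rstrip("\n").split("\t")
                rows.append((int(n), url))
    return heapq.nlargest(k, rows)
```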
There are higher-level tools for Hadoop that help with setting up and managing these multi-job workflows
- Pig, Hive, Cascading, Crunch, FlumeJava
Reduce-Side Joins and Grouping
We never discussed how joins are actually implemented. In a database:
- Foreign key is specified
- Indexing is used to find a record
- Multiple index look ups are required for a join
MapReduce has no concept of indexes though; given a set of files as input, it reads the entire content of them all (effectively a full table scan). So how are joins done efficiently?
Example - User Activity
A typical example: you have a log of activity events recording that user X clicked thing Y, and a separate user database in which each user_id has an email address and date of birth (DOB)
Having the batch job query the remote user database for every event would be far too slow, so that isn’t really an option
Instead, make a copy of the user database and put it into the same distributed filesystem as the log of user activity events. Then MapReduce can join the user records with the activity records
Sort-merge joins
The mapper’s purpose is to extract a key and value from each input record. One set of mappers goes over the activity events, emitting the user_id as the key and the activity event as the value; another set of mappers goes over the user database, emitting the user_id as the key and the DOB as the value
With a secondary sort, the reducer can be made to see the user-database record (the DOB) first, followed by the activity events ordered by timestamp
The reducer then performs the join logic: it knows the first value is the DOB, stores it in a local variable, then iterates over the activity events with the same user_id, outputting pairs of (viewed-url, viewer-age-in-years)
Don’t fully understand but okay
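A single-machine sketch of the sort-merge join helped it click (my own toy code; the record layouts and the age calculation are invented). The trick is that the key includes a tag so that, after sorting, the DOB record for a user arrives at the reducer before that user’s activity events; that’s the secondary sort.

```python
from datetime import date
from itertools import groupby

def user_mapper(user_record):
    user_id, dob = user_record               # e.g. ("u1", date(1990, 5, 1))
    yield ((user_id, 0), ("dob", dob))       # tag 0 sorts before the activity events

def activity_mapper(event):
    user_id, url = event                     # e.g. ("u1", "/home")
    yield ((user_id, 1), ("view", url))

def join_reducer(user_id, values):
    """The first value is the DOB record (thanks to the secondary sort);
    the rest are page views by the same user."""
    values = iter(values)
    tag, dob = next(values)
    assert tag == "dob"                      # assumes every user has a database record
    age = date.today().year - dob.year       # rough age in years
    for _, url in values:
        yield (url, age)                     # (viewed-url, viewer-age-in-years)

def run_join(users, events):
    pairs = [kv for u in users for kv in user_mapper(u)]
    pairs += [kv for e in events for kv in activity_mapper(e)]
    pairs.sort(key=lambda kv: kv[0])         # sort by (user_id, tag): the "shuffle"
    output = []
    for user_id, group in groupby(pairs, key=lambda kv: kv[0][0]):
        output.extend(join_reducer(user_id, (value for _, value in group)))
    return output

print(run_join([("u1", date(1990, 5, 1))], [("u1", "/home"), ("u1", "/about")]))
```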
Bringing related data together in the same place
In a sort-merge join, the mappers and the sorting process make sure that all the data needed to perform the join for a particular user_id is brought together in the same place: a single call to the reducer.
Another way to look at it: the mappers “send messages” to the reducers. When a mapper emits a key-value pair, the key acts like a destination address; all values with the same key get delivered to the same reducer.
Group By
Another common use of the “bringing related data together” pattern is grouping records by key (like GROUP BY in SQL), e.g.:
- Counting the number of records in each group
- Adding up the values in one particular field
- Picking the top K records according to some ranking function
Set up mappers so that key-value pairs they produce use the desired grouping key
Partitioning and sorting process then brings together all the records with the same key in the same reducer
Grouping and Joining look very similar for MapReduce
Grouping is also commonly used in collating all the activity events for a specific user session (sessionization)
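Grouping is really the same machinery with a different key. A quick sessionization sketch under the same local-simulation assumptions (the event fields are made up): key on the session ID, and the reducer receives all of that session’s events in one place, ready to be ordered by timestamp.

```python
from itertools import groupby

def session_mapper(event):
    # event is assumed to look like {"session": "s1", "ts": 1700000000, "url": "/home"}
    yield (event["session"], (event["ts"], event["url"]))

def session_reducer(session_id, values):
    # All of one session's events arrive together; order them by timestamp.
    yield (session_id, sorted(values))

def sessionize(events):
    pairs = sorted((kv for e in events for kv in session_mapper(e)), key=lambda kv: kv[0])
    return [result
            for sid, group in groupby(pairs, key=lambda kv: kv[0])
            for result in session_reducer(sid, [value for _, value in group])]
```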
Handling Skew
The pattern of bringing all records with the same key to the same place breaks down if there is a very large amount of data tied to a single key.
- hot keys, celebrity problem, linchpin objects
If a join has hot keys, can use some algorithms to compensate
- The skewed join method in Pig runs a sampling job first to find the hot keys
- It then spreads the records for each hot key across several reducers chosen at random (and replicates the other side of the join to all of them)
- Crunch uses a sharded join, which is similar
- But it requires the hot keys to be specified explicitly rather than using a sampling job (a rough sketch of the idea follows)
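The core trick in both cases is to split a hot key across several reducers by appending a random suffix, and to replicate the other (small) side of the join to every one of those reducers. A rough sketch (the hot-key set and shard count are arbitrary):

```python
import random

HOT_KEYS = {"justin_bieber"}   # assumed known, via a sampling job or named explicitly
NUM_SHARDS = 10                # how many reducers share the load for one hot key

def salt_activity_key(user_id):
    """Activity events for a hot key get scattered over NUM_SHARDS reducer keys."""
    if user_id in HOT_KEYS:
        return f"{user_id}#{random.randrange(NUM_SHARDS)}"
    return user_id

def replicate_user_record_keys(user_id):
    """The matching user record is replicated to every shard of the hot key,
    so each reducer still has everything it needs to perform the join."""
    if user_id in HOT_KEYS:
        return [f"{user_id}#{i}" for i in range(NUM_SHARDS)]
    return [user_id]
```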
Map-Side Joins
Join algorithm above did actual join logic in the reducers (reduce-side join)
Mappers can also do it
The reduce-side approach is nice because you don’t need to make any assumptions about the input data (its size, sorting, or partitioning)
If you can make assumptions about the input, though, it’s possible to make joins faster by doing them on the map side
Broadcast hash join
The simplest case is when a large dataset is joined with a small dataset that can fit entirely in memory on each mapper
Each mapper loads the small dataset into an in-memory hash table and then just looks up whatever it needs while scanning its partition of the large input
“Broadcast” as in: each mapper for a partition of the large input reads the entirety of the small input, so the small input is effectively broadcast to all mappers
Can also store the small input in a read-only index on local disk; the frequently used parts remain in the OS page cache, so random access is nearly as fast as an in-memory hash table without requiring the dataset to fit in memory
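A sketch of the broadcast hash join (file formats and field names are invented): every mapper loads the small side, the user database, into a dict up front, then streams through its partition of the large input and joins with simple lookups; there’s no shuffle and no reducer needed for the join itself.

```python
import csv

def load_small_side(path):
    """Load the small dataset entirely into memory as user_id -> dob."""
    with open(path) as f:
        return {row["user_id"]: row["dob"] for row in csv.DictReader(f)}

def broadcast_hash_join_mapper(users_by_id, activity_partition):
    """Runs once per partition of the big input, with the whole small input in memory."""
    for line in activity_partition:
        user_id, url = line.rstrip("\n").split("\t")
        dob = users_by_id.get(user_id)
        if dob is not None:
            yield (url, dob)
```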
I’m gonna skip the partitioned hash join and map-side merge join sections
Output of Batch Workflows
Why are we running all of these jobs in the first place? What do we do with the output?
OLTP vs. OLAP
- Top K problems
- Building search indexes (this is how Google originally used MapReduce)
- Building key-value stores that are then loaded into serving databases
Philosophy of Batch process output
Same philosophy as Unix: treat inputs as immutable, avoid side effects, and write output files once; this makes it easy to rerun a job and to roll back bad output
Comparing Hadoop to Distributed Databases
Kind of like a distributed version of Unix
HDFS is like the filesystem, and MapReduce is like a (somewhat quirky) implementation of a Unix process
When the MapReduce paper was published it wasn’t completely new: Massively Parallel Processing (MPP) databases had done something similar a decade or more earlier
- Gamma, Teradata, and Tandem NonStop SQL were the first
Hadoop is more general purpose: MPP databases focus exclusively on parallel execution of analytic queries, while HDFS + MapReduce can store and process arbitrary data with arbitrary programs
Diversity of Storage
Hadoop opened up the possibility of indiscriminately dumping data into HDFS and figuring out how to process it later; MPP databases require upfront modeling of the data and query patterns
This raw-data-first style of data warehousing seems way better than the MPP purist view of strategizing everything in advance
- careful schema planning up front slows down centralized data collection
Diversity of processing models
MPP are monolithic, tightly integrated
SQL gives expressive queries and elegant semantics
SQL can’t express everything easily
Map Reduce gave engineers ability to run their own code over large datasets
- Hive even built its own SQL query engine on top of MapReduce
Designing for frequent faults
The handling of faults is different too. Batch processes are less sensitive to faults than online systems because no user is sitting there waiting for a response
MPP databases typically just abort the whole query if a node crashes, and run the query again
MapReduce can tolerate failure of a map or reduce task by just retrying that task
MapReduce also eagerly writes intermediate data to disk, partly for fault tolerance
This approach makes more sense for very large jobs, where redoing the entire job after a fault (as MPP would) is too costly
- But individual machines fail fairly rarely, so for smaller jobs the overhead may not be worth it
Beyond MapReduce
MapReduce isn’t everything, but the chapter spent so much time on it because it’s a great learning tool: a clear and simple abstraction of what a distributed batch job does
Simple in the sense that it’s easy to understand what it’s doing, not in the sense that it’s easy to use
There are more tools that make it easier to use.
- Pig, Hive, Cascading, and Crunch were meant to provide higher-level abstractions on top of it
Even with these abstractions, MapReduce’s execution model itself isn’t the fastest for some kinds of processing
Materialization of Intermediate State
A MapReduce job is independent from every other job; each one only cares about its own input and output
But in a workflow a job’s output is often just intermediate state: you already know which job will consume it, so fully materializing it to the distributed filesystem between every stage is wasteful
Dataflow engines
An alternative to MapReduce
They handle an entire workflow as one job, rather than breaking it up into a bunch of independent sub-jobs
Flink, Spark, Tez
They handle faults by recomputing any lost intermediate data from earlier stages, instead of writing everything out to durable storage
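For contrast with the chained MapReduce jobs earlier, here’s roughly what “count page views per URL, then take the top 5” looks like as a single job on a dataflow engine, sketched with PySpark’s RDD API (the input path and field index are assumptions). The engine plans the whole chain at once and keeps intermediate data off HDFS where it can.

```python
from pyspark import SparkContext

sc = SparkContext(appName="top-pages")

top5 = (
    sc.textFile("hdfs:///logs/access/*")        # hypothetical input path
      .map(lambda line: line.split())
      .filter(lambda fields: len(fields) > 6)
      .map(lambda fields: (fields[6], 1))       # (url, 1)
      .reduceByKey(lambda a, b: a + b)          # page views per URL
      .takeOrdered(5, key=lambda kv: -kv[1])    # top 5 by count, no second job needed
)
print(top5)
sc.stop()
```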
Graphs and Iterative Processing
Spark / Flink / Tez arrange the operators in a job as a directed acyclic graph (DAG)
Not the same as graph processing (Which analyze what graphs contain)
Bulk Synchronous Parallel (BSP)
- Spark has GraphX API, Flink has Gelly API
- Help process Graph data
Fault tolerance and parallel execution: fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration (skipping the details)
High Level APIs and Languages
- Hive, Pig, Crunch
Package managers like Maven or NPM
Specialization for different domains
skipping…