Designing Data-Intensive Applications - Chapter 10: Batch Processing

This final part of the book examines the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture.

Systems of Record and Derived Data

At a high level, systems that store and process data can be grouped into two broad categories:

Systems of Record (the authoritative, source-of-truth version of the data)

Derived Data Systems (data produced by transforming or processing existing data, e.g. caches, indexes, materialized views)

Overview of Chapters

Chapter 10 Batch Processing

The first two parts of the book talked about requests/queries and responses/results.

This isn't the only way of building a system. The book distinguishes three types of systems (Alex Xu mentions these three as well):

Services (Online Systems)

Batch Processing Systems (Offline Systems)

Stream Processing Systems (Near-Real-time Systems)

The MapReduce paper was published in 2004, and the model has since been implemented in many systems, most notably Hadoop.

Batch Processing with Unix Tools

Start with simple example

You have a web server that appends a line to a log file every time it serves a request

Simple Log Analysis

Various tools can take these log files and produce nice reports, but let's make our own.

What if you want to find the top 5 most popular pages on your site? (Wow, Top K from Xu)

Kleppmann recommends checking out Unix tools as they can be incredibly powerful

Anyways, by chaining a handful of commands (awk to pull out the URL field, sort, uniq -c to count repeats, sort -r -n, and head -n 5), we can get the top 5 most popular pages from our log file.

Chain of Commands versus custom program

You could also write a simple Ruby or Python script to do the same thing.

The two approaches have a big difference in execution flow, which becomes apparent when you run the analysis on a large file.

Sorting versus in-memory aggregation

The Ruby/Python script keeps an in-memory hash table of URLs mapped to how many times each was seen, like Python's Counter() (see the sketch below).

The Unix pipeline doesn't keep such a table; it relies on sorting, so repeated occurrences of the same URL end up next to each other and can be counted in a single pass.

Which approach is better depends on the data: if the set of distinct URLs fits in memory, the hash table wins; if not, the sort-based approach handles it more gracefully.

The sort utility in GNU coreutils spills to disk and parallelizes across CPU cores automatically, so the Unix approach scales to a huge number of distinct URLs.
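
A minimal sketch of the in-memory approach, assuming an nginx-style access log where the requested URL is the seventh whitespace-separated field (the filename is made up):

```python
from collections import Counter

# Count how often each URL appears in the log, keeping the counts in memory.
counts = Counter()
with open("access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1

# Top 5 most popular pages, roughly what `sort | uniq -c | sort -r -n | head -n 5` gives.
for url, n in counts.most_common(5):
    print(n, url)
```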

The Unix Philosophy

The idea of connecting programs with pipes was a key design feature of Unix, and it remains astonishingly relevant today.

The Unix philosophy, as described in 1978, went roughly like this: make each program do one thing well; expect the output of every program to become the input to another, as yet unknown, program; design and build software to be tried early, and don't hesitate to throw away the clumsy parts and rebuild; use tools in preference to unskilled help to lighten a programming task.

Automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks sound a lot like the Agile and DevOps movements of today. Surprisingly little has changed in four decades.

A Uniform Interface

The uniform interface in Unix is the file, i.e. an ordered sequence of bytes: actual files, device drivers, TCP connections, and stdin/stdout all expose the same interface, so they are all easily accessible to the same programs.

Separation of logic and Wiring

A program's stdin and stdout can be redirected to and from files, and pipes let you attach the stdout of one process to the stdin of another.

This separates a program's logic from the wiring of its inputs and outputs, so small decoupled programs can be composed into something more powerful.

The main limitation is that it's awkward for programs that need more than one input or output.

Transparency and experimentation

Unix tools make it very easy to see what's going on: input files are treated as immutable, you can end a pipeline early with less to inspect the output, and you can write intermediate results to a file to restart a later stage.

MapReduce and Distributed Filesystems

Like Unix tools, but distributed across potentially thousands of machines

Running a MapReduce job normally does not modify the input and has no side effects other than producing the output.

MapReduce jobs read and write files on a distributed filesystem.

In Hadoop's implementation of MapReduce, that filesystem is HDFS (Hadoop Distributed File System). There are others (QFS, Amazon S3, Azure Blob Storage), but we'll focus on HDFS.

HDFS is based on the shared-nothing principle: commodity machines connected by a conventional datacenter network, no special hardware.

HDFS consists of a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine, plus a central server (the NameNode) that keeps track of which file blocks are stored on which machine.

MapReduce Job Execution

MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS

Step 1 -> read the input files and break them into records
Step 2 -> call the mapper to extract a key and value from each record
Step 3 -> sort all of the key-value pairs by key
Step 4 -> call the reducer to iterate over the sorted pairs and combine them

To create a MapReduce job, you implement two callback functions: the mapper and the reducer (see the sketch below).

This is pretty confusing honestly; it doesn't feel like what I thought MapReduce was.
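
A minimal single-machine sketch of the two callbacks, counting page views per URL; the driver function just simulates what the framework does, and the record format and names are made up:

```python
from itertools import groupby

def mapper(record):
    # Called once per input record; emits any number of (key, value) pairs.
    fields = record.split()
    if len(fields) > 6:
        yield (fields[6], 1)

def reducer(key, values):
    # Called once per key with all the values collected for that key.
    yield (key, sum(values))

def run_job(records):
    # What the framework does: map every record, sort by key, then reduce.
    pairs = sorted(kv for record in records for kv in mapper(record))
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from reducer(key, (value for _, value in group))

# e.g. dict(run_job(open("access.log"))) gives {url: view_count, ...}
```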

Distributed Execution of MapReduce

The mapper and reducer only operate on one record (or one key) at a time, so they don't need to know where their input comes from or where their output goes; the framework handles moving data between machines.

The key-value pairs must be sorted, but the dataset is usually too large to sort with a conventional algorithm on a single machine, so the sorting is performed in stages.

Each map task partitions its output by reducer, based on a hash of the key; each partition is written to a sorted file on the mapper's local disk.

Then the "shuffle" occurs: each reducer downloads its partition of the sorted output from every mapper and merges the files together, preserving the sort order (basically copying plus merge-sorting, imo).

When a mapper has finished reading its input and writing its sorted output, the MapReduce scheduler tells the reducers they can start fetching files from that mapper.
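
A tiny sketch of the partitioning step, assuming the number of reducers is fixed per job (a real framework uses its own stable hash; Python's built-in hash() is randomized per process, so zlib.crc32 stands in for it here):

```python
import zlib

def partition_for(key: str, num_reducers: int) -> int:
    # All pairs with the same key hash to the same partition,
    # and therefore end up at the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```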

MapReduce Workflows

MapReduce jobs are commonly chained together into workflows. A single job is rarely enough; for example, one job only gives you the number of page views per URL, and you need another round of sorting to rank the most popular ones.

Hadoop MapReduce has no built-in support for such workflows; the chaining must be done explicitly: the first job writes its output to a directory in HDFS, and the second job is configured to read that directory as its input.

Workflows of 50 to 100 jobs are common when building recommendation systems

There are workflow schedulers for Hadoop (Oozie, Azkaban, Luigi, Airflow, and others) that help with setting up and managing these multi-job workflows.

Reduce-Side Joins and Grouping

We never discussed how joins are actually implemented

MapReduce has no concept of indexes, though; a job always reads the full content of its input files. So how do we bring the records that need to be joined together without doing a random-access lookup for every record?

Example - User Activity

The typical example: you have a log of activity events recording that user X clicked button Y, and a separate user database in which each user ID is associated with an email address and date of birth.

Querying the remote user database for every event in the log would be far too slow, though, and the results could change while the job is running.

Instead, take a copy of the user database and put it in the same distributed filesystem as the log of user activity events. Then a single MapReduce job can bring the user records and the activity records together and join them.

Sort-merge joins

A mapper's purpose is to extract a key and value from each input record. One set of mappers goes over the activity events, emitting the user ID as the key and the activity event as the value; another set of mappers goes over the user database, emitting the user ID as the key and the date of birth as the value.

The framework can also sort the values for each key (a secondary sort), so that the reducer always sees the record from the user database first, followed by the activity events in order.

The reducer can then perform the join logic: thanks to the secondary sort, it knows the first value is the date of birth, so it stores that in a local variable and iterates over the activity events with the same user ID, outputting pairs of (viewed-url, viewer-age-in-years).

Don’t fully understand but okay
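
A minimal single-machine sketch of the idea; the record shapes (tuples with a birth year) and the driver are made-up simplifications of what the framework does across many machines:

```python
from datetime import date
from itertools import groupby

def user_mapper(user_record):
    user_id, birth_year = user_record            # from the copy of the user database
    yield (user_id, ("a_user", birth_year))      # "a_user" sorts before "b_event"

def activity_mapper(event):
    user_id, viewed_url = event                  # from the activity event log
    yield (user_id, ("b_event", viewed_url))

def join_reducer(user_id, values):
    # The tag prefixes stand in for the secondary sort: the user record is
    # seen first, then all the activity events for the same user ID.
    birth_year = None
    for tag, payload in sorted(values):
        if tag == "a_user":
            birth_year = payload
        elif birth_year is not None:
            yield (payload, date.today().year - birth_year)   # (viewed-url, viewer-age)

def run_join(users, events):
    # Emit key-value pairs from both inputs, sort by key, group, reduce.
    pairs = [kv for u in users for kv in user_mapper(u)]
    pairs += [kv for e in events for kv in activity_mapper(e)]
    pairs.sort()
    for user_id, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from join_reducer(user_id, (value for _, value in group))
```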

In a sort-merge join, the mappers and the sorting process make sure that all the data necessary to perform the join for a particular user ID is brought together in the same place: a single call to the reducer.

Another way to see this is that the mappers "send messages" to the reducers: when a mapper emits a key-value pair, the key acts like the destination address to which the value should be delivered, so all pairs with the same key end up at the same reducer.

Group By

Grouping records by some key (as in SQL's GROUP BY) is the other common use of the bring-related-data-together pattern.

Typical uses: counting the number of records in each group; adding up the values in one particular field; picking the top K records according to some ranking function.

The simplest way to implement grouping with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key.

Partitioning and sorting process then brings together all the records with the same key in the same reducer

Grouping and Joining look very similar for MapReduce

Grouping is also commonly used for collating all the activity events for a particular user session (sessionization).
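
Grouping uses the same machinery as the join above; a tiny sketch of sessionization, where the session ID and event fields are hypothetical:

```python
def session_mapper(event):
    # The grouping key is whatever you want to group by - here a session ID.
    yield (event["session_id"], event)

def session_reducer(session_id, events):
    # All events for one session arrive at the same reducer; collate them.
    yield (session_id, sorted(events, key=lambda e: e["timestamp"]))
```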

Handling Skew

This pattern breaks down if there is a very large amount of data for a single key (a hot key, e.g. a celebrity on a social network): the reducer handling that key gets far more work than the others, and the job only finishes when the slowest reducer does. This is called skew.

If a join has hot keys, there are algorithms to compensate: for example, Pig's skewed join spreads each hot key's records across several reducers chosen at random and replicates the other side of the join to all of them (sketch below).
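
A minimal sketch of that spreading idea; the hot-key set and the number of sub-partitions are assumptions for illustration:

```python
import random

HOT_KEYS = {"celebrity_user_42"}   # would come from a sampling job in practice
SUBPARTITIONS = 10

def skewed_activity_mapper(event):
    user_id, viewed_url = event
    # Hot-key events are spread across several reducers instead of one.
    sub = random.randrange(SUBPARTITIONS) if user_id in HOT_KEYS else 0
    yield ((user_id, sub), ("b_event", viewed_url))

def skewed_user_mapper(user_record):
    user_id, birth_year = user_record
    # The other side of the join is replicated to all of those reducers.
    subs = range(SUBPARTITIONS) if user_id in HOT_KEYS else [0]
    for sub in subs:
        yield ((user_id, sub), ("a_user", birth_year))
```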

Map-Side Joins

The join algorithms above perform the actual join logic in the reducers, so they are known as reduce-side joins.

Mappers can also do it

The reduce-side approach is nice because you don't need to make any assumptions about the input data: whatever its properties, the mappers can prepare it for joining.

All that sorting, copying, and merging is expensive, though; if you can make certain assumptions about the input, it's possible to do faster joins on the map side.

Broadcast hash join

The simplest case is a large dataset joined with a dataset small enough to fit entirely in memory on each mapper.

Each mapper loads the small input into an in-memory hash table when it starts up, then scans the large input and looks up each record's join key in that table.

"Broadcast" as in: each mapper for a partition of the large input reads the entirety of the small input.

The small input can also be stored in a read-only index on the mapper's local disk; the frequently used parts remain in the OS page cache, so random access is nearly as fast as an in-memory hash table without requiring the data to fit in memory.
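
A minimal sketch of a broadcast hash join, under the assumption that the user data fits in memory; load_small_users() and the record shapes are made up:

```python
def load_small_users():
    # In a real job this would read the small input from the distributed
    # filesystem; here it just returns {user_id: birth_year}.
    return {"alice": 1990, "bob": 1985}

USERS = load_small_users()   # built once when the mapper starts up

def broadcast_join_mapper(event):
    # Scan the large input (activity events) and probe the hash table.
    user_id, viewed_url = event
    birth_year = USERS.get(user_id)
    if birth_year is not None:
        yield (viewed_url, birth_year)   # the join needs no reducer at all
```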

I'm gonna skip the partitioned hash join and map-side merge join sections.

Output of Batch Workflows

Why are we running all of these jobs in the first place?

OLTP vs. OLAP

Examples: the top-K problem, building search indexes (Google's original use of MapReduce), and building key-value stores (e.g. recommendation or classifier outputs) whose files are then loaded into databases for serving.

Philosophy of Batch process output

The same philosophy as Unix applies: treat inputs as immutable, avoid side effects, and write output to a new place so a buggy job can simply be rerun (human fault tolerance).

Comparing Hadoop to Distributed Databases

Kind of like a distributed version of Unix

HDFS plays the role of the filesystem, and MapReduce is like a quirky implementation of a Unix process (one that always runs a sort step between the map and reduce stages).

When the MapReduce paper was published, the idea wasn't completely new: massively parallel processing (MPP) databases had implemented parallel analytic queries on clusters of machines a decade or more earlier.

The difference is that MPP databases focus exclusively on analytical processing, while HDFS plus MapReduce is more general purpose, closer to an operating system that can run arbitrary programs.

Diversity of Storage

Hadoop opened up the possibility of indiscriminately dumping data into HDFS and only later figuring out how to process it; MPP databases require careful upfront modeling of the data and query patterns before import.

In practice, dumping raw data and interpreting it later seems to work out far better than the MPP purist view of strategizing the schema up front.

Diversity of processing models

MPP databases are monolithic, tightly integrated pieces of software (storage layout, query planner, scheduling, and execution all in one).

SQL gives expressive queries and elegant semantics

But not all kinds of processing can be sensibly expressed as SQL queries (e.g. building recommendation engines, full-text search indexes, or machine learning models).

MapReduce gave engineers the ability to easily run their own code over large datasets.

Designing for frequent faults

The handling of faults is different. Batch processes are less sensitive to faults because no user is waiting on the result and the job can always be run again.

An MPP database can simply abort the whole query if a node crashes and run it again; that's fine because most queries take only seconds or minutes.

MapReduce can tolerate the failure of a map or reduce task by retrying just that task, without restarting the whole job.

It also eagerly writes intermediate data to disk, partly for fault tolerance and partly because the datasets are too big to fit in memory anyway.

This design pays off for very large, long-running jobs: redoing the entire job after a fault, as an MPP database would, is far more costly than retrying a single task.

Beyond MapReduce

MapReduce is only one of many possible programming models, but the chapter spends so much time on it because it's a great learning tool.

It's simple in the sense that it's easy to understand what it's doing, but not simple to use: even common operations like joins have to be implemented by hand.

Higher-level tools (Pig, Hive, Cascading, Crunch) were built on top of it to make it easier to use.

MapReduce also isn't the fastest for some kinds of processing, which is what motivated the newer execution engines below.

Materialization of Intermediate State

Every MapReduce job is independent from every other job; each one only cares about its own input and output directories.

But what if the output is only intermediate state and you know exactly which job will consume it? Fully materializing that intermediate state to the distributed filesystem (with replication) is often overkill.

Dataflow engines

An alternative to MapReduce

They handle an entire workflow as one job, explicitly modeling the flow of data through several processing stages, instead of breaking it into a ton of independent sub-jobs.

Flink, Spark, Tez

Instead of writing all intermediate state to HDFS, they handle faults by recomputing lost data from whatever inputs are still available (Spark, for example, tracks the lineage of its RDDs).
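
A minimal PySpark sketch of the earlier top-pages workflow expressed as a single dataflow job; the log path and field position are assumptions:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "top-pages")

# One job, several chained operators, no materialized files between stages.
top5 = (
    sc.textFile("hdfs:///logs/access.log")
      .map(lambda line: line.split())
      .filter(lambda fields: len(fields) > 6)
      .map(lambda fields: (fields[6], 1))
      .reduceByKey(lambda a, b: a + b)
      .takeOrdered(5, key=lambda kv: -kv[1])
)
print(top5)
```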

Graphs and Iterative Processing

Spark, Flink, and Tez arrange the operators in a job as a directed acyclic graph (DAG).

That's not the same as graph processing, which analyzes data that is itself a graph (e.g. ranking algorithms over a social network or the web graph).

Bulk Synchronous Parallel (BSP), popularized by Google's Pregel, is the common model for iterative graph processing: in each iteration, a function is called for every vertex, which can update its own state and send messages to other vertices along the edges.

The framework provides fault tolerance and transparent parallel execution; the vertex function only has to think about one vertex at a time.

Skipping the details, but fault tolerance works by periodically checkpointing the state of all vertices at the end of an iteration.
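
A tiny sketch of the vertex-at-a-time idea (label propagation for connected components); the Vertex shape and send_message helper are hypothetical:

```python
def min_label_vertex(vertex, incoming_messages, send_message):
    # Called once per vertex per iteration (superstep). incoming_messages
    # are the labels sent to this vertex in the previous iteration.
    new_label = min([vertex.label, *incoming_messages])
    if new_label < vertex.label:
        vertex.label = new_label
        for neighbor in vertex.out_edges:
            send_message(neighbor, new_label)
```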

High Level APIs and Languages

Package managers like Maven or NPM

Specialization for different domains

skipping…