This final part of the book examines the issues around integrating multiple different data systems, potentially with different data models and optimized for different access patterns, into one coherent application architecture.
Systems of Record and Derived Data
At a high level, systems that store and process data can be grouped into two broad categories
Systems of Record
- Source of Truth
- Authoritative version of your data
- Data is written here first, and each fact is represented exactly once (typically normalized)
Derived Data Systems
- Take existing data from another system and transform or process it in some way
- Can be recreated by going back to the source of truth if lost
Overview of Chapters
- Chapter 10 will go into Batch dataflow systems such as MapReduce
- Chapter 11 will take those ideas and apply them to data streams
- Chapter 12 concludes the book by exploring ideas on how to use these tools to build reliable, scalable, and maintainable applications in the future
Chapter 10 Batch Processing
The first two parts of the book talked about requests, queries, and responses / results
- A lot of web servers work this way
- This is an online system
- Usually there’s a web browser requesting a page or a service calling a remote API; it’s generally triggered by a human user, and the user waits for a response
This isn’t the only way of building a system. Alex Xu mentions these three types as well
Services (Online Systems)
- A service waits for a request or instruction from a client and tries to handle it as quickly as possible once it arrives. Response time is the main measure of performance; availability also matters
Batch Processing Systems (Offline Systems)
- Takes a large amount of input data, runs a job to process it, and produces some output
- Often take a while to run (a few minutes to several days)
- Usually scheduled to run periodically
- Primary performance measure is throughput
- Examples include the Metrics Monitoring / Aggregation system and the Ad Click Aggregator / Top K problems from Xu
Stream Processing Systems (Near-Real-time Systems)
- Somewhere between online and offline systems
The MapReduce paper was published in 2004, and the model has since been implemented in many data systems
Batch Processing with Unix Tools
Start with a simple example
You have a web server that appends a line to a log file every time it serves a request
Simple Log Analysis
Various tools can take these log files and produce nice reports. But let’s make our own.
What if you want to find the 5 most popular pages? (Wow, Top K from Xu)
Kleppmann recommends checking out Unix tools as they can be incredibly powerful
- awk / sed / grep / sort / uniq / xargs perform surprisingly well
Anyways, we can run those commands and get the top 5 most popular pages from our log file
Chain of Commands versus custom program
Could also write a Ruby or Python script to do the same
There’s a big difference in execution flow between the two approaches when we run them over a large file
Sorting versus in-memory aggregation
The Ruby/Python script would keep an in-memory hash table of URLs, like a Counter(), mapping each URL to the number of times it was seen
The Unix pipeline doesn’t keep such a table; it relies on sorting, so repeated occurrences of the same URL end up next to each other
Which approach is better depends on the data; it’s a space vs. time trade-off
- 1GB of memory can store quite a lot of distinct URLs, so the hash table often works fine
The sort-based Unix approach scales to a massive number of unique URLs, because sort spills to disk when the data doesn’t fit in memory and parallelizes across CPU cores
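A rough sketch of both approaches in Python (not from the book; the log path and the field index are assumptions about an nginx-style access log, and the real Unix version would be a pipeline along the lines of `awk '{print $7}' | sort | uniq -c | sort -rn | head -n 5`):

```python
from collections import Counter
from itertools import groupby

LOG_FILE = "access.log"   # hypothetical path, just for illustration
URL_FIELD = 6             # assume the URL is the 7th whitespace-separated field

def extract_urls(path):
    """Yield the requested URL from each log line."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) > URL_FIELD:
                yield fields[URL_FIELD]

# Approach 1: in-memory aggregation (what the Ruby/Python script would do).
# A hash table of URL -> count; fine as long as the distinct URLs fit in memory.
def top_urls_in_memory(path, k=5):
    return Counter(extract_urls(path)).most_common(k)

# Approach 2: sort-based aggregation (the idea behind sort | uniq -c | sort -rn | head).
# Sorting puts repeated URLs next to each other, so counting needs no hash table;
# GNU sort would additionally spill to disk and use multiple cores for huge inputs.
def top_urls_by_sorting(path, k=5):
    urls = sorted(extract_urls(path))
    counts = ((sum(1 for _ in group), url) for url, group in groupby(urls))
    return sorted(counts, reverse=True)[:k]
```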
The Unix Philosophy
This was a key design feature of Unix and it remains astonishingly relevant today.
The Unix philosophy, as described back in 1978, went like this
- Make each program do one thing well. To do a new job, build a new program rather than adding new features to old programs and complicating them
- Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information
- Design and Build software to be tried early, ideally within weeks. Throw away clumsy parts and rebuild them
- Use tools in preference to unskilled help to lighten a programming task, even if you have to build tools and expect to throw some of them away after you finish
Automation, rapid prototyping, incremental iteration, being friendly to experimentation, and breaking down large projects into manageable chunks sound a lot like the Agile and DevOps movements of today. Surprisingly little has changed in four decades
A Uniform Interface
The power of Unix comes from a uniform interface: everything is a file, i.e. an ordered sequence of bytes. Actual files, TCP connections, device drivers, and stdin/stdout are all accessible through that same interface, which is what lets programs be composed
Separation of logic and Wiring
Pipes allow you to redirect input and output
Now you can have decoupled programs that can do even more
The main limitation is that it’s hard for a program to have multiple inputs or outputs
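To make the “logic vs. wiring” point concrete, here’s a toy pipe-friendly filter (my own sketch, not from the book). It only knows about stdin and stdout; whether the input comes from a file, another program, or a TCP stream is decided entirely by whoever wires it up in the shell.

```python
import sys

# Usage (wiring chosen by the caller, not the program):
#   cat access.log | python only_errors.py | less
def main():
    for line in sys.stdin:
        if " 500 " in line:        # arbitrary logic: keep only lines mentioning HTTP 500s
            sys.stdout.write(line)

if __name__ == "__main__":
    main()
```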
Transparency and experimentation
Unix makes it very easy to see what’s happening
- Input files are treated as immutable, so you can run the commands as often as you want without damaging them
- Can end the pipeline at any point, pipe the output into less, and check whether it has the expected form (great for debugging)
- Write the output of one pipeline stage to a file and use that file as input to next stage
- Restart stage without rerunning entire pipeline
MapReduce and Distributed Filesystems
Like Unix tools, but distributed across potentially thousands of machines
Running a MapReduce job normally does not modify its input.
MapReduce jobs read and write files on a distributed filesystem.
In Hadoop’s implementation of MapReduce, that filesystem is HDFS (the Hadoop Distributed File System). There are alternatives like QFS, AWS S3, and Azure Blob Storage, but we’ll focus on HDFS
HDFS is based on the shared-nothing principle
- No special hardware needed, only computers connected by conventional datacenter network
HDFS has a daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine
- Central server is called NameNode
- Keeps track of which file blocks are stored on which machine
- Replication is also done to handle failures
- Can use an erasure coding scheme (such as Reed-Solomon codes) too (I remember this)
MapReduce Job Execution
MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS
- Step 1 -> break the input files into records
- Step 2 -> call the mapper on each record to extract a key and value
- Step 3 -> sort all of the key-value pairs by key
- Step 4 -> call the reducer to iterate over the sorted key-value pairs (e.g. counting them)
To create a MapReduce job, implement two callback functions
- Mapper -> called once for every input record
- Extracts a key and value from the record
- Reducer -> takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and iterates over them
This is pretty confusing honestly; it doesn’t feel like what I thought MapReduce was
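A tiny single-machine simulation (my own sketch, not Hadoop’s API) made the contract clearer to me: the framework’s real work is everything between map and reduce, which collapses here into a plain sort. Continuing the page-view example:

```python
from itertools import groupby

def mapper(record):
    """Called once per input record; emits (key, value) pairs."""
    fields = record.split()
    if len(fields) > 6:
        yield (fields[6], 1)            # key = URL, value = 1

def reducer(key, values):
    """Called once per distinct key, with every value that shares that key."""
    yield (key, sum(values))

def run_job(records):
    # Map phase: apply the mapper to every record.
    pairs = [kv for record in records for kv in mapper(record)]
    # The framework would sort/shuffle by key across machines; locally we just sort.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: each run of identical keys becomes one reducer call.
    output = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        output.extend(reducer(key, (value for _, value in group)))
    return output
```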
Distributed Execution of MapReduce
The mapper and reducer only operate on one record at a time; they don’t know where the input comes from or where the output goes, so the framework can spread the work across many machines
The key-value pairs must be sorted, but the dataset is usually too large to sort with a conventional sorting algorithm on a single machine
Each map task partitions its output by reducer, based on a hash of the key, and writes each partition to a sorted file on the mapper’s local disk
Then the “shuffle” occurs: each reducer downloads its partition files from every mapper and merges them, preserving the sort order (basically a big distributed sort, imo)
When a mapper finishes and its sorted output files are ready, the MapReduce scheduler tells the reducers they can start fetching them
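The partitioning step itself is tiny; something like this (a sketch with a made-up reducer count; real frameworks ship their own partitioner, e.g. Hadoop’s HashPartitioner):

```python
import hashlib

NUM_REDUCERS = 4   # assumption for illustration

def partition_for(key: str) -> int:
    """Deterministically assign a key to a reducer. Because every mapper uses the
    same function, all pairs for a given key land in the same reducer's partition,
    no matter which mapper produced them."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_REDUCERS
```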
MapReduce Workflows
MapReduce jobs are commonly chained together into workflows; a single job isn’t enough. For example, one job only gives the number of page views per URL, and a second job is needed to sort and find the most popular ones.
Hadoop doesn’t have built-in support for these workflows, so the chaining must be done explicitly: the first job writes its output to a directory in HDFS, and the second job reads that directory as its input.
Workflows of 50 to 100 jobs are common when building recommendation systems
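A rough local sketch of what the explicit chaining looks like (directory names, file formats, and the field index are all made up): job 1 counts page views per URL and writes them to an output directory; job 2 treats that directory as its input and pulls out the most popular pages.

```python
import glob
import heapq
import os
from collections import Counter

JOB1_OUT = "out/pageviews-per-url"   # stands in for an HDFS output directory

def job1_count_views(log_files):
    """First job: write 'url \t count' lines into the output directory."""
    counts = Counter()
    for path in log_files:
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) > 6:
                    counts[fields[6]] += 1
    os.makedirs(JOB1_OUT, exist_ok=True)
    with open(os.path.join(JOB1_OUT, "part-00000"), "w") as out:
        for url, n in counts.items():
            out.write(f"{url}\t{n}\n")

def job2_top_k(k=5):
    """Second job: read job 1's output directory as its input."""
    rows = []
    for path in glob.glob(os.path.join(JOB1_OUT, "part-*")):
        with open(path) as f:
            for line in f:
                url, n = line.rstrip("\n").split("\t")
                rows.append((int(n), url))
    return heapq.nlargest(k, rows)
```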
There are higher-level tools for Hadoop that help with setting up and managing these multi-job workflows
- Pig, Hive, Cascading, Crunch, FlumeJava
Reduce-Side Joins and Grouping
We never discussed how joins are actually implemented. In a database:
- Foreign key is specified
- Indexing is used to find a record
- Multiple index look ups are required for a join
MapReduce has no concept of indexes though; given a set of files as input, it reads the entire content of them all (effectively a full table scan). So how are joins done efficiently?
Example - User Activity
A typical example: you have a log of activity events recording that user X clicked thing Y, and a separate user database in which each user_id has an email address and date of birth (DOB)
Having the batch job query the remote user database for every event would be far too slow, so that isn’t really an option
Instead, make a copy of the user database and put it into the same distributed filesystem as the log of user activity events. Then MapReduce can join the user records with the activity records
Sort-merge joins
The mapper’s purpose is to extract a key and value from each input record. One set of mappers goes over the activity events, emitting the user_id as the key and the activity event as the value; another set of mappers goes over the user database, emitting the user_id as the key and the DOB as the value
With a secondary sort, the reducer can be made to see the user-database record (the DOB) first, followed by the activity events ordered by timestamp
The reducer then performs the join logic: it knows the first value is the DOB, stores it in a local variable, then iterates over the activity events with the same user_id, outputting pairs of (viewed-url, viewer-age-in-years)
Don’t fully understand but okay
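A single-machine sketch of the sort-merge join helped it click (my own toy code; the record layouts and the age calculation are invented). The trick is that the key includes a tag so that, after sorting, the DOB record for a user arrives at the reducer before that user’s activity events; that’s the secondary sort.

```python
from datetime import date
from itertools import groupby

def user_mapper(user_record):
    user_id, dob = user_record               # e.g. ("u1", date(1990, 5, 1))
    yield ((user_id, 0), ("dob", dob))       # tag 0 sorts before the activity events

def activity_mapper(event):
    user_id, url = event                     # e.g. ("u1", "/home")
    yield ((user_id, 1), ("view", url))

def join_reducer(user_id, values):
    """The first value is the DOB record (thanks to the secondary sort);
    the rest are page views by the same user."""
    values = iter(values)
    tag, dob = next(values)
    assert tag == "dob"                      # assumes every user has a database record
    age = date.today().year - dob.year       # rough age in years
    for _, url in values:
        yield (url, age)                     # (viewed-url, viewer-age-in-years)

def run_join(users, events):
    pairs = [kv for u in users for kv in user_mapper(u)]
    pairs += [kv for e in events for kv in activity_mapper(e)]
    pairs.sort(key=lambda kv: kv[0])         # sort by (user_id, tag): the "shuffle"
    output = []
    for user_id, group in groupby(pairs, key=lambda kv: kv[0][0]):
        output.extend(join_reducer(user_id, (value for _, value in group)))
    return output

print(run_join([("u1", date(1990, 5, 1))], [("u1", "/home"), ("u1", "/about")]))
```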
Bringing related data together in the same place
In a sort-merge join, the mappers and the sorting process make sure that all the data needed to perform the join for a particular user_id is brought together in the same place: a single call to the reducer.
Another way to look at it: the mappers “send messages” to the reducers. When a mapper emits a key-value pair, the key acts like a destination address; all values with the same key get delivered to the same reducer.
Group By
Another common use of the “bringing related data together” pattern is grouping records by key (like GROUP BY in SQL), e.g.:
- Counting the number of records in each group
- Adding up the values in one particular field
- Picking the top K records according to some ranking function
Set up mappers so that key-value pairs they produce use the desired grouping key
Partitioning and sorting process then brings together all the records with the same key in the same reducer
Grouping and Joining look very similar for MapReduce
Grouping is also commonly used in collating all the activity events for a specific user session (sessionization)
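Grouping is really the same machinery with a different key. A quick sessionization sketch under the same local-simulation assumptions (the event fields are made up): key on the session ID, and the reducer receives all of that session’s events in one place, ready to be ordered by timestamp.

```python
from itertools import groupby

def session_mapper(event):
    # event is assumed to look like {"session": "s1", "ts": 1700000000, "url": "/home"}
    yield (event["session"], (event["ts"], event["url"]))

def session_reducer(session_id, values):
    # All of one session's events arrive together; order them by timestamp.
    yield (session_id, sorted(values))

def sessionize(events):
    pairs = sorted((kv for e in events for kv in session_mapper(e)), key=lambda kv: kv[0])
    return [result
            for sid, group in groupby(pairs, key=lambda kv: kv[0])
            for result in session_reducer(sid, [value for _, value in group])]
```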
Handling Skew
The pattern of bringing all records with the same key to the same place breaks down if there is a very large amount of data tied to a single key.
- hot keys, celebrity problem, linchpin objects
If a join has hot keys, can use some algorithms to compensate
- The skewed join method in Pig runs a sampling job first to find the hot keys
- It then spreads the records for each hot key across several reducers chosen at random (and replicates the other side of the join to all of them)
- Crunch uses a sharded join, which is similar
- But it requires the hot keys to be specified explicitly rather than using a sampling job (a rough sketch of the idea follows)
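The core trick in both cases is to split a hot key across several reducers by appending a random suffix, and to replicate the other (small) side of the join to every one of those reducers. A rough sketch (the hot-key set and shard count are arbitrary):

```python
import random

HOT_KEYS = {"justin_bieber"}   # assumed known, via a sampling job or named explicitly
NUM_SHARDS = 10                # how many reducers share the load for one hot key

def salt_activity_key(user_id):
    """Activity events for a hot key get scattered over NUM_SHARDS reducer keys."""
    if user_id in HOT_KEYS:
        return f"{user_id}#{random.randrange(NUM_SHARDS)}"
    return user_id

def replicate_user_record_keys(user_id):
    """The matching user record is replicated to every shard of the hot key,
    so each reducer still has everything it needs to perform the join."""
    if user_id in HOT_KEYS:
        return [f"{user_id}#{i}" for i in range(NUM_SHARDS)]
    return [user_id]
```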
Map-Side Joins
Join algorithm above did actual join logic in the reducers (reduce-side join)
Mappers can also do it
The reduce-side approach is nice because you don’t need to make any assumptions about the input data (its size, sorting, or partitioning)
If you can make assumptions about the input, though, it’s possible to make joins faster by doing them on the map side
Broadcast hash join
The simplest case is when a large dataset is joined with a small dataset that can fit entirely in memory on each mapper
Each mapper loads the small dataset into an in-memory hash table and then just looks up whatever it needs while scanning its partition of the large input
“Broadcast” as in: each mapper for a partition of the large input reads the entirety of the small input, so the small input is effectively broadcast to all mappers
Can also store the small input in a read-only index on local disk; the frequently used parts remain in the OS page cache, so random access is nearly as fast as an in-memory hash table without requiring the dataset to fit in memory
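A sketch of the broadcast hash join (file formats and field names are invented): every mapper loads the small side, the user database, into a dict up front, then streams through its partition of the large input and joins with simple lookups; there’s no shuffle and no reducer needed for the join itself.

```python
import csv

def load_small_side(path):
    """Load the small dataset entirely into memory as user_id -> dob."""
    with open(path) as f:
        return {row["user_id"]: row["dob"] for row in csv.DictReader(f)}

def broadcast_hash_join_mapper(users_by_id, activity_partition):
    """Runs once per partition of the big input, with the whole small input in memory."""
    for line in activity_partition:
        user_id, url = line.rstrip("\n").split("\t")
        dob = users_by_id.get(user_id)
        if dob is not None:
            yield (url, dob)
```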
I’m gonna skip the partitioned hash join and map-side merge join sections
Output of Batch Workflows
Why are we running all of these jobs in the first place? What do we do with the output?
OLTP vs. OLAP
- Top K problems
- Building search indexes (this is how Google originally used MapReduce)
- Building key-value stores that are then loaded into serving databases
Philosophy of Batch process output
Same philosophy as Unix: treat inputs as immutable, avoid side effects, and write output files once; this makes it easy to rerun a job and to roll back bad output
Comparing Hadoop to Distributed Databases
Kind of like a distributed version of Unix
HDFS is like the filesystem, and MapReduce is like a (somewhat quirky) implementation of a Unix process
When the MapReduce paper was published it wasn’t completely new: Massively Parallel Processing (MPP) databases had done something similar a decade or more earlier
- Gamma, Teradata, and Tandem NonStop SQL were the first
Hadoop is more general purpose: MPP databases focus exclusively on parallel execution of analytic queries, while HDFS + MapReduce can store and process arbitrary data with arbitrary programs
Diversity of Storage
Hadoop opened up the possibility of indiscriminately dumping data into HDFS and figuring out how to process it later; MPP databases require upfront modeling of the data and query patterns
This raw-data-first style of data warehousing seems way better than the MPP purist view of strategizing everything in advance
- careful schema planning up front slows down centralized data collection
Diversity of processing models
MPP are monolithic, tightly integrated
SQL gives expressive queries and elegant semantics
SQL can’t express everything easily
Map Reduce gave engineers ability to run their own code over large datasets
- Hive even built its own SQL query engine on top of MapReduce
Designing for frequent faults
The handling of faults is different too. Batch processes are less sensitive to faults than online systems because no user is sitting there waiting for a response
MPP databases typically just abort the whole query if a node crashes, and run the query again
MapReduce can tolerate failure of a map or reduce task by just retrying that task
MapReduce also eagerly writes intermediate data to disk, partly for fault tolerance
This approach makes more sense for very large jobs, where redoing the entire job after a fault (as MPP would) is too costly
- But individual machines fail fairly rarely, so for smaller jobs the overhead may not be worth it
Beyond MapReduce
MapReduce isn’t everything, but the chapter spent so much time on it because it’s a great learning tool: a clear and simple abstraction of what a distributed batch job does
Simple in the sense that it’s easy to understand what it’s doing, not in the sense that it’s easy to use
There are more tools that make it easier to use.
- Pig, Hive, Cascading, and Crunch were meant to provide higher-level abstractions on top of it
Even with these abstractions, MapReduce’s execution model itself isn’t the fastest for some kinds of processing
Materialization of Intermediate State
A MapReduce job is independent from every other job; each one only cares about its own input and output
But in a workflow a job’s output is often just intermediate state: you already know which job will consume it, so fully materializing it to the distributed filesystem between every stage is wasteful
Dataflow engines
An alternative to MapReduce
They handle an entire workflow as one job, rather than breaking it up into a bunch of independent sub-jobs
Flink, Spark, Tez
They handle faults by recomputing any lost intermediate data from earlier stages, instead of writing everything out to durable storage
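For contrast with the chained MapReduce jobs earlier, here’s roughly what “count page views per URL, then take the top 5” looks like as a single job on a dataflow engine, sketched with PySpark’s RDD API (the input path and field index are assumptions). The engine plans the whole chain at once and keeps intermediate data off HDFS where it can.

```python
from pyspark import SparkContext

sc = SparkContext(appName="top-pages")

top5 = (
    sc.textFile("hdfs:///logs/access/*")        # hypothetical input path
      .map(lambda line: line.split())
      .filter(lambda fields: len(fields) > 6)
      .map(lambda fields: (fields[6], 1))       # (url, 1)
      .reduceByKey(lambda a, b: a + b)          # page views per URL
      .takeOrdered(5, key=lambda kv: -kv[1])    # top 5 by count, no second job needed
)
print(top5)
sc.stop()
```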
Graphs and Iterative Processing
Spark / Flink / Tez arrange the operators in a job as a directed acyclic graph (DAG)
Not the same as graph processing (Which analyze what graphs contain)
Bulk Synchronous Parallel (BSP)
- Spark has GraphX API, Flink has Gelly API
- Help process Graph data
Fault tolerance and parallel execution: fault tolerance is achieved by periodically checkpointing the state of all vertices at the end of an iteration (skipping the details)
High Level APIs and Languages
- Hive, Pig, Crunch
Package managers like Maven or NPM
Specialization for different domains
skipping…