Designing Data Intensive Applications - Chapter 11 Stream Processing

What do we do when input isn’t bounded (Has finite size)

This chapter will look at event streams as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter.

We will look at how streams are represented, stored, and transmitted over a network. In “Databases and Streams” we will investigate relationship between streams and databases. Finally, in “Processing Streams” Kleppmann will explore approaches and tools for processing those streams continually, and ways that they can be used to build applications

Transmitting Event Streams

In batch processing, inputs and outputs are files. What about for streaming?

When input is a file (sequence of bytes), first step is usually to parse it into a sequence of records. In a stream processing context, the record is known as an event.

Essentially same thing, a small, self-contained, immutable object containing some details
Event usually contains a timestamp indicting when it happened according to a time of day clock

i.e user clicks a button or ad or views a page

Event may be encoded as a text string, or JSON, or perhaps in some binary form

How do we send the messages as soon as they arrive?

DB Triggers are a thing in relational databases, but they’re limited on that they can do

Messaging Systems

Simple way is Unix or TCP connection

Publish / Subscribe model

What to do if producers overload the consumers
- Remove messages, buffer messages in queue, or apply backpressure (block producer from sending)

Direct Messaging from producers to consumers

UDP multi-cast Brokerless messaging libraries like ZeroMQ StasD use unrelibable UDP messaging for collecting metrics from all machines on the network If consumer is exposed on network, can use HTTP or RTC request

This is the idea behind webhooks

Message Brokers

Essentially a DB designed for sending messages

Skipping because I’ve read most of this in Alex Xu’s book

What to do when memory is almost full?

Consumer offsets What to do when consumers are overloaded?

Databases and Streams

We’ve drawn comparisons between message brokers and databases, even though traditionally, they’re considered different categories of tools.

We saw that log-based message brokers have been successful in taking ideas from databases and applying them to messaging.

We can go in reverse, take ideas from messaging streams and apply them to databases

We can store events like users clicking a button or sensor readings, but we can also store events like a write to a DB

A replication log is a stream of database write requests, produced by the leader. The followers apply that stream of writes to their own copy of the database.

state machine replication principle in “Total Order Broadcast”, which states, if every event is a write to databases and every replica processes same events in same order, replicas will all end up in final state (assuming processing an event is deterministic)

Keeping Systems in Sync

Applications must combine different systems. No single systems wins at everything.

We can use data warehouses and then query them

If full periodic database dump is too slow. Sometimes, dual writes can be done

Application writes to each system when data changes

Dual writes can suffer from race conditions though if two systems want to write into each other at the same time

Some writes may fail while other succeeds too

Change Data Capture

Problem with DB replication logs is that their only internal exposed. Client can’t query it directly.

More recently, DB’s are allowing this. Called, Change Data Capture (CDC)

Implementing Change Data Capture

Log consumers can be called derived data systems

CDC makes one database a leader and turns others into followers. Database triggers can be used to implement change data capture

Tend to be more fragile and have performance overheads
Parsing replication log can be more robust approach

LinkedIn’s Databus, Facebook’s Wormhole, and Yahoo’s Sherpa use this idea at large scale.

Initial Snapshot

Having a log of all changes ever made allows reconstruction of the DB

Keeping all changes forever would require too much disk space though. And replaying woudl take too long. Log needs to be truncated

Need to start with a consistent snapshot instead of storing everything Snapshot of DB must correspond to a known position / offset in change log, some CDC tools allow this snapshop facilitiy

Log Compaction

Storage engine goes through records with the same key, and throws away duplicates and keeps the end result on that key. This runs in background several times

If a key is deleted, it marks it and removes all instances of it

Supported in Apache Kafka

API Support for change streams

DB’s are supporting Change Streams as first-class interface now

ReThinkDB allows queries to subscribe to notifications when results of query change Firebase and CouchDB provide data synchronization based on change feed

VoltDB allows transactions to continuously export data from DB in form of stream

Event Sourcing

Developed in Domain-Driven-Design (DDD) community. Will talk it briefly because it incorporates useful and relevant ideas for streaming systems

Event sourcing involves storing all changes to the application state as a log (like CDC)

Difference is:

Data is used as mutable way in CDC
In event sourcing, application logic is built on basis of immutable events

Powerful technique in data modeling

Easier to evolve applications over time
Helps with debugging
Guards against application bugs

Deriving current state from the event log

Event Log must be converted into some version of the application state to be helpful to users

Shopping cart user might want to see what they added in the past

Replaying logs is option like in CDC

Log Compaction needs to be handled differently

Events are modeled at higher level, os event expresses intent of user action, not mechanics of the state update
Later events typically do not override prior events, so you need full history of events to reconstruct final state. Log Compaction is not possible in the same way

Commands and Events

In event sourcing, there’s a difference between events and commands

Commands are like requests that can be rejected Events are within the constraints and are accepted

State, Streams and Immutability

Immutability benefits

Good for financial bookkeeping

Deploying buggy code can never overwrite old data

Deriving several views from same event log

Druid for data analytics can ingest directly from Kafka Kafka Connect sinks can export data from Kafka to several different databases

Concurrency control

Biggest downside to event sourcing and CDC is consumers of event log are asynchronous, so user may make a write to the log, and then read form a log-derived view

Limitations of immutability

Version control systems like Git, Mercurial, and Fossil also rely on immutable data to preserve version history of files

Truly deleting data is hard, since copies can live in many places. SSDs, storage engineers, and filesystems often just write to a new location rather than overwriting. Backups are often deliberately immutably to prevent accidentally deletion or corruption

Processing Streams

What can you do with streams once you have it?

Three options broadly

Take data in events and write to data store / DB /cache/ search index
Push events to users in some way (email, push notification)
Produce one or many output streams

Rest of this chapter will discuss option 3, making other streams

Code that processes streams is known as an operator or job, very similar to Unix and MapReduce

Big different is just that a stream never ends

Uses of Stream Processing

Stream Processing has long been used for monitoring purposes

Fraud Detection systems need usage pattern
Trading systems examine price change
Manufacturing systems track status of machines
Military and Intelligence systems track activities of potential aggressor

Complex event processing

CEP

Use a high-level declarative query language like SQL, or graphical user interface, to describe the patterns of events that should be detected.

Queries are stored long term, and events continuously flow past them in search of a query that matches even pattern

Esper / IBM InfoSphere Streams / Apama TIBCO StreamBase, anD SQLstream.

Stream analytics

Analytics on streams. Similar to CEP, but analytics don’t care about specific event sequences and more on aggregations and statical metrics overall large number of events

Used sometimes in probabilistic algorithms like Bloom filters or set membership

Many Stream processing frameworks are designed with analytics in mind

Apache Storm, Spark Streaming, Flink, Kafka Streams
Azure Stream Analytics, Google Cloud Dataflow

Maintaining Materialized Views

Having windows of different views of the DB / log of events

Search on Streams

Sometimes want to search for individual events based on complex criteria like ful-text search queries

ElasticSearch is good for this

Percolator feature

Queries are stored and documents run past the queries

Message Passing and RPC

skipping this

Reasoning About Time

Stream processors often need to deal with time

Analytics that frequent time windows like (Over last 5 minutes)

Event Time vs. Processing Time

Processing may be delayed

Queueing
Network Faults
Performance issues
Restart of stream consumer
Reprocessing of past events while recovering from a fault

Message delays can mess up message order

Confusing event time and processing time lead to bad data

Knowing when your ready

Never sure when you have all the data for a window

Time out and declare a window ready after you haven’t seen events for awhile

What if a straggler is there?
Ignore it OR publish a correction

Whose clock to use?

When user clicks? When system receives?

Log three timestamps

Time when it occured
Time when sent to server
Time when it was received by server acoording to server clock

Can offset the processing time difference

Types of Windows

Tumbling Window -> Fixed length, every event only in one window

Hopping Window -> Fixed length, windows can overlap in order to give some smoothing

Sliding Window -> All events that occur within some interval

Sessions window -> No fixed duration, all events for a given user

Stream Joins

New events can appear anytime, Joins are harder

Stream-Stream join (window join)

skipping

Stream-table join(stream enrichment)

skipping

Table table join (materialized view maintenance)

skipping