What do we do when the input isn't bounded (i.e., doesn't have a finite size)?
This chapter will look at event streams as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter.
We will look at how streams are represented, stored, and transmitted over a network. In "Databases and Streams" we will investigate the relationship between streams and databases. Finally, in "Processing Streams" Kleppmann explores approaches and tools for processing those streams continually, and ways that they can be used to build applications.
Transmitting Event Streams
In batch processing, inputs and outputs are files. What about for streaming?
When the input is a file (a sequence of bytes), the first step is usually to parse it into a sequence of records. In a stream processing context, a record is known as an event.
- Essentially the same thing: a small, self-contained, immutable object containing the details of something that happened
- An event usually contains a timestamp indicating when it happened according to a time-of-day clock
e.g., a user clicks a button or an ad, or views a page
An event may be encoded as a text string, as JSON, or perhaps in some binary form
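As a concrete illustration, here is a minimal sketch of a page-view event encoded as JSON (the field names are hypothetical, not prescribed by any particular system):

```python
import json
import time

# A hypothetical page-view event: a small, self-contained, immutable record
# with a time-of-day timestamp indicating when it happened.
event = {
    "type": "page_view",
    "user_id": 42,
    "url": "/products/123",
    "timestamp": time.time(),  # time-of-day clock at the producer
}

# Encode as a JSON text string for transmission; a binary encoding
# (e.g. Avro or Protocol Buffers) could be used instead.
encoded = json.dumps(event)
print(encoded)
```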
How do we send the messages as soon as they arrive?
DB triggers exist in relational databases, but they're limited in what they can do
Messaging Systems
A simple way is a Unix pipe or a TCP connection between producer and consumer
Publish / Subscribe model
- What to do if producers send messages faster than consumers can process them?
- Drop messages, buffer messages in a queue, or apply backpressure (block the producer from sending); the latter two are sketched below
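A minimal sketch of the buffering and backpressure options, using Python's bounded `queue.Queue` purely as an in-process stand-in for a messaging system:

```python
import queue

# Bounded buffer between producer and consumer (the capacity is arbitrary here).
buffer = queue.Queue(maxsize=100)

def send_with_backpressure(msg):
    # Blocks the producer when the buffer is full (backpressure / flow control).
    buffer.put(msg, block=True)

def send_or_drop(msg):
    # Drops the message when the buffer is full instead of blocking the producer.
    try:
        buffer.put_nowait(msg)
    except queue.Full:
        pass  # message dropped

send_with_backpressure({"type": "page_view", "user_id": 42})
```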
Direct Messaging from producers to consumers
- UDP multicast
- Brokerless messaging libraries like ZeroMQ
- StatsD uses unreliable UDP messaging for collecting metrics from all machines on the network
- If the consumer is exposed on the network, an HTTP or RPC request can be used
- This is the idea behind webhooks
Message Brokers
Essentially a DB designed for sending messages
Skipping because I’ve read most of this in Alex Xu’s book
What to do when memory is almost full?
Consumer offsets
What to do when consumers are overloaded?
Databases and Streams
We’ve drawn comparisons between message brokers and databases, even though traditionally, they’re considered different categories of tools.
We saw that log-based message brokers have been successful in taking ideas from databases and applying them to messaging.
We can also go in reverse: take ideas from messaging and streams and apply them to databases
We can store events like users clicking a button or sensor readings, but we can also store events like a write to a DB
A replication log is a stream of database write requests, produced by the leader. The followers apply that stream of writes to their own copy of the database.
The state machine replication principle from "Total Order Broadcast" states that if every event is a write to the database, and every replica processes the same events in the same order, then the replicas will all end up in the same final state (assuming processing an event is deterministic)
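A minimal sketch of that principle, using a hypothetical list of write events applied to two in-memory replicas:

```python
# Each replica is a simple key-value store; applying a write is deterministic.
def apply(replica, event):
    op, key, value = event
    if op == "set":
        replica[key] = value
    elif op == "delete":
        replica.pop(key, None)

# The replication log: every event is a write, in a single total order.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x", None), ("set", "y", 3)]

replica_a, replica_b = {}, {}
for event in log:           # both replicas process the same events...
    apply(replica_a, event)
for event in log:           # ...in the same order...
    apply(replica_b, event)

assert replica_a == replica_b == {"y": 3}   # ...and end up in the same final state
```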
Keeping Systems in Sync
Applications must combine different systems. No single system wins at everything.
We can use data warehouses and then query them
If a full periodic database dump is too slow, dual writes can sometimes be used instead
- Application writes to each system when data changes
Dual writes can suffer from race conditions though: if two clients write concurrently, the writes may arrive at the two systems in different orders, leaving them inconsistent
- One write may also fail while the other succeeds
Change Data Capture
The problem with DB replication logs is that they have traditionally been an internal implementation detail. Clients can't query the log directly.
More recently, DBs have started to expose this. It's called Change Data Capture (CDC)
Implementing Change Data Capture
Log consumers can be called derived data systems
CDC essentially makes one database the leader (the one the changes are captured from) and turns the others into followers. Database triggers can be used to implement change data capture
- Triggers tend to be fragile and have performance overheads
- Parsing the replication log can be a more robust approach
LinkedIn’s Databus, Facebook’s Wormhole, and Yahoo’s Sherpa use this idea at large scale.
Initial Snapshot
Having a log of all changes ever made allows reconstruction of the DB
Keeping all changes forever would require too much disk space, though, and replaying would take too long. The log needs to be truncated
Instead of storing everything, we need to start with a consistent snapshot. The snapshot of the DB must correspond to a known position / offset in the change log; some CDC tools provide this snapshot facility
Log Compaction
The storage engine goes through records with the same key, throws away duplicates, and keeps only the most recent value for each key. This runs periodically in the background
If a key is deleted (marked with a tombstone), compaction removes the earlier values for that key
Supported in Apache Kafka
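A minimal sketch of the idea over an in-memory log, assuming each entry is a (key, value) pair and `None` marks a deletion (a tombstone):

```python
def compact(log):
    # Keep only the most recent value for each key.
    latest = {}
    for key, value in log:
        latest[key] = value
    # Drop keys whose latest value is a tombstone (deletion marker).
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("a", 4)]
print(compact(log))  # [('a', 4)] -- 'b' was deleted, earlier values of 'a' discarded
```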
API Support for change streams
DBs are starting to support change streams as a first-class interface
- RethinkDB allows queries to subscribe to notifications when the results of a query change
- Firebase and CouchDB provide data synchronization based on a change feed
- VoltDB allows transactions to continuously export data from the DB in the form of a stream
Event Sourcing
Developed in the Domain-Driven Design (DDD) community. We'll discuss it briefly because it incorporates useful and relevant ideas for streaming systems
Event sourcing involves storing all changes to the application state as a log (like CDC)
The difference is:
- In CDC, the application uses the database in a mutable way, and the log of changes is extracted at a low level
- In event sourcing, application logic is built on the basis of immutable events written to an event log
Powerful technique in data modeling
- Easier to evolve applications over time
- Helps with debugging
- Guards against application bugs
Deriving current state from the event log
The event log must be converted into some version of the application state to be useful to users
- e.g., a shopping cart user wants to see the current contents of their cart, not just the raw history of items they added in the past
Replaying the log is an option, as in CDC (see the sketch at the end of this subsection)
Log Compaction needs to be handled differently
- Events are modeled at a higher level, so an event expresses the intent of a user action, not the mechanics of the state update
- Later events typically do not override prior events, so you need the full history of events to reconstruct the final state. Log compaction is not possible in the same way
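A minimal sketch of replaying a hypothetical shopping-cart event log to derive the current state; note that the events record the user's intent (item added / removed), not the resulting cart contents:

```python
def current_cart(events):
    # Derive the current application state by replaying the full event history.
    cart = {}
    for event in events:
        if event["type"] == "item_added":
            cart[event["item"]] = cart.get(event["item"], 0) + event["quantity"]
        elif event["type"] == "item_removed":
            cart.pop(event["item"], None)
    return cart

events = [
    {"type": "item_added", "item": "book", "quantity": 1},
    {"type": "item_added", "item": "pen", "quantity": 3},
    {"type": "item_removed", "item": "pen"},
]
print(current_cart(events))  # {'book': 1} -- the full history remains in the log
```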
Commands and Events
In event sourcing, there’s a difference between events and commands
A command is a request that may still be rejected (e.g., because it violates some constraint). Once validated and accepted, it becomes an event: a durable, immutable fact
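A minimal sketch of that distinction, using a hypothetical seat-reservation command that is validated before it becomes an event:

```python
event_log = []
reserved_seats = set()

def handle_reserve_seat(command):
    seat = command["seat"]
    # A command may be rejected if it violates a constraint...
    if seat in reserved_seats:
        raise ValueError(f"seat {seat} is already taken")
    # ...but once accepted, it becomes an immutable fact: an event.
    event = {"type": "seat_reserved", "seat": seat, "user": command["user"]}
    event_log.append(event)
    reserved_seats.add(seat)
    return event

handle_reserve_seat({"seat": "12A", "user": "alice"})
# handle_reserve_seat({"seat": "12A", "user": "bob"})  # would raise: constraint violated
```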
State, Streams and Immutability
Immutability benefits
Good for financial bookkeeping
If buggy code is deployed, it can't destructively overwrite old data, so recovery is easier
Deriving several views from same event log
- Druid (for data analytics) can ingest directly from Kafka
- Kafka Connect sinks can export data from Kafka to several different databases
Concurrency control
The biggest downside of event sourcing and CDC is that consumers of the event log are asynchronous, so a user may make a write to the log and then read from a log-derived view that hasn't yet been updated with that write
Limitations of immutability
Version control systems like Git, Mercurial, and Fossil also rely on immutable data to preserve version history of files
Truly deleting data is hard, since copies can live in many places. SSDs, storage engines, and filesystems often just write to a new location rather than overwriting in place. Backups are often deliberately immutable to prevent accidental deletion or corruption
Processing Streams
What can you do with a stream once you have it?
Three options broadly
- Take the data in the events and write it to a data store / DB / cache / search index
- Push events to users in some way (email, push notification)
- Produce one or many output streams
Rest of this chapter will discuss option 3, making other streams
Code that processes streams is known as an operator or job, very similar to Unix and MapReduce
The big difference is just that a stream never ends
Uses of Stream Processing
Stream Processing has long been used for monitoring purposes
- Fraud detection systems need to look at usage patterns
- Trading systems examine price changes
- Manufacturing systems track the status of machines
- Military and intelligence systems track the activities of potential aggressors
Complex event processing
CEP
Use a high-level declarative query language like SQL, or graphical user interface, to describe the patterns of events that should be detected.
Queries are stored long-term, and events continuously flow past them in search of a query whose event pattern they match
Examples: Esper, IBM InfoSphere Streams, Apama, TIBCO StreamBase, and SQLstream.
Stream analytics
Analytics on streams. Similar to CEP, but analytics don't care about specific event sequences; they focus more on aggregations and statistical metrics over a large number of events
Probabilistic algorithms are sometimes used, like Bloom filters for set membership (sketched below)
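A minimal sketch of a Bloom filter, as one example of the probabilistic data structures used in stream analytics (the sizes and hash scheme here are arbitrary choices for illustration):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: probabilistic set membership with no false negatives."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, item):
        # Derive several bit positions from independent-ish hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

seen = BloomFilter()
seen.add("user-42")
print("user-42" in seen)   # True
print("user-99" in seen)   # almost certainly False (small false-positive rate)
```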
Many Stream processing frameworks are designed with analytics in mind
- Apache Storm, Spark Streaming, Flink, Kafka Streams
- Azure Stream Analytics, Google Cloud Dataflow
Maintaining Materialized Views
Deriving an alternative view onto the DB / log of events and keeping that view up to date as new events arrive
Search on Streams
Sometimes we want to search for individual events based on complex criteria, like full-text search queries
ElasticSearch is good for this
- Percolator feature
Queries are stored and documents run past the queries
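A minimal sketch of that idea (not Elasticsearch's actual percolator API): stored queries are hypothetical sets of required terms, and each incoming document is run past every stored query.

```python
# Stored queries: each is a hypothetical set of terms that must all appear.
stored_queries = {
    "cheap-flats": {"apartment", "cheap"},
    "red-cars": {"car", "red"},
}

def percolate(document_text):
    """Return the names of the stored queries that the incoming document matches."""
    words = set(document_text.lower().split())
    return [name for name, terms in stored_queries.items() if terms <= words]

print(percolate("Cheap apartment available downtown"))  # ['cheap-flats']
print(percolate("Red car for sale"))                    # ['red-cars']
```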
Message Passing and RPC
skipping this
Reasoning About Time
Stream processors often need to deal with time
- Analytics frequently use time windows (e.g., "over the last 5 minutes")
Event Time vs. Processing Time
Processing may be delayed
- Queueing
- Network Faults
- Performance issues
- Restart of stream consumer
- Reprocessing of past events while recovering from a fault
Message delays can mess up message order
Confusing event time and processing time leads to bad data
Knowing when you're ready
Never sure when you have all the data for a window
Time out and declare a window ready after you haven't seen events for a while
- What if a straggler is there?
- Ignore it OR publish a correction
Whose clock to use?
When the user clicked? When the system received it?
Log three timestamps
- Time when the event occurred, according to the device clock
- Time when it was sent to the server, according to the device clock
- Time when it was received by the server, according to the server clock
Subtracting the second timestamp from the third estimates the offset between the device clock and the server clock, which can then be applied to correct the event timestamp
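A minimal sketch of that correction, assuming the network delay between sending and receiving is negligible (timestamps are plain numbers here, e.g. seconds since some epoch):

```python
def estimate_event_time(occurred_device, sent_device, received_server):
    # The device clock may be wrong; estimate its offset from the server clock
    # by comparing the send time (device clock) with the receive time (server
    # clock), assuming network delay is negligible.
    clock_offset = received_server - sent_device
    # Apply that offset to the device's event timestamp to estimate the true
    # event time in server-clock terms.
    return occurred_device + clock_offset

# Device clock runs 120 seconds fast: the event really happened at server time 1000.
print(estimate_event_time(occurred_device=1120, sent_device=1150, received_server=1030))  # 1000
```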
Types of Windows
Tumbling window -> Fixed length; every event belongs to exactly one window (see the sketch after this list)
Hopping window -> Fixed length, but windows can overlap in order to give some smoothing
Sliding window -> Contains all events that occur within some interval of each other
Session window -> No fixed duration; groups all events for a given user that occur closely together in time, ending when the user has been inactive for a while
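A minimal sketch of the first two window types, assuming integer timestamps in seconds and arbitrarily chosen window sizes:

```python
def tumbling_windows(timestamp, size=60):
    # Every event belongs to exactly one fixed-length window.
    start = timestamp - (timestamp % size)
    return [(start, start + size)]

def hopping_windows(timestamp, size=300, hop=60):
    # Fixed-length windows that overlap: an event can fall into several of them.
    windows = []
    first_start = timestamp - (timestamp % hop) - (size - hop)
    for start in range(first_start, timestamp + 1, hop):
        if start <= timestamp < start + size:
            windows.append((start, start + size))
    return windows

print(tumbling_windows(130))  # [(120, 180)]
print(hopping_windows(130))   # five overlapping 5-minute windows containing t=130
```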
Stream Joins
New events can appear at any time, which makes joins harder than in batch processing
Stream-Stream join (window join)
skipping
Stream-table join (stream enrichment)
skipping
Table-table join (materialized view maintenance)
skipping