So far, the book has described data systems as they are at present. In this final chapter, the focus shifts toward the future and how things should be
Opinions and speculation about the future are subjective; Kleppmann uses the first person for his own opinions and welcomes disagreement
The goal is to create reliable, scalable, and maintainable applications and systems. How do we design applications that are better than the ones of today?
- More robust, correct, evolvable, and beneficial to humanity
Data Integration
A recurring theme in the book is that every solution has different pros and cons
Log-structured storage vs. B-trees vs. column-oriented storage
Single-leader, multi-leader, and leaderless approaches to replication
There's a ton of software and tools that need to be considered for a specific application
The first challenge is then to figure out the mapping between software products and the circumstances they're good in. Vendors will only tell you the good things
The previous chapters should equip you to ask the right questions, read between the lines, and understand the tradeoffs
Another problem: in complex applications, data is often used in several different ways. No single piece of software can suit all the different circumstances
Combining Specialized Tools by Deriving Data
One common case is needing to support search queries, so you have an OLTP database plus a full-text search index. PostgreSQL can do both, but that only works for simple applications
Probably need something like Elasticsearch for bigger ones
Can keep data in several places to utilize different tools: a database, a data warehouse, and a search index
Kleppmann is surprised to see SWEs make statements like “In my experience, 99% of people only need X” or “…don’t need X”
Reasoning about dataflow
How do you get data into all the right places, in the right formats?
For example, all writes might go to a system-of-record database first; changes to that DB can then be captured (change data capture, CDC) and applied to the search index in the same order
Writing to the database is the only way of supplying new input into this system
Can't really allow the application to do dual writes, as that has the problems discussed in earlier chapters (race conditions, partial failures)
Can funnel all user input through a single system that decides the ordering for all writes, which makes it easier to derive other representations of the data by processing the writes in the same order
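A minimal sketch of this idea in Python (the names, like `apply_to_search_index`, are mine for illustration, not from the book): a single append-only log decides the order of writes, and each derived view applies the log in that same order.

```python
log = []  # append-only, totally ordered list of write events

def write(key, text):
    # All user input is funneled through this one function, which
    # decides the global order of writes by appending to the log.
    log.append({"key": key, "text": text})

db = {}            # system of record: key -> latest text
search_index = {}  # derived view: word -> set of keys containing it

def apply_to_db(event):
    db[event["key"]] = event["text"]

def apply_to_search_index(event):
    # Toy indexer; a real one would also remove stale entries.
    for word in event["text"].split():
        search_index.setdefault(word, set()).add(event["key"])

write("doc1", "hello world")
write("doc1", "hello again")

# Every consumer processes the log in the same order, so the derived
# views agree on the effect of the writes.
for event in log:
    apply_to_db(event)
    apply_to_search_index(event)

print(db["doc1"])            # 'hello again' (last write wins)
print(sorted(search_index))  # ['again', 'hello', 'world']
```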
Derived Data Versus Distributed Transactions
Distributed transactions and derived data have similar goals, but they go about them differently
Transaction systems provide strong consistency (linearizability)
- Derived data systems are often updated asynchronously, offering weaker guarantees
Within limited environments that are willing to pay the cost of distributed transactions, they have been used successfully
Kleppmann thinks XA has poor fault tolerance and performance characteristics
- Thinks it’s possible to create a better protocol for distributed transactions, but getting such a protocol widely adopted and integrated with existing tools would be challenging
Without a good protocol for distributed transactions, Kleppmann thinks log-based derived data is the most promising approach for integrating different data systems
Limits of Total Ordering
A totally ordered event log is feasible in small systems (single-leader replication provides exactly this)
Limits appear with bigger and more complex workloads
- Total order requires passing all events through a single leader node, but that caps throughput at what one node can handle
- Servers may be spread across multiple geographically distributed datacenters
- Typically need a separate leader in each datacenter in case one goes down, which means there is no single total order across datacenters
Applications deployed as microservices have their own durable state, each service deployed as an independent unit
- State is typically not shared between services, so there is no defined order for events across services
Some applications maintain client-side state that is updated immediately on user input
- Client and server will see different orders of events
Deciding on a total ordering of events is equivalent to consensus; designing consensus algorithms that scale beyond single-node throughput is still an open problem
Ordering events to capture causality
Logical timestamps can provide ordering without coordination (see the sketch after this list)
Log an event to record the state of the system the user saw before making a decision, and reference that event's ID in later events
Conflict resolution algorithms can handle events that arrive in an unexpected order
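As a concrete example of the first option, here is a minimal Lamport-timestamp sketch (a standard technique; the class and method names are mine, not the book's):

```python
class LamportClock:
    """Logical clock: causally later events get larger timestamps."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receiving a message, jump ahead of the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()       # event on node A
t2 = b.receive(t1)  # A's message arrives at B
assert t2 > t1      # B's event is causally after A's, and its timestamp shows it
```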
Batch and Stream Processing
Spark performs stream processing on top of a batch processing engine by breaking the stream into microbatches
Flink performs batch processing on top of a stream processing engine
Maintaining derived state
Batch processing encourages deterministic, pure functions: outputs depend only on the inputs, there are no side effects, and inputs are treated as immutable
Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch
Can maintain the old and new schemas side by side as two independent derived views of the same underlying data (see the sketch below)
At every stage, you have the option of going back if something goes wrong
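A tiny sketch of that idea (the event shape and field names are invented for illustration): both schemas are pure, deterministic derivations of the same immutable events, so either view can be rebuilt from scratch or retired at any time.

```python
events = [  # the underlying dataset: immutable facts
    {"user": "alice", "fullname": "Alice Liddell"},
    {"user": "bob", "fullname": "Bob Cratchit"},
]

def old_view(events):
    # Old schema: fullname as a single string.
    return {e["user"]: e["fullname"] for e in events}

def new_view(events):
    # New schema: fullname split into first/last. Building it is just
    # another pure derivation over the same events, not a migration.
    def split(name):
        first, _, last = name.partition(" ")
        return {"first": first, "last": last}
    return {e["user"]: split(e["fullname"]) for e in events}

# Both views coexist; readers can be shifted over gradually.
print(old_view(events)["alice"])  # 'Alice Liddell'
print(new_view(events)["alice"])  # {'first': 'Alice', 'last': 'Liddell'}
```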
Lambda Architecture
Combine batch and stream processing
This was briefly mentioned in Xu’s book too
Idea is that incoming data should be recorded by appending immutable events to an always-growing dataset, similar to event sourcing
Read-optimized views are derived from these events. Lambda architecture runs two systems in parallel, a batch processor and a stream processor
The stream processor quickly produces an approximate update to the view; the batch processor later reprocesses the events and corrects the approximation
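A minimal sketch of the two layers (illustrative, not a real framework): the stream layer updates a view incrementally and may drift, while the batch layer recomputes the exact view from the full event history.

```python
events = []        # always-growing dataset of immutable events
approx_views = {}  # speed-layer output (may drift, e.g. on retries)

def stream_update(event):
    # Fast and incremental, but possibly approximate: for instance, it
    # could double-count if an event were retried after a crash.
    events.append(event)
    approx_views[event["page"]] = approx_views.get(event["page"], 0) + 1

def batch_recompute():
    # Slow but exact: rereads the whole event history and rebuilds the
    # view from scratch, correcting any drift in the speed layer.
    exact = {}
    for e in events:
        exact[e["page"]] = exact.get(e["page"], 0) + 1
    return exact

stream_update({"page": "/home"})
stream_update({"page": "/home"})
print(approx_views)       # quick answer, available immediately
print(batch_recompute())  # authoritative answer, produced later
```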
Unifying batch and stream processing
Recent work enables the benefits of the lambda architecture without its downsides, by supporting both batch and stream computations in the same system
Unbundling Databases
Unix is simpler in the sense that it is a fairly thin wrapper around hardware resources
Relational databases allow short declarative queries, hiding the complexities of concurrency, indexing, and crash recovery
The NoSQL movement prefers a Unix-esque low-level approach in the domain of distributed OLTP storage
Kleppmann will try to reconcile the two philosophies
Composing Data Storage Technologies
Secondary Indexes -> Faster searching for records based on the value of a field
Materialized Views -> Precomputed caches of query results
Replication Logs -> Copies of data that keep other nodes up to date
Full-Text Search Indexes -> Allow keyword search in text
Creating an Index
Creating an index is like reprocessing an existing data set to give a new view
- It scans the table, sorts it, and then writes out the index
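A toy sketch of that reprocessing step (table and field names invented for illustration): scan the rows, sort (value, id) pairs, and the sorted pairs are the index.

```python
from bisect import bisect_left

table = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Bergen"},
    {"id": 3, "city": "Oslo"},
]

# Scan the table and sort (field value, row id) pairs; the sorted list
# *is* the secondary index, a new view derived from the existing data.
index = sorted((row["city"], row["id"]) for row in table)

def lookup(city):
    # Binary-search to the first entry for this city, then collect ids.
    i = bisect_left(index, (city,))
    ids = []
    while i < len(index) and index[i][0] == city:
        ids.append(index[i][1])
        i += 1
    return ids

print(lookup("Oslo"))  # [1, 3]
```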
Meta-database of everything
Kleppmann thinks the dataflow across an entire organization looks like one huge database. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it acts like a database subsystem that keeps indexes or materialized views up to date
Batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines.
Where will these developments take us in the future?
Kleppmann sees two avenues:
Federated Databases
- Unifying reads
- Unified query interface to a wide variety of underlying storage engines and processing methods (see the sketch below)
- Also known as a polystore; PostgreSQL's foreign data wrapper feature is an example
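A highly simplified sketch of the federated idea (plain dicts stand in for the underlying engines; this is not PostgreSQL's actual FDW interface): one read interface routes queries to whichever backend holds the data.

```python
relational = {"users": [{"id": 1, "name": "alice"}]}  # stand-in for a SQL store
documents = {"posts": [{"id": 9, "text": "hello"}]}   # stand-in for a doc store

def query(collection):
    # The federation layer routes reads to whichever backend owns the
    # data, presenting callers with one uniform query interface.
    for backend in (relational, documents):
        if collection in backend:
            return backend[collection]
    raise KeyError(collection)

print(query("users"))  # served from the relational stand-in
print(query("posts"))  # served from the document stand-in
```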
Unbundled Databases
- Unifying writes
- Federation addresses read-only querying across different systems, whereas unbundling is about writes
- Can synchronize writes across systems by unbundling the database's index-maintenance features (sketch below)
- Like the Unix tradition of small tools that do one thing well
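A matching sketch of the unbundled write path (names invented for illustration): several small, independent components consume one ordered log, each maintaining its own derived state, like unbundled database subsystems.

```python
event_log = []  # the shared, ordered stream of writes

def emit(event):
    event_log.append(event)

# Two unbundled components, each the moral equivalent of one database
# subsystem: a key-value store and a by-city index.
kv_store = {}
city_index = {}

def kv_consumer(event):
    kv_store[event["id"]] = event

def index_consumer(event):
    city_index.setdefault(event["city"], []).append(event["id"])

emit({"id": 1, "city": "Oslo"})
emit({"id": 2, "city": "Bergen"})

# Applying the same log in the same order keeps the pieces consistent
# without a distributed transaction spanning them.
for e in event_log:
    kv_consumer(e)
    index_consumer(e)

print(kv_store[1]["city"])  # 'Oslo'
print(city_index["Oslo"])   # [1]
```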