So far, the book has described data systems as they are at present. In this final chapter, the focus shifts toward the future and how things should be
Opinions and speculation about the future are subjective; Kleppmann uses the first person for his own opinions and welcomes disagreement
The goal is to create reliable, scalable, and maintainable applications and systems. How do we design applications that are better than the ones of today?
- More robust, correct, evolvable, and beneficial to humanity
Data Integration
A recurring theme in the book is that every solution has different pros and cons
Log-structured storage vs. B-trees vs. column-oriented storage
Single-leader, multi-leader, and leaderless approaches to replication
There's a ton of software and tools that need to be considered for a specific application
The first challenge is then to figure out the mapping between software products and the circumstances they're good in. Vendors will only tell you the good things
The previous chapters should equip you to ask the right questions, read between the lines, and understand the tradeoffs
Another problem: in complex applications, data is often used in several different ways. No single piece of software can suit all the different circumstances
Combining Specialized Tools by Deriving Data
One common case is needing to support search queries, so you have an OLTP database plus a full-text search index. PostgreSQL can do both, but that only works for simple applications
Probably need something like Elasticsearch for bigger ones
Can keep data in several places to utilize different tools: a database, a data warehouse, and a search index
Kleppmann is surprised to see SWEs make statements like “In my experience, 99% of people only need X” or “…don’t need X”
Reasoning about dataflow
How do you get data into all the right places, in the right formats?
For example, all writes might go to a system-of-record database first; changes to that DB can then be captured (change data capture, CDC) and applied to the search index in the same order
Writing to the database is the only way of supplying new input into this system
Can't really allow the application to do dual writes, as that has the problems discussed in earlier chapters (race conditions, partial failures)
Can funnel all user input through a single system that decides the ordering for all writes, which makes it easier to derive other representations of the data by processing the writes in the same order
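A minimal sketch of this idea in Python (the names, like `apply_to_search_index`, are mine for illustration, not from the book): a single append-only log decides the order of writes, and each derived view applies the log in that same order.

```python
log = []  # append-only, totally ordered list of write events

def write(key, text):
    # All user input is funneled through this one function, which
    # decides the global order of writes by appending to the log.
    log.append({"key": key, "text": text})

db = {}            # system of record: key -> latest text
search_index = {}  # derived view: word -> set of keys containing it

def apply_to_db(event):
    db[event["key"]] = event["text"]

def apply_to_search_index(event):
    # Toy indexer; a real one would also remove stale entries.
    for word in event["text"].split():
        search_index.setdefault(word, set()).add(event["key"])

write("doc1", "hello world")
write("doc1", "hello again")

# Every consumer processes the log in the same order, so the derived
# views agree on the effect of the writes.
for event in log:
    apply_to_db(event)
    apply_to_search_index(event)

print(db["doc1"])            # 'hello again' (last write wins)
print(sorted(search_index))  # ['again', 'hello', 'world']
```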
Derived Data Versus Distributed Transactions
Distributed transactions and derived data have similar goals, but they go about them differently
Transaction systems provide strong consistency (linearizability)
- Derived data systems are often updated asynchronously, offering weaker guarantees
Within limited environments that are willing to pay the cost of distributed transactions, they have been used successfully
Kleppmann thinks XA has poor fault tolerance and performance characteristics
- Thinks it’s possible to create a better protocol for distributed transactions, but getting such a protocol widely adopted and integrated with existing tools would be challenging
Without a good protocol for distributed transactions, Kleppmann thinks log-based derived data is the most promising approach for integrating different data systems
Limits of Total Ordering
A totally ordered event log is feasible in small systems (single-leader replication provides exactly this)
Limits appear with bigger and more complex workloads
- Total order requires passing all events through a single leader node, but that caps throughput at what one node can handle
- Servers may be spread across multiple geographically distributed datacenters
- Typically need a separate leader in each datacenter in case one goes down, which means there is no single total order across datacenters
Applications deployed as microservices have their own durable state, each service deployed as an independent unit
- State is typically not shared between services, so there is no defined order for events across services
Some applications maintain client-side state that is updated immediately on user input
- Client and server will see different orders of events
Deciding on a total ordering of events is equivalent to consensus; designing consensus algorithms that scale beyond single-node throughput is still an open problem
Ordering events to capture causality
Logical timestamps can provide ordering without coordination (see the sketch after this list)
Log an event to record the state of the system the user saw before making a decision, and reference that event's ID in later events
Conflict resolution algorithms can handle events that arrive in an unexpected order
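As a concrete example of the first option, here is a minimal Lamport-timestamp sketch (a standard technique; the class and method names are mine, not the book's):

```python
class LamportClock:
    """Logical clock: causally later events get larger timestamps."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On receiving a message, jump ahead of the sender's timestamp.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t1 = a.tick()       # event on node A
t2 = b.receive(t1)  # A's message arrives at B
assert t2 > t1      # B's event is causally after A's, and its timestamp shows it
```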
Batch and Stream Processing
Spark performs stream processing on top of a batch processing engine by breaking the stream into microbatches
Flink performs batch processing on top of a stream processing engine
Maintaining derived state
Batch processing encourages deterministic, pure functions: outputs depend only on the inputs, there are no side effects, and inputs are treated as immutable
Derived views allow gradual evolution. If you want to restructure a dataset, you do not need to perform the migration as a sudden switch
Can maintain the old and new schemas side by side as two independent derived views of the same underlying data (see the sketch below)
At every stage, you have the option of going back if something goes wrong
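A tiny sketch of that idea (the event shape and field names are invented for illustration): both schemas are pure, deterministic derivations of the same immutable events, so either view can be rebuilt from scratch or retired at any time.

```python
events = [  # the underlying dataset: immutable facts
    {"user": "alice", "fullname": "Alice Liddell"},
    {"user": "bob", "fullname": "Bob Cratchit"},
]

def old_view(events):
    # Old schema: fullname as a single string.
    return {e["user"]: e["fullname"] for e in events}

def new_view(events):
    # New schema: fullname split into first/last. Building it is just
    # another pure derivation over the same events, not a migration.
    def split(name):
        first, _, last = name.partition(" ")
        return {"first": first, "last": last}
    return {e["user"]: split(e["fullname"]) for e in events}

# Both views coexist; readers can be shifted over gradually.
print(old_view(events)["alice"])  # 'Alice Liddell'
print(new_view(events)["alice"])  # {'first': 'Alice', 'last': 'Liddell'}
```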
Lambda Architecture
Combine batch and stream processing
This was briefly mentioned in Xu’s book too
Idea is that incoming data should be recorded by appending immutable events to an always-growing dataset, similar to event sourcing
Read-optimized views are derived from these events. Lambda architecture runs two systems in parallel, a batch processor and a stream processor
The stream processor quickly produces an approximate update to the view; the batch processor later reprocesses the events and corrects the approximation
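A minimal sketch of the two layers (illustrative, not a real framework): the stream layer updates a view incrementally and may drift, while the batch layer recomputes the exact view from the full event history.

```python
events = []        # always-growing dataset of immutable events
approx_views = {}  # speed-layer output (may drift, e.g. on retries)

def stream_update(event):
    # Fast and incremental, but possibly approximate: for instance, it
    # could double-count if an event were retried after a crash.
    events.append(event)
    approx_views[event["page"]] = approx_views.get(event["page"], 0) + 1

def batch_recompute():
    # Slow but exact: rereads the whole event history and rebuilds the
    # view from scratch, correcting any drift in the speed layer.
    exact = {}
    for e in events:
        exact[e["page"]] = exact.get(e["page"], 0) + 1
    return exact

stream_update({"page": "/home"})
stream_update({"page": "/home"})
print(approx_views)       # quick answer, available immediately
print(batch_recompute())  # authoritative answer, produced later
```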
Unifying batch and stream processing
Recent work enables the benefits of the lambda architecture without its downsides, by supporting both batch and stream computations in the same system
Unbundling Databases
Unix is simpler in the sense that it is a fairly thin wrapper around hardware resources
Relational databases allow short declarative queries, hiding the complexities of concurrency, indexing, and crash recovery
The NoSQL movement prefers a Unix-esque low-level approach in the domain of distributed OLTP storage
Kleppmann will try to reconcile the two philosophies
Composing Data Storage Technologies
Secondary Indexes -> Faster searching for records based on the value of a field
Materialized Views -> Precomputed caches of query results
Replication Logs -> Copies of data that keep other nodes up to date
Full-Text Search Indexes -> Allow keyword search in text
Creating an Index
Creating an index is like reprocessing an existing data set to give a new view
- It scans the table, sorts it, and then writes out the index
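A toy sketch of that reprocessing step (table and field names invented for illustration): scan the rows, sort (value, id) pairs, and the sorted pairs are the index.

```python
from bisect import bisect_left

table = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Bergen"},
    {"id": 3, "city": "Oslo"},
]

# Scan the table and sort (field value, row id) pairs; the sorted list
# *is* the secondary index, a new view derived from the existing data.
index = sorted((row["city"], row["id"]) for row in table)

def lookup(city):
    # Binary-search to the first entry for this city, then collect ids.
    i = bisect_left(index, (city,))
    ids = []
    while i < len(index) and index[i][0] == city:
        ids.append(index[i][1])
        i += 1
    return ids

print(lookup("Oslo"))  # [1, 3]
```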
Meta-database of everything
Kleppmann thinks the dataflow across an entire organization looks like one huge database. Whenever a batch, stream, or ETL process transports data from one place and form to another place and form, it acts like a database subsystem that keeps indexes or materialized views up to date
Batch and stream processors are like elaborate implementations of triggers, stored procedures, and materialized view maintenance routines.
Where will these developments take us in the future?
Kleppmann sees two avenues:
Federated Databases
- Unifying reads
- Unified query interface to a wide variety of underlying storage engines and processing methods (see the sketch below)
- Also known as a polystore; PostgreSQL's foreign data wrapper feature is an example
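A highly simplified sketch of the federated idea (plain dicts stand in for the underlying engines; this is not PostgreSQL's actual FDW interface): one read interface routes queries to whichever backend holds the data.

```python
relational = {"users": [{"id": 1, "name": "alice"}]}  # stand-in for a SQL store
documents = {"posts": [{"id": 9, "text": "hello"}]}   # stand-in for a doc store

def query(collection):
    # The federation layer routes reads to whichever backend owns the
    # data, presenting callers with one uniform query interface.
    for backend in (relational, documents):
        if collection in backend:
            return backend[collection]
    raise KeyError(collection)

print(query("users"))  # served from the relational stand-in
print(query("posts"))  # served from the document stand-in
```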
Unbundled Databases
- Unifying writes
- Federation addresses read-only querying across different systems, whereas unbundling is about writes
- Can synchronize writes across systems by unbundling the database's index-maintenance features (sketch below)
- Like the Unix tradition of small tools that do one thing well
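A matching sketch of the unbundled write path (names invented for illustration): several small, independent components consume one ordered log, each maintaining its own derived state, like unbundled database subsystems.

```python
event_log = []  # the shared, ordered stream of writes

def emit(event):
    event_log.append(event)

# Two unbundled components, each the moral equivalent of one database
# subsystem: a key-value store and a by-city index.
kv_store = {}
city_index = {}

def kv_consumer(event):
    kv_store[event["id"]] = event

def index_consumer(event):
    city_index.setdefault(event["city"], []).append(event["id"])

emit({"id": 1, "city": "Oslo"})
emit({"id": 2, "city": "Bergen"})

# Applying the same log in the same order keeps the pieces consistent
# without a distributed transaction spanning them.
for e in event_log:
    kv_consumer(e)
    index_consumer(e)

print(kv_store[1]["city"])  # 'Oslo'
print(city_index["Oslo"])   # [1]
```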