Designing Data-Intensive Applications - Chapter 4: Encoding and Evolution

Applications inevitably change over time, and as features change, the data they store usually needs to change with them.

Old and new data may coexist in the system at the same time, so you need to maintain compatibility in both directions: backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code).

Backward compatibility is typically easier: the newer code knows the older data format and can handle it explicitly. Forward compatibility is harder because older code must ignore additions made by newer code it knows nothing about.

This chapter will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. Kleppmann then goes over how encoded data moves between processes over HTTP/REST, RPC, and message queues.

Formats for Encoding Data

Programs usually work with data in two representations: an in-memory form (objects, structs, lists, hash tables, trees) optimized for CPU access, and a self-contained byte sequence for writing to a file or sending over the network.

Translating from the in-memory representation to a byte sequence is called encoding (also serialization or marshalling); the reverse is called decoding (parsing, deserialization, unmarshalling).
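A minimal sketch of the round trip in Python (names are illustrative), using JSON as the encoding:

import json

# In-memory representation: dicts/lists with pointers, optimized for CPU access
person = {"userName": "Martin", "interests": ["hacking"]}

# Encoding (serialization / marshalling): in-memory object -> byte sequence
data = json.dumps(person).encode("utf-8")

# Decoding (parsing / deserialization / unmarshalling): bytes -> in-memory object
assert json.loads(data) == person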

Language Specific Formats

Many languages have built-in support for encoding in-memory objects into byte sequences (e.g., Java's Serializable, Python's pickle), but these formats tie your data to one language, are a common source of security problems, and tend to version and perform poorly.
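For instance, Python's built-in pickle (a minimal sketch; Java's Serializable and Ruby's Marshal are analogous):

import pickle

person = {"userName": "Martin", "favoriteNumber": 1337}

data = pickle.dumps(person)    # in-memory object -> Python-specific byte sequence
restored = pickle.loads(data)  # only safe if the bytes come from a trusted source
assert restored == person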

JSON, XML, and Binary Variants

JSON, XML, and CSV are textual formats and are somewhat human-readable, though each has problems, e.g., ambiguity around numbers and no support for binary strings.

Binary Encoding

For data that's only used internally, you can use the most compact and fastest encoding format, since you don't need to agree on a lowest-common-denominator format with external parties.

There are binary encodings of JSON (e.g., MessagePack, BSON) and of XML (e.g., WBXML).

You have to decide whether a space saving of around 20% is worth giving up human readability.
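A rough size comparison, as a sketch (assumes the third-party msgpack package; the record mirrors the book's example, which encodes to 81 bytes as compact JSON vs. 66 bytes as MessagePack):

import json
import msgpack  # third-party package: pip install msgpack

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # 81 vs 66 bytes -- roughly a 20% saving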

Thrift and Protocol Buffers

Thrift (Facebook) and Protocol Buffers (Google) are binary encoding libraries based on the same idea: you define a schema for your data, and generated code encodes and decodes records to a compact binary format in which each field is identified by its numeric tag from the schema. An example Thrift schema (IDL):

struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}

Field tags are what make forward and backward compatibility work: new fields get new tag numbers; old code skips tags it doesn't recognize (forward compatibility), and new code can read old data as long as every field added after the initial schema is optional or has a default (backward compatibility).
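A toy illustration of the tag mechanism (not the real Thrift or Protocol Buffers wire format): values are keyed by numeric tag, and the decoder skips any tag it doesn't recognize, which is what keeps old code forward compatible:

import struct

def encode(fields):
    # fields: list of (tag, value_bytes) pairs -> tag/length/value byte sequence
    out = b""
    for tag, value in fields:
        out += struct.pack(">BH", tag, len(value)) + value
    return out

def decode(data, known_tags):
    # Returns {tag: value} for known tags; unknown tags are silently skipped
    result, i = {}, 0
    while i < len(data):
        tag, length = struct.unpack_from(">BH", data, i)
        i += 3
        value = data[i:i + length]
        i += length
        if tag in known_tags:
            result[tag] = value
    return result

# Newer code writes field 4; older code that only knows tags 1 and 2 still works
blob = encode([(1, b"Martin"), (2, b"1337"), (4, b"added later")])
print(decode(blob, known_tags={1, 2}))  # {1: b'Martin', 2: b'1337'}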

Avro

Avro also uses a schema to specify the structure of the data being encoded, but the encoding contains no tag numbers or field names at all. Instead it distinguishes the writer's schema (used to encode the data) from the reader's schema (what the decoder expects) and resolves the differences between them at decode time.
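A sketch of schema resolution using the third-party fastavro package (the schemas here are illustrative): the writer encodes with its schema; the reader supplies its own, and Avro fills in defaults for fields the writer didn't know about and drops fields the reader doesn't care about:

import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "interests",  # new field, unknown to the writer
         "type": {"type": "array", "items": "string"}, "default": []},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"userName": "Martin", "favoriteNumber": 1337}])

buf.seek(0)
for record in fastavro.reader(buf, reader_schema):
    print(record)  # {'userName': 'Martin', 'interests': []}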

Dataflow -> Databases / REST & RPC / Message Queues

The rest of the chapter goes over common ways of moving encoded data from one process to another:

Databases / Service Calls (REST and RPC) / Asynchronous Message Passing (Message Queues)

Dataflow Through Databases

The process that writes to the database encodes the data; the process that reads from the database decodes it.

Need to be careful about losing fields: if old code reads a record written by newer code, updates it, and writes it back, it can silently drop fields it doesn't know about.
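A toy sketch of that hazard (field names are illustrative): old code that only models the fields it knows reads a row, updates it, and writes it back, losing the newer field:

row_in_db = {"userName": "Martin", "favoriteNumber": 1337, "photoUrl": "http://..."}

KNOWN_FIELDS = {"userName", "favoriteNumber"}  # all the old code knows about
record = {k: v for k, v in row_in_db.items() if k in KNOWN_FIELDS}

record["favoriteNumber"] = 42  # the update the old code meant to make
row_in_db = record             # written back: photoUrl has been lost
print(row_in_db)               # {'userName': 'Martin', 'favoriteNumber': 42}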

Multiple processes running different versions of the code may write to the database at the same time, so any given value may have been written by any version; data outlives code. Schema changes such as adding a column with a null default let old rows be read as if they were written under the new schema.

When taking a snapshot/dump of the database (e.g., for backup or for loading into a data warehouse), the dump is typically encoded using the latest schema, in a format like Avro.

Dataflow Through Services (REST / RPC)

Clients make requests to servers

Web browsers and native applications have to agree with the server on the format of the requests and responses (the API)

Servers can also make requests to other servers

Goal is to make applications easier to change and maintain by having a team own each service and be able to deploy it independently
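A minimal sketch of this kind of dataflow, JSON over HTTP using only the standard library (the endpoint is hypothetical):

import json
import urllib.request

req = urllib.request.Request(
    "https://api.example.com/users/42",   # hypothetical REST endpoint
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    user = json.load(resp)  # decode the server's JSON response
print(user["userName"])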

Web Services

Remote Procedure Calls

Message Queues

Distributed Actor Frameworks