Designing Data-Intensive Applications - Chapter 4: Encoding and Evolution

Applications inevitably change over time, and as features change, the data they store usually needs to change with them.

Old and new data may coexist in the system at the same time, so you need to maintain compatibility in both directions: backward compatibility (newer code can read data written by older code) and forward compatibility (older code can read data written by newer code).

Backward compatibility is typically easier: the newer code knows the older data format and can handle it explicitly. Forward compatibility is harder because older code must ignore additions made by newer code it knows nothing about.

This chapter will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. Kleppmann then goes over how encoded data moves between processes over HTTP/REST, RPC, and message queues.

Formats for Encoding Data

Programs usually work with data in two representations: an in-memory form (objects, structs, lists, hash tables, trees) optimized for CPU access, and a self-contained byte sequence for writing to a file or sending over the network.

Translating from the in-memory representation to a byte sequence is called encoding (also serialization or marshalling); the reverse is called decoding (parsing, deserialization, unmarshalling).
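A minimal sketch of the round trip in Python (names are illustrative), using JSON as the encoding:

import json

# In-memory representation: dicts/lists with pointers, optimized for CPU access
person = {"userName": "Martin", "interests": ["hacking"]}

# Encoding (serialization / marshalling): in-memory object -> byte sequence
data = json.dumps(person).encode("utf-8")

# Decoding (parsing / deserialization / unmarshalling): bytes -> in-memory object
assert json.loads(data) == person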

Language Specific Formats

Many languages have built-in support for encoding in-memory objects into byte sequences (e.g., Java's Serializable, Python's pickle), but these formats tie your data to one language, are a common source of security problems, and tend to version and perform poorly.
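For instance, Python's built-in pickle (a minimal sketch; Java's Serializable and Ruby's Marshal are analogous):

import pickle

person = {"userName": "Martin", "favoriteNumber": 1337}

data = pickle.dumps(person)    # in-memory object -> Python-specific byte sequence
restored = pickle.loads(data)  # only safe if the bytes come from a trusted source
assert restored == person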

JSON, XML, and Binary Variants

JSON, XML, and CSV are textual formats and are somewhat human-readable, though each has problems, e.g., ambiguity around numbers and no support for binary strings.

Binary Encoding

For data that's only used internally, you can use the most compact and fastest encoding format, since you don't need to agree on a lowest-common-denominator format with external parties.

There are binary encodings of JSON (e.g., MessagePack, BSON) and of XML (e.g., WBXML).

You have to decide whether a space saving of around 20% is worth giving up human readability.
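A rough size comparison, as a sketch (assumes the third-party msgpack package; the record mirrors the book's example, which encodes to 81 bytes as compact JSON vs. 66 bytes as MessagePack):

import json
import msgpack  # third-party package: pip install msgpack

record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))  # 81 vs 66 bytes -- roughly a 20% saving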

Thrift and Protocol Buffers

Thrift (Facebook) and Protocol Buffers (Google) are binary encoding libraries based on the same idea: you define a schema for your data, and generated code encodes and decodes records to a compact binary format in which each field is identified by its numeric tag from the schema. An example Thrift schema (IDL):

struct Person {
  1: required string       userName,
  2: optional i64          favoriteNumber,
  3: optional list<string> interests
}

Field tags are what make forward and backward compatibility work: new fields get new tag numbers; old code skips tags it doesn't recognize (forward compatibility), and new code can read old data as long as every field added after the initial schema is optional or has a default (backward compatibility).
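A toy illustration of the tag mechanism (not the real Thrift or Protocol Buffers wire format): values are keyed by numeric tag, and the decoder skips any tag it doesn't recognize, which is what keeps old code forward compatible:

import struct

def encode(fields):
    # fields: list of (tag, value_bytes) pairs -> tag/length/value byte sequence
    out = b""
    for tag, value in fields:
        out += struct.pack(">BH", tag, len(value)) + value
    return out

def decode(data, known_tags):
    # Returns {tag: value} for known tags; unknown tags are silently skipped
    result, i = {}, 0
    while i < len(data):
        tag, length = struct.unpack_from(">BH", data, i)
        i += 3
        value = data[i:i + length]
        i += length
        if tag in known_tags:
            result[tag] = value
    return result

# Newer code writes field 4; older code that only knows tags 1 and 2 still works
blob = encode([(1, b"Martin"), (2, b"1337"), (4, b"added later")])
print(decode(blob, known_tags={1, 2}))  # {1: b'Martin', 2: b'1337'}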

Avro

Avro also uses a schema to specify the structure of the data being encoded, but the encoding contains no tag numbers or field names at all. Instead it distinguishes the writer's schema (used to encode the data) from the reader's schema (what the decoder expects) and resolves the differences between them at decode time.
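A sketch of schema resolution using the third-party fastavro package (the schemas here are illustrative): the writer encodes with its schema; the reader supplies its own, and Avro fills in defaults for fields the writer didn't know about and drops fields the reader doesn't care about:

import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

reader_schema = fastavro.parse_schema({
    "type": "record", "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "interests",  # new field, unknown to the writer
         "type": {"type": "array", "items": "string"}, "default": []},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"userName": "Martin", "favoriteNumber": 1337}])

buf.seek(0)
for record in fastavro.reader(buf, reader_schema):
    print(record)  # {'userName': 'Martin', 'interests': []}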

Dataflow -> Databases / REST & RPC / Message Queues

The rest of the chapter goes over common ways of moving encoded data from one process to another:

Databases / Service Calls (REST and RPC) / Asynchronous Message Passing (Message Queues)

Dataflow Through Databases

The process that writes to the database encodes the data; the process that reads from the database decodes it.

Need to be careful about losing fields: if old code reads a record written by newer code, updates it, and writes it back, it can silently drop fields it doesn't know about.
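A toy sketch of that hazard (field names are illustrative): old code that only models the fields it knows reads a row, updates it, and writes it back, losing the newer field:

row_in_db = {"userName": "Martin", "favoriteNumber": 1337, "photoUrl": "http://..."}

KNOWN_FIELDS = {"userName", "favoriteNumber"}  # all the old code knows about
record = {k: v for k, v in row_in_db.items() if k in KNOWN_FIELDS}

record["favoriteNumber"] = 42  # the update the old code meant to make
row_in_db = record             # written back: photoUrl has been lost
print(row_in_db)               # {'userName': 'Martin', 'favoriteNumber': 42}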

Multiple processes running different versions of the code may write to the database at the same time, so any given value may have been written by any version; data outlives code. Schema changes such as adding a column with a null default let old rows be read as if they were written under the new schema.

When taking a snapshot/dump of the database (e.g., for backup or for loading into a data warehouse), the dump is typically encoded using the latest schema, in a format like Avro.

Dataflow Through Services (REST / RPC)

Clients make requests to servers

Web browsers and native applications have to agree with the server on the format of the requests and responses (the API)

Servers can also make requests to other servers

Goal is to make applications easier to change and maintain by having a team own each service and be able to deploy it independently
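A minimal sketch of this kind of dataflow, JSON over HTTP using only the standard library (the endpoint is hypothetical):

import json
import urllib.request

req = urllib.request.Request(
    "https://api.example.com/users/42",   # hypothetical REST endpoint
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    user = json.load(resp)  # decode the server's JSON response
print(user["userName"])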

Web Services

Remote Procedure Calls

Message Queues

Distributed Actor Frameworks