Applications inevitably change over time
- Features are added / new user requirements come in / business circumstances change
- A change to the application often means a change to its data, like adding a new field or changing how an object is represented
Code changes usually can’t be rolled out everywhere at once
- Server-side applications can perform a rolling upgrade
- Deploy to a few nodes at a time and check whether everything is working, then deploy to all
- Client-side -> Some users will never upgrade their software so you need backwards compatibility
Old and new code, and old and new data, may co-exist in the system at the same time. Need to maintain backward compatibility and forward compatibility
- Backward compatibility: newer code can read data written by older code
- Forward compatibility: older code can read data written by newer code
Backward compatibility is typically easier (as the author of the newer code, you know the old format and can handle it explicitly)
Forward compatibility is harder because the old code must ignore additions made by newer code
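A minimal sketch of the two directions, assuming a hypothetical record that gains an `email` field in version 2 (the field names and reader functions are made up for illustration):

```python
import json

# Hypothetical record layouts: v1 has only "name", v2 adds "email".
def read_v1(raw: str) -> dict:
    """Old reader: picks out the fields it knows and ignores the rest."""
    doc = json.loads(raw)
    return {"name": doc["name"]}

def read_v2(raw: str) -> dict:
    """New reader: falls back to None when an old record lacks "email"."""
    doc = json.loads(raw)
    return {"name": doc["name"], "email": doc.get("email")}

old_data = json.dumps({"name": "Ada"})
new_data = json.dumps({"name": "Ada", "email": "ada@example.com"})

backward_ok = read_v2(old_data)  # newer code reads older data
forward_ok = read_v1(new_data)   # older code reads newer data
print(backward_ok, forward_ok)
```

Forward compatibility only works here because the old reader tolerates keys it has never seen instead of rejecting the record.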
This chapter looks at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. Then Kleppmann goes over how encoded data is sent over HTTP / RPC / Message Queues
Formats for Encoding Data
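Programs usually work with data in two representations
- In memory: objects, structs, lists, hash tables, etc., usually linked together with pointers
- As a self-contained sequence of bytes, when writing data to a file or sending it over the network
- Pointers mean nothing to another process, so the byte sequence has to look quite different from the in-memory layout
Translating from in-memory to bytes is called encoding / serialization; the reverse is decoding / deserialization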
- I feel like this comes up a ton in software engineering
- This is a common problem, so we have a massive number of libraries and encoding formats to choose from
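As a rough illustration of the two representations, here is a sketch that turns an in-memory object into bytes and back. The `Person` type and the choice of JSON as the byte format are assumptions made for the example:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Person:  # hypothetical in-memory type
    user_name: str
    interests: list

def encode(p: Person) -> bytes:
    """In-memory object -> self-contained byte sequence."""
    return json.dumps(asdict(p)).encode("utf-8")

def decode(raw: bytes) -> Person:
    """Byte sequence -> a fresh in-memory object."""
    return Person(**json.loads(raw.decode("utf-8")))

original = Person("Martin", ["daydreaming", "hacking"])
wire = encode(original)
restored = decode(wire)

# Equal by value, but a different object in memory.
print(type(wire).__name__, restored == original, restored is original)
```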
Language Specific Formats
Many languages have built-in support for encoding in-memory objects into byte sequences
- Python has pickle, Java has Serializable
- Very convenient because objects can easily be saved and restored
- Problems
- Tied to one language -> reading the data from another language is very hard, so you are effectively committed to that language
- Has security flaws -> decoding has to be able to instantiate arbitrary classes (an attacker who controls the bytes can get arbitrary code to run)
- Versioning is usually an afterthought in these libraries
- Efficiency is often an afterthought too (some libraries have terrible CPU performance and bloated encodings)
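The security problem can be seen with a harmless sketch using Python's `pickle`. The `Payload` class is hypothetical, and `sorted` stands in for whatever callable an attacker would really choose:

```python
import pickle

class Payload:
    """Hypothetical object whose pickled form names a callable to run on load."""
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call sorted([3, 1, 2])".
        # A real attacker would name something far more dangerous than sorted.
        return (sorted, ([3, 1, 2],))

untrusted_bytes = pickle.dumps(Payload())

# Decoding calls the callable named inside the byte stream.
result = pickle.loads(untrusted_bytes)
print(result)
```

The decoded value is not a `Payload` at all; it is whatever the embedded call returned, which is why bytes from an untrusted source should not be unpickled.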
JSON, XML, and Binary Variants
- Moving on to standardized formats that can be written and read by many programming languages
- Widely used, accepted, and disliked
- XML
- Criticized for being too verbose and unnecessarily complicated
- JSON
- Web browser support
- Simple
- CSV
- Less powerful
JSON, XML, and CSV are textual formats and are somewhat human-readable
- Problems
- Ambiguity around encoding numbers (XML and CSV can’t distinguish a number from a string that happens to consist of digits)
- JSON distinguishes strings from numbers, but not integers from floating-point numbers, and doesn’t specify a precision
- Problem with large numbers: integers greater than 2^53 can’t be represented exactly in an IEEE 754 double
- They lose accuracy when parsed by a language that stores all numbers as floats (like JavaScript)
- Twitter hit this with its 64-bit tweet IDs: its API returns each ID twice, once as a JSON number and once as a decimal string
- JSON / XML support Unicode strings but not binary strings. Most people get around this by encoding binary data as text using Base64
- Works but increases data size by 33%
- CSV doesn’t have a schema, so the application must define the meaning of each row and column
- It’s a vague format (What to do with values that contain a comma or newline?)
- Escaping rules exist, but not all parsers implement them correctly
- Overall
- Good enough for many purposes
- As long as people agree on the format, it often doesn’t matter how pretty or efficient it is. Getting different organizations to agree on anything is the hard part
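The number and Base64 problems listed above can be checked with a short sketch. One assumption to note: `parse_int=float` is used here only to imitate a parser that stores every number as an IEEE 754 double, since Python's own `json` module keeps integers exact. The field names are illustrative.

```python
import base64
import json

big_id = 2**53 + 1  # 9007199254740993, needs more precision than a double has
text = json.dumps({"id": big_id})

# Python's json module keeps integers exact by default.
exact = json.loads(text)["id"]

# Assumption: parse_int=float imitates a parser that uses doubles for all numbers.
lossy = json.loads(text, parse_int=float)["id"]

# Workaround in the style of the tweet ID case: also send the ID as a string.
safe = json.loads(json.dumps({"id": big_id, "id_str": str(big_id)}))
roundtripped = int(safe["id_str"])

# Base64 turns every 3 bytes into 4 ASCII characters.
blob = bytes(300)
encoded = base64.b64encode(blob)

print(exact == big_id, int(lossy) == big_id, len(blob), len(encoded))
```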
Binary Encoding
For data that’s only used internally, there’s less pressure to use a lowest-common-denominator format, so you can pick something more compact or faster to parse
There are binary encodings for JSON
- MessagePack, BSON, BJSON, BISON, and Smile are some examples
Have to decide if saving around 20% of space is worth losing human readability (in the book’s example, MessagePack takes 66 bytes vs. 81 bytes for the textual JSON)
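To get a feel for the space saving, here is a hand-written encoder for a small subset of MessagePack (short maps, arrays, strings, and small non-negative integers only). It is a sketch, not the real `msgpack` library, and the sample values are just example data for the `Person` record:

```python
import json
import struct

def pack(value) -> bytes:
    """Encode a small subset of MessagePack; anything else raises TypeError."""
    if isinstance(value, dict) and len(value) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body            # fixmap
    if isinstance(value, list) and len(value) < 16:
        body = b"".join(pack(v) for v in value)
        return bytes([0x90 | len(value)]) + body            # fixarray
    if isinstance(value, str) and len(value.encode("utf-8")) < 32:
        raw = value.encode("utf-8")
        return bytes([0xA0 | len(raw)]) + raw               # fixstr
    if isinstance(value, int) and not isinstance(value, bool):
        if 0 <= value < 128:
            return bytes([value])                           # positive fixint
        if 0 <= value < 65536:
            return b"\xcd" + struct.pack(">H", value)       # uint 16
    raise TypeError(f"not covered by this sketch: {value!r}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")
as_msgpack = pack(record)
print(len(as_json), len(as_msgpack))
```

Most of the remaining bytes are the field names, which still travel with every record because there is no schema.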
Thrift and Protocol Buffers
Binary encoding libraries that are based on the same idea
Define a schema for the data, and the library encodes to binary based on it (example in Thrift’s interface definition language)
struct Person {
  1: required string userName,
  2: optional i64 favoriteNumber,
  3: optional list<string> interests
}
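Both support forward and backward compatibility
- Each field is identified by its tag number, so old code can skip tags it doesn’t recognize, and newly added fields must be optional or have a default
- Kleppmann gives examples of how this works, but I won’t take notes on it now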
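A toy sketch of why numbered field tags give compatibility in both directions. The wire layout below is invented for illustration and is not the real Thrift or Protocol Buffers encoding; the `nickname` field is hypothetical:

```python
import struct

# Toy wire format: each field is [1-byte tag][2-byte length][UTF-8 value].
def encode(fields: dict) -> bytes:
    out = b""
    for tag, value in fields.items():
        raw = value.encode("utf-8")
        out += struct.pack(">BH", tag, len(raw)) + raw
    return out

def decode(data: bytes, schema: dict) -> dict:
    """Decode using a tag -> field name map, skipping tags the schema lacks."""
    record, pos = {}, 0
    while pos < len(data):
        tag, length = struct.unpack_from(">BH", data, pos)
        pos += 3
        if tag in schema:
            record[schema[tag]] = data[pos:pos + length].decode("utf-8")
        pos += length  # the length prefix lets the reader skip unknown fields
    return record

SCHEMA_V1 = {1: "userName"}
SCHEMA_V2 = {1: "userName", 2: "nickname"}  # hypothetical new optional field

written_by_v2 = encode({1: "Martin", 2: "mk"})
written_by_v1 = encode({1: "Martin"})

old_reads_new = decode(written_by_v2, SCHEMA_V1)  # forward compatibility
new_reads_old = decode(written_by_v1, SCHEMA_V2)  # backward compatibility
print(old_reads_new, new_reads_old)
```

Renaming a field is harmless in this scheme because only the tag number is written, while changing a tag number would make existing data unreadable.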
Avro
Also uses a schema to specify the structure of the data being encoded
Uses a writer’s schema and a reader’s schema
- They don’t need to be the same, only compatible
I’m also going to skip this section on how it’s implemented
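A toy model of how a writer's schema and a reader's schema could be reconciled by field name. This is only a sketch of the idea, not Avro's binary format or API, and the schemas are hypothetical (field name, default) pairs:

```python
# The encoded data carries no field names, only values in writer-schema order.
def write(record: dict, writer_schema: list) -> list:
    return [record[name] for name, _default in writer_schema]

def read(values: list, writer_schema: list, reader_schema: list) -> dict:
    """Match fields by name: drop what the reader lacks, default what the writer lacks."""
    written = {name: v for (name, _d), v in zip(writer_schema, values)}
    return {name: written.get(name, default) for name, default in reader_schema}

WRITER = [("userName", None), ("favoriteNumber", None)]
READER = [("userName", None), ("interests", [])]

values = write({"userName": "Martin", "favoriteNumber": 1337}, WRITER)
resolved = read(values, WRITER, READER)
print(values, resolved)
```

The reader needs the exact writer's schema to make sense of the values, since nothing in the data itself says which value belongs to which field.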
Dataflow -> HTTPS / RPC / …
This section goes over common ways of moving encoded data to where it needs to go
Databases / Service Calls (REST, RPC) / Async Message Passing (Message Queue)
Dataflow Through Databases
The process that writes to a database encodes the data
The process that reads from the database decodes it
Need to be careful of losing fields
- Newer code writes a record that includes a new field
- Older code reads the record and decodes it into an object that has no place for that field
- The older code updates the record and writes it back
- Unless the unknown field was preserved, it is now missing from the database
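The lost-field scenario can be sketched with two versions of an update function in old code. The record shape and function names are hypothetical:

```python
import json

KNOWN_FIELDS = {"name", "city"}  # what the hypothetical old code knows about

def lossy_update(raw: str, city: str) -> str:
    """Old code that rebuilds the record from only the fields it models."""
    doc = json.loads(raw)
    model = {k: doc[k] for k in KNOWN_FIELDS if k in doc}
    model["city"] = city
    return json.dumps(model)

def preserving_update(raw: str, city: str) -> str:
    """Old code that edits the decoded document in place, keeping unknown fields."""
    doc = json.loads(raw)
    doc["city"] = city
    return json.dumps(doc)

# Written by newer code, which added "email".
stored = json.dumps({"name": "Ada", "city": "London", "email": "ada@example.com"})

after_lossy = json.loads(lossy_update(stored, "Paris"))
after_preserving = json.loads(preserving_update(stored, "Paris"))
print("email" in after_lossy, "email" in after_preserving)
```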
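He mentions that data outlives code, so one database can hold rows written under several schema versions
- Most relational databases let you add a new column with a NULL default without rewriting existing rows
- When an old row is read, the database fills in NULL for the column it is missing
- This helps keep forward and backward compatibility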
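A quick way to see NULL standing in for a column that older rows never had, using an in-memory SQLite database (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")  # row written under the old schema

# Schema change: the new column has no explicit default, so it defaults to NULL.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")
conn.execute("INSERT INTO users (name, email) VALUES ('Grace', 'grace@example.com')")

rows = conn.execute("SELECT name, email FROM users ORDER BY id").fetchall()
print(rows)
conn.close()
```

The old row comes back with `None` for `email`, so code reading the table has to treat the new column as optional.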
Can take data dumps at various points to keep snapshots of your database
- A data dump is typically encoded using the latest schema
Dataflow Through Services (REST / RPC)
Clients make requests to servers
- Requests go through the server’s API, e.g. HTTP methods like GET / POST
- The server builds a response and sends data back
Web browsers and native applications have to agree with the server on the API and the data format
Servers can also make requests to other servers
- Service Oriented Architecture (SOA) or Microservice Architecture
Goal is to make applications easier to change and maintain by having teams own each service
- Don’t need to coordinate with other teams to make their changes
- More isolation
Web Services
- HTTP Calls
- REST vs. SOAP
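- REST is a design philosophy built on the principles of HTTP
- SOAP is an XML-based protocol for making network API requests
- Its API is described in WSDL (not designed to be human readable)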
- SOAP relies on heavy tool support, code generation, and programming language support
- Has fallen out of favor at small / non-enterprise companies
Remote Procedure Calls
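- Tries to make a network request look like a local function call
- Older RPC technologies have also mostly fallen out of use
- Have problems
- Enterprise JavaBeans (EJB) and Java RMI only work with Java
- CORBA is insanely complex
- DCOM is only for Microsoft platforms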
- Other problems
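- Based on the idea of local function calls, but a network request is fundamentally different
- A local call either succeeds or fails; a network request can also time out and return nothing
- Must anticipate network problems (lost request or response, slow or unavailable remote machine)
- After a timeout there’s no way of knowing whether the request got through
- If only the response was lost, the call actually succeeded, so retrying can run the action twice unless it’s idempotent
- Network latency is highly variable, while a local function takes about the same time on every call
- RPC must translate datatypes between the programming languages of the client and the service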
- Is not going away
- gRPC supports streams
- Newer frameworks are more explicit about network calls and local function calls being different
- Thrift and Avro come with RPC support included; gRPC is an RPC implementation using Protocol Buffers
- Some frameworks also provide service discovery
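The lost-response problem from the list above can be simulated without a real network. In this sketch the `FlakyService` class, the idempotency key, and the retry helper are all hypothetical; the first response is deliberately "lost" after the action has already been applied:

```python
import itertools

class FlakyService:
    """Simulated remote service whose first response is lost in transit."""
    def __init__(self):
        self.balance = 0
        self.seen = {}  # idempotency key -> cached result
        self._calls = itertools.count()

    def deposit(self, amount, key=None):
        if key is not None and key in self.seen:
            result = self.seen[key]      # duplicate request: do not apply again
        else:
            self.balance += amount       # the action really happens
            result = self.balance
            if key is not None:
                self.seen[key] = result
        if next(self._calls) == 0:
            raise TimeoutError("response lost")  # the client cannot tell what happened
        return result

def call_with_retry(fn, attempts=3):
    """Retry on timeout, re-raising if the last attempt also times out."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise

naive = FlakyService()
call_with_retry(lambda: naive.deposit(10))

safe = FlakyService()
call_with_retry(lambda: safe.deposit(10, key="req-1"))
print(naive.balance, safe.balance)
```

Without the key, the blind retry deposits twice; with it, the service recognizes the duplicate and returns the earlier result.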
Message Queues
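- Similar to RPC in that a message is delivered to another process with low latency
- Goes through a message broker instead of a direct network connection
- Retries / broker can buffer messages while the recipient is unavailable and redeliver them so they aren’t lost / …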
- Read Alex Xu for more info
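A single-process sketch of the buffering role a broker plays. The `Broker` class and the topic name are hypothetical, and a real broker would also handle persistence, acknowledgements, and multiple consumers:

```python
from collections import defaultdict, deque

class Broker:
    """In-memory stand-in for a message broker."""
    def __init__(self):
        self.queues = defaultdict(deque)

    def send(self, topic: str, message: bytes) -> None:
        # The sender returns immediately; it never talks to the consumer directly.
        self.queues[topic].append(message)

    def receive(self, topic: str):
        """Return the next buffered message, or None if the queue is empty."""
        queue = self.queues[topic]
        return queue.popleft() if queue else None

broker = Broker()
# The consumer is "offline" while these are sent; the broker buffers them.
broker.send("orders", b'{"id": 1}')
broker.send("orders", b'{"id": 2}')

# Later the consumer comes up and drains the queue in order.
received = []
while (msg := broker.receive("orders")) is not None:
    received.append(msg)
print(received)
```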
Distributed Actor Frameworks
- The actor model is a programming model for concurrency
- Each actor typically represents one client or entity, has its own local state, and talks to other actors by sending asynchronous messages