Alex Xu V2 - 9 S3 like Object Storage

TLDR

Have replicas and nodes so users don’t lose data
Have a coordinator service for nodes
Have DB with meta data and have a pending status

S3 Launched in 2006

Versioning, bucket policy, and multi-part upload support in 2010
Server side encryption, multi-object delete, object expiration in 2011
2 trillion objects stored in 2013
Life Cycle policy, event notification, cross-region replication
100 trillion objects storied in S3 in 2021

Storage System 101

There’s three broad categories of storage

Block Storage File Storage Object Storage

Block Storage

Came out in 1960s
- SSD and HDD are attached to servers
- Most versatile because applications can use the blocks / volumes of storage, or they can use the file system of the OS

File Storage

Built on top of block storage
Data stored as hierarchial structure
SMB/CIFS and NFS are network protocols to transfer and share storage between servers
Has great simplicity to share files within an organizations

Object Storage

New, sacrifices performance for durability
Stores objects in a flat structure, no hierarchy
Data accessed via RESTful APi

![[Pasted image 20240216082313.png]]

![[Pasted image 20240216082328.png]]

Terminology

Bucket -> Container for objects

Object -> Individual piece of data we store in bucket

Versioning -> Multiple variants of an object in the same bucket

Uniform Resource Identifier (URI) -> Object storage provides RESTful APIs

Service - level agreements (SLA) -> Contract between service provider and client

Design for durability of 99.9999999% of objects on multiple availability zones

Step 1 - Understand Problem and Establish Scope

Which features should be included in the design?

Bucket creation
Object uploading and downloading
Object versioning
Listing objects in a bucket, like AWS s3, ls command

What’s typical data size?

Few GBs or more and large number of small objects (tens of KBs) efficiently

How much data do we need to store in one year

100 PB

Can we assume durability is 6 nines? and service availability is 4 nines?

(Note I like this question for showing knowledge of nines)
Sounds reasonable

Non-functional Requirements

100PB of data
Data durability of 6 nines
Service availability of 4 nines
Storage efficiency

Back of Envelope Estimations

Object storage has bottlenecks of disk capacity or disk IO per second

Disk Capacity

Let’s assume
20% are small
60% of objects are medium
20% are large objects

IOPS -> one hard disk can do 100 - 150 random seeks per second (100 - 150 IOPS)

Use median of small, medium, and large object to simplify math 40% usage ratio gives

100PB = 100 x 1000 x 1000 x 1000 MB = 10^11 MB 10^11 x .4 / (.2x.5MB … …. ..) = .68B objects

Probably won’t use these numbers but Xu likes to have numbers

Step 2 -> Propose High level design and get buy-in

Characteristics of Object Storage

Object immutability
Key-Value Store
Write once, read many times
support both small and large objects

File storage in UNIX works by writing inodes that point to where the data is located

![[Pasted image 20240216085007.png]]

Load balancer

API service

Identity and acces management

Data store

Metadata store

Uploading an object

![[Pasted image 20240216090749.png]]

Downloading an Object

![[Pasted image 20240216090805.png]]

Step 3 - Design Deep Dive

Data Store

Has three main components

![[Pasted image 20240216092328.png]]

Data Routing uses RESTful or GRPC APIs to access datanode cluster. It can scale by adding mroe servicers

It queries the placement service to get best data node to store data
Reads data from data nodes to give to API service
Writes data to the data nodes

Placement Service

Uses a virtual cluster map of all data nodes
Uses heart beats to make sure the data node is available
Also contains location info for each data node to make sure replicas are physically separated
Paxos or Raft consensus protocol can build this service, best to have 5 or 7

Data Node

Stores the actual data Has a data service darmon running on it Sends heartbeats to the placement service

Says how much data is stored on each drive
Says how many disk drives the data node is managing

When placement receives the first ever message, it sets it on the virtual cluster map and gives a unique ID

Persisting Node Data

![[Pasted image 20240216092856.png]]

Notice the replication

Primary node responds once writes complete and the data routing service ACKS and sends object ID back to API

Can choose which level of replication you want. There’s a tradeoff between consistency and latency

ACK after 2 successful replicas (Slowest, but safest and highest consistency)
ACK after 1
ACK after 0 successful replicas but a success primary (lowest latency)

Data Routing Service

Three responsbilities

Query Placement service to get best data node
Read data from data nodes to return
Write data to data node

Placement Service

Determines which node to choose to store the data / object Uses virtual cluster map ![[Pasted image 20240502184231.png]] Used to make sure replicas are physically separated

If one region goes done, not all data is lost

Monitors all nodes through heartbeats, it node doesn’t send heartbeat within window, marked as DOWN on the map

Xu suggests using cluster of 5 or 7 nodes with Paxos or Raft consensus protocol***

idk what that means
Consensus protocol means that service will work as long as half nodes are UP

Data Node

Store actual object
Node will control several HDD / SDDs and send info through heartbeats to Placement Service
Also says how much data is stored on each drive
First hearbeat assigns an ID to the node and maps it on the virtual cluster map

How Data is Organized

Naive soltion is to store each object on it’s own file

Bad for scalibility / performance
- Wastes data blocks
  - Systems use data blocks, and 1 Byte file will take up entire 4KB block
- Could exceed system’s inode capacity
  - System isn’t used to having like 1M tiny files Solution
Merge files into a larger file
Basically a WAL (Write ahead log)
- Append all objects to read-write file, when it’s full, make a new read-write file to save it too
- Mark old as read-only ![[Pasted image 20240502185051.png]]

Writes must be serialized. Objects are stored in order, one after the other Multiple cores are needed to write in parallel and must take turns

Each core can have their own dedicated read-write files

What database to use for key-value store?

RocksDB
- Slower for reads (Uses an SS Table)
- S3 Is read heavy
Each Data Node can have it’s own little relationalDB like SQLite
- Never need to share data across nodes, only Node care about their own data
- File based DB with great reputation

Flow

API wants to save new object named object4 Appends new object onto read-write file names /data/c New record of object4 is inserted into object mapping table Data node service returns the UUID of object4 to API service

Object Look Up

Find the data using the file name, starting offset, and the size of data

How to store data

I’m skipping this section

Goes into details on how file storage works

Need the start offset of the object along with it’s size

Durability

Hardware failures and failure domain

We replicate data 3 times because each hard drive has about a 0.81% failure rate per year
3 data copies gives 0.999999 reliability or something

Failure domain requires isolating the physical environment, so one environment going down doesn’t destroy everything

Rack-level domain -> Everything for a server rack if it goes down

This is why Xu recommends availability zones ![[Pasted image 20240216093614.png]]

Erasure Coding to boost durability

Data is spliced into smaller pieces and places on different servers Creates parity If a failure occurs, we can reconstruct the pieces

![[Pasted image 20240216093742.png]]

You don’t lose 100% of the data

![[Pasted image 20240216093814.png]] Erasure has cost efficiency, don’t need to spend money on more stoarge

Greatly complicates node design though

Correctness Verification -> Data Corruption

In-memory data corruption happen a lot in big systems

Use checksums to verify if the data was correct

looks like a hashing algorithm that you can run to see if the data is the same

MD5, SHA1, HMAC

Metadata data model

Find object ID by name Insert and delete an object based on object name List objects in a bucket sharing the same prefix

![[Pasted image 20240216094536.png]]

Scaling the bucket

There’s a limit on buckets a user can create so the bucket size is small. So scale isn’t a problem

Scaling the object table

Holds the metadata for the object.

Can scale this table with sharding.

Which key to use?
bucket_Name can introduce hot key issue
object_id works but makes key range queries hard for query 2 and 3
- Maybe normalize?
- Use a hash of the bucket_name and object name as sharding key

Closing out this chapter to maybe do later

Things mentioned..

Query to list objects in bucket Single vs. Distributed databases for this query Uploading large objects Garbage collections object versioning

Object Versioning

Add new column called object_version

Instead of deleting objects, you always keep them in the mete-data store
Reroute the UUID’s and generate a new UUID for the old data
Basically it, although Xu wrote more examples of how it works

Optimizing Uploads of Large Files

Multipart Uploads

Start upload process, generate a ETag (Num of parts) When num of parts are all uploaded, mark as success

![[Pasted image 20240502190315.png]]

Garbage Collection

Reclaiming dead storage

Storage can be marked as deleted but never deleted
Half uploaded files can be never deleted
Corrupted Data

Occasionally checks all the objects and sees if they should be deleted Deletes object in primaries and backups

If Erasure coding, for an 8+4 setup, deletes from all 12 nodes

If Delete flag is true, it will delete all the objects