Alex Xu V2 - 9 S3 like Object Storage


S3 Launched in 2006

Storage System 101

There are three broad categories of storage:

Block Storage -> raw volumes exposed to a server as if they were locally attached disks

File Storage -> built on top of block storage; data organized in a hierarchy of files and directories

Object Storage -> trades performance for durability, scale, and low cost; objects sit in a flat structure and are accessed via RESTful APIs

![[Pasted image 20240216082313.png]]

![[Pasted image 20240216082328.png]]


Bucket -> Container for objects

Object -> Individual piece of data we store in a bucket

Versioning -> Multiple variants of an object in the same bucket

Uniform Resource Identifier (URI) -> Object storage provides RESTful APIs; each object is addressable by a URI

Service-level agreement (SLA) -> Contract between a service provider and a client

Step 1 - Understand Problem and Establish Scope

Which features should be included in the design?

What’s typical data size?

How much data do we need to store in one year?

Can we assume durability is six nines (99.9999%) and service availability is four nines (99.99%)?

Non-functional Requirements

Back of Envelope Estimations

Object storage is bottlenecked by either disk capacity or disk IO per second (IOPS)

Disk Capacity

IOPS -> one hard disk can do 100 - 150 random seeks per second (100 - 150 IOPS)

Use the median of small, medium, and large objects to simplify the math; a 40% usage ratio gives:

100 PB = 100 x 1000 x 1000 x 1000 MB = 10^11 MB

10^11 x 0.4 / (0.2 x 0.5 MB + 0.6 x 32 MB + 0.2 x 200 MB) ≈ 0.68 billion objects

Probably won’t use these numbers but Xu likes to have numbers
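The estimate above can be reproduced in a few lines. The 20/60/20% split with median sizes 0.5 MB / 32 MB / 200 MB is the small/medium/large assumption used for this estimate:

```python
# Back-of-envelope: objects stored in 100 PB at a 40% usage ratio.
# Size mix assumed: 20% small (median 0.5 MB), 60% medium (32 MB),
# 20% large (200 MB).
TOTAL_MB = 100 * 1000 * 1000 * 1000          # 100 PB = 10^11 MB
USAGE_RATIO = 0.4
avg_mb = 0.2 * 0.5 + 0.6 * 32 + 0.2 * 200    # weighted median object size (~59.3 MB)
objects = TOTAL_MB * USAGE_RATIO / avg_mb
print(f"~{objects / 1e9:.1f} billion objects")   # ~0.7 billion
```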

Step 2 -> Propose High level design and get buy-in

Characteristics of Object Storage

File storage in UNIX works by writing inodes that point to where the data is located

![[Pasted image 20240216085007.png]]

Load balancer

API service

Identity and access management (IAM)

Data store

Metadata store

Uploading an object

![[Pasted image 20240216090749.png]]

Downloading an Object

![[Pasted image 20240216090805.png]]

Step 3 - Design Deep Dive

Data Store

Has three main components

![[Pasted image 20240216092328.png]]

The data routing service uses RESTful or gRPC APIs to access the data node cluster. It can scale by adding more servers

Placement Service
Data Node

Stores the actual data

Has a data service daemon running on it

Sends heartbeats to the placement service

When the placement service receives the first-ever heartbeat from a data node, it adds the node to the virtual cluster map and assigns it a unique ID

Persisting Node Data

![[Pasted image 20240216092856.png]]

Notice the replication

The primary node responds once the writes complete; the data routing service ACKs and sends the object ID back to the API service

You can choose which level of replication you want. There's a tradeoff between consistency and latency
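The tradeoff can be sketched as a write that returns after a configurable number of replica acknowledgements. The function name and the replicas-as-callables model are my own illustration, not the book's API:

```python
import concurrent.futures

def write_object(data: bytes, replicas, min_acks: int) -> bool:
    """Write to all replicas in parallel; succeed once `min_acks` have acked.

    min_acks = 1              -> lowest latency, weakest durability guarantee
    min_acks = len(replicas)  -> strongest consistency, highest latency
    """
    acks = 0
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(replica, data) for replica in replicas]
        for done in concurrent.futures.as_completed(futures):
            if done.result():
                acks += 1
            if acks >= min_acks:
                break  # note: this sketch still waits for stragglers on pool exit
    return acks >= min_acks

# Hypothetical replicas: callables that accept the bytes and return True on success.
replicas = [lambda d: True, lambda d: True, lambda d: True]
print(write_object(b"object-data", replicas, min_acks=2))  # True
```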

Data Routing Service

Three responsibilities

Placement Service

Determines which data node to choose to store an object

Uses the virtual cluster map ![[Pasted image 20240502184231.png]]

The map is used to make sure replicas are physically separated

Monitors all nodes through heartbeats; if a node doesn't send a heartbeat within the window, it's marked as DOWN on the map

Xu suggests running the placement service as a cluster of 5 or 7 nodes with the Paxos or Raft consensus protocol
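The heartbeat bookkeeping above can be sketched as follows; the window length and map structure are illustrative:

```python
import time

HEARTBEAT_WINDOW = 15.0  # seconds without a heartbeat before a node is DOWN (illustrative)

class VirtualClusterMap:
    """Sketch of the placement service's view of the data nodes."""

    def __init__(self):
        self.last_seen = {}   # node_id -> monotonic time of last heartbeat
        self.status = {}      # node_id -> "UP" | "DOWN"

    def on_heartbeat(self, node_id: str):
        # A first-ever heartbeat registers the node on the map.
        self.last_seen[node_id] = time.monotonic()
        self.status[node_id] = "UP"

    def sweep(self):
        # Run periodically: mark silent nodes DOWN on the map.
        now = time.monotonic()
        for node_id, seen in self.last_seen.items():
            if now - seen > HEARTBEAT_WINDOW:
                self.status[node_id] = "DOWN"
```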

Data Node

How Data is Organized

The naive solution is to store each object in its own file; better is to merge many small objects into a large read-write file

Writes to a read-write file must be serialized: objects are stored in order, one after the other. With multiple cores writing in parallel, writers must take turns on the file (or use multiple read-write files)

What database to use for key-value store?


The API service wants to save a new object named object4

The data node appends object4 onto the read-write file named /data/c

A new record for object4 is inserted into the object mapping table

The data node service returns the UUID of object4 to the API service

Object Look Up

Find the data using the file name, starting offset, and the size of data
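The append-then-lookup scheme can be sketched as below. The mapping of object UUID to (file name, start offset, size) is the book's idea; the in-memory dict standing in for the key-value store, and `/tmp/data_c` standing in for the book's `/data/c`, are mine:

```python
import os
import uuid

class DataNode:
    """Sketch of a data node that appends objects to one read-write file."""

    def __init__(self, path="/tmp/data_c"):   # stand-in for the book's /data/c
        self.path = path
        self.mapping = {}                     # object_id -> (file_name, offset, size)
        open(path, "wb").close()              # start with a fresh read-write file

    def put(self, data: bytes) -> str:
        object_id = str(uuid.uuid4())
        offset = os.path.getsize(self.path)   # new object starts at current end of file
        with open(self.path, "ab") as f:      # append onto the read-write file
            f.write(data)
        self.mapping[object_id] = (self.path, offset, len(data))
        return object_id                      # UUID goes back to the API service

    def get(self, object_id: str) -> bytes:
        file_name, offset, size = self.mapping[object_id]
        with open(file_name, "rb") as f:      # file name + start offset + size
            f.seek(offset)
            return f.read(size)
```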

How to store data

I’m skipping this section

Goes into details on how file storage works

Need the start offset of the object along with its size


Hardware failures and failure domain

Failure domain requires isolating the physical environment, so one environment going down doesn’t destroy everything

This is why Xu recommends availability zones ![[Pasted image 20240216093614.png]]

Erasure Coding to boost durability

Data is split into smaller pieces placed on different servers, and parity pieces are created. If a failure occurs, we can reconstruct the lost pieces from the surviving data and parity

![[Pasted image 20240216093742.png]]

You don’t lose 100% of the data

![[Pasted image 20240216093814.png]]

Erasure coding is cost-efficient; you don't need to spend money on as much extra storage

It greatly complicates the data node design, though
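A minimal flavor of the idea is single XOR parity; real systems use Reed-Solomon codes over many data and parity shards, so this sketch only shows why reconstruction works:

```python
def make_parity(pieces):
    """XOR equal-length data pieces together; the parity lives on its own server."""
    parity = bytearray(len(pieces[0]))
    for piece in pieces:
        for i, b in enumerate(piece):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving_pieces, parity):
    """Recover a single lost piece by XORing the parity with the survivors."""
    return make_parity(list(surviving_pieces) + [parity])

pieces = [b"aaaa", b"bbbb", b"cccc"]   # data spread across three servers
parity = make_parity(pieces)           # stored on a fourth server
# The server holding pieces[1] fails; rebuild its piece from what's left:
assert reconstruct([pieces[0], pieces[2]], parity) == b"bbbb"
```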

Correctness Verification -> Data Corruption

In-memory data corruption happens a lot in big systems

Use checksums to verify that the data is still correct
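A sketch of checksum verification on the read path; SHA-256 and the dict-backed store are my choices here (production systems often use faster checksums like CRC32C):

```python
import hashlib

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def write_with_checksum(store: dict, key: str, data: bytes):
    store[key] = (data, checksum(data))      # persist the checksum next to the object

def read_verified(store: dict, key: str) -> bytes:
    data, stored_sum = store[key]
    if checksum(data) != stored_sum:         # corruption detected
        raise IOError(f"checksum mismatch for {key}: read from a replica instead")
    return data
```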


Metadata data model

Find the object ID by name

Insert and delete an object based on the object name

List objects in a bucket sharing the same prefix

![[Pasted image 20240216094536.png]]

Scaling the bucket

There's a limit on how many buckets a user can create, so the bucket table stays small and scaling it isn't a problem

Scaling the object table

Holds the metadata for the object.

Can scale this table with sharding.
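A common way to shard the object table is by a hash of (bucket name, object name), so lookups by name always hit a single shard; the shard count and hash choice here are illustrative:

```python
import hashlib

NUM_SHARDS = 16  # illustrative shard count

def shard_for(bucket_name: str, object_name: str) -> int:
    """Route a metadata row to a shard; queries by (bucket, name) hit one shard."""
    key = f"{bucket_name}/{object_name}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % NUM_SHARDS
```

The tradeoff: listing all objects in a bucket by prefix now fans out across every shard, which is why that query gets harder with a distributed database.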

Closing out this chapter to maybe do later

Things mentioned..

Query to list objects in a bucket

Single vs. distributed databases for this query

Uploading large objects

Garbage collection

Object versioning

Object Versioning

Add a new column called object_version; each write of the same object inserts a new row instead of overwriting it

Optimizing Uploads of Large Files

Multipart Uploads

The client starts the upload process and gets an upload ID; each uploaded part returns an ETag. When all of the parts are uploaded, the client completes the upload with the part numbers and ETags, and the upload is marked as a success

![[Pasted image 20240502190315.png]]
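The multipart flow above can be sketched server-side as below; the class, the in-memory storage, and MD5-as-ETag are stand-ins for a real implementation:

```python
import hashlib
import uuid

class MultipartUpload:
    """Sketch of the server side of a multipart upload."""

    def __init__(self):
        self.uploads = {}   # upload_id -> {part_number: (etag, data)}

    def initiate(self) -> str:
        upload_id = str(uuid.uuid4())         # client starts, receives an upload ID
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        etag = hashlib.md5(data).hexdigest()  # each part is acknowledged with an ETag
        self.uploads[upload_id][part_number] = (etag, data)
        return etag

    def complete(self, upload_id: str, parts: list[tuple[int, str]]) -> bytes:
        """Client sends back (part_number, etag) pairs; assemble the parts in order."""
        stored = self.uploads.pop(upload_id)
        assert all(stored[n][0] == etag for n, etag in parts), "ETag mismatch"
        return b"".join(stored[n][1] for n, _ in sorted(parts))
```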

Garbage Collection

Reclaiming dead storage

It occasionally scans all the objects, checks whether they should be deleted, and deletes each one in the primaries and the backups

If an object's delete flag is true, the garbage collector reclaims its storage
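The delete-flag sweep can be sketched as a pass that keeps live objects and tallies reclaimed space; the per-object dict layout is illustrative:

```python
def garbage_collect(objects):
    """Periodic sweep: keep live objects, reclaim storage of flagged ones.

    `objects` is a list of dicts like {"id": ..., "data": ..., "deleted": bool};
    the same sweep must be applied to primaries and all backups alike.
    """
    live = [o for o in objects if not o["deleted"]]
    reclaimed = sum(len(o["data"]) for o in objects if o["deleted"])
    return live, reclaimed
```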