Alex Xu V2 - 8 Distributed Email Service

TLDR

Problem

Large-scale email services like Gmail, Outlook, and Yahoo

Gmail has about 1.8B active users

Step 1 - Understand the Problem and Establish Scope

Very important because an email service is a complex system with many features. Need to narrow down what the interviewer cares about.

How many people use the product?

I think these features are important. Is this good?

Answer -> Good list. Don’t worry about authentication.

How do users connect with mail servers?

Can emails have attachments?

Non Functional Requirements

Availability, eventual consistency, reliability

Back of Envelope Estimations

1B users
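A rough sketch of how the user count could be turned into a sending rate. Only the 1B figure comes from these notes; the 10 emails sent per user per day is an assumed input for illustration, not a number recorded here.

```python
# Only USERS comes from the notes; the per-user rate is an assumption.
USERS = 1_000_000_000
EMAILS_SENT_PER_USER_PER_DAY = 10  # assumed, for illustration
SECONDS_PER_DAY = 24 * 60 * 60


def send_qps(users: int, per_user_per_day: int) -> float:
    """Average number of emails sent per second across all users."""
    return users * per_user_per_day / SECONDS_PER_DAY


qps = send_qps(USERS, EMAILS_SENT_PER_USER_PER_DAY)
print(f"~{qps:,.0f} emails sent per second")
```

Under these assumptions the average lands a little above 100,000 sends per second, before accounting for peaks.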

Step 2 -> Propose High Level Design and Get Buy-in

Email 101 - What is it

SMTP -> Standard protocol for sending emails from one mail server to another

POP -> Standard protocol for receiving and downloading emails from a remote mail server to a local email client. Once downloaded, they are deleted from the server.

IMAP -> Standard protocol for receiving emails for a local email client

HTTPS -> Not a mail protocol, but can be used to access your mailbox

DNS

Looks up the mail exchange record (MX record) for a domain.

DNS will have several mail server options typically for a given domain
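A minimal sketch of how a sending server might order the mail servers returned by an MX lookup. MX records carry a preference number, and lower numbers are tried first. The record values and host names below are hypothetical, and no real DNS query is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MxRecord:
    preference: int  # lower value = tried first
    host: str


def delivery_order(records: list[MxRecord]) -> list[str]:
    """Return mail server hosts in the order a sender would try them."""
    return [r.host for r in sorted(records, key=lambda r: r.preference)]


# Hypothetical MX records for example.com
records = [
    MxRecord(20, "mx-backup.example.com"),
    MxRecord(10, "mx-primary.example.com"),
]
order = delivery_order(records)
```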

Attachments

Traditional mail servers

![[Pasted image 20240214164628.png]]

For storage, Maildir was typically used. It stored each user's mail as individual files in per-user directories on the file system.

![[Pasted image 20240214164800.png]]
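To get a feel for the layout, here is a small sketch using Python's standard `mailbox` module, which can create and write a Maildir. The user directory name and addresses are made up, and everything is written to a temporary directory.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "alice"  # hypothetical per-user directory

# create=True builds the tmp/, new/, and cur/ subdirectories.
inbox = mailbox.Maildir(str(root), create=True)

msg = EmailMessage()
msg["From"] = "bob@example.com"
msg["To"] = "alice@example.com"
msg["Subject"] = "Hello"
msg.set_content("Hi Alice")

key = inbox.add(msg)  # each message is stored as its own file
subdirs = sorted(p.name for p in root.iterdir() if p.is_dir())
```

One file per message in one directory tree per user is simple on a single machine, which is also why it gets hard to scale and back up once the user count grows.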

This file structure became a bottleneck as users continued to grow. What if a server dies? What about backups?

Need a better distribution layer

Distributed Mail Servers

Designed to support modern use cases and solve problems of scale and resiliency

Email APIs

SMTP / POP / IMAP APIs for native mobile clients

SMTP communications between sender and receiver servers

RESTful API over HTTP for full-featured and interactive web-based email applications

Xu only covers some of the most important HTTP APIs for webmail.

1 -> POST /v1/messages
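A sketch of what a request to this endpoint could look like. The notes only record the method and path; every field name in the body below is a hypothetical choice for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SendMessageRequest:
    # All field names are hypothetical; the notes do not specify the body.
    to: list[str]
    subject: str
    body: str
    cc: list[str] = field(default_factory=list)
    bcc: list[str] = field(default_factory=list)


def build_request(req: SendMessageRequest) -> tuple[str, str, str]:
    """Return (method, path, JSON body) for the send-message call."""
    return "POST", "/v1/messages", json.dumps(asdict(req))


method, path, payload = build_request(
    SendMessageRequest(to=["alice@example.com"], subject="Hi", body="Hello")
)
```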

Distributed Message Architecture

How do we store messages? How do we sync data across servers? How do we keep email from being flagged as spam?

![[Pasted image 20240214170103.png]]

Webmail -> web browsers to send and receive emails

Web servers -> login, sign up, profiles, sending email, loading folders, etc.

Real time servers -> Provide real-time updates to the user

Metadata database

Attachment store

Distributed Cache

Search store

Email Sending flow

![[Pasted image 20240214170653.png]]

The web servers run basic email validation. Emails that fail validation are put in an error queue.

Outgoing Message Queue

SMTP Outgoing Service
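The validate-then-queue step can be sketched roughly as below. The specific checks and the size limit are assumptions, and in-memory deques stand in for the real message queues that the SMTP outgoing service would consume from.

```python
from collections import deque
from dataclasses import dataclass

MAX_SIZE_BYTES = 25 * 1024 * 1024  # assumed size limit, for illustration


@dataclass
class OutgoingEmail:
    sender: str
    recipients: list[str]
    body: bytes


outgoing_queue = deque()  # stand-in for the outgoing message queue
error_queue = deque()     # stand-in for the error queue


def is_valid(email: OutgoingEmail) -> bool:
    """Basic checks only: has recipients, one '@' per address, size cap."""
    addresses = [email.sender, *email.recipients]
    return (
        bool(email.recipients)
        and all(a.count("@") == 1 for a in addresses)
        and len(email.body) <= MAX_SIZE_BYTES
    )


def submit(email: OutgoingEmail) -> str:
    """Route to the outgoing queue, or to the error queue if invalid."""
    if is_valid(email):
        outgoing_queue.append(email)
        return "queued"
    error_queue.append(email)
    return "rejected"
```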

Email Receiving Flow

![[Pasted image 20240214171053.png]]

1 -> Incoming emails arrive at the SMTP load balancer

2 -> Load balancer distributes traffic among SMTP servers

7 -> If the receiver is online (has an open connection), the email is pushed to the real-time servers

9 -> offline users will grab when they come online

10 -> Web servers pull new emails from storage and send to client
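A rough sketch of the online / offline branch in steps 7 to 10. It assumes the email is always persisted first and the push is an extra for online users; the class and function names are hypothetical stand-ins for the real-time servers and the storage layer.

```python
from collections import defaultdict


class RealTimeServer:
    """Stand-in for a real-time server holding open client connections."""

    def __init__(self) -> None:
        self.online: set[str] = set()
        self.pushed: dict[str, list[str]] = defaultdict(list)

    def push(self, user_id: str, email_id: str) -> None:
        self.pushed[user_id].append(email_id)


mailbox_store: dict[str, list[str]] = defaultdict(list)  # stand-in for storage


def deliver(rt: RealTimeServer, user_id: str, email_id: str) -> None:
    """Persist the email (assumed), then push only if the receiver is online."""
    mailbox_store[user_id].append(email_id)
    if user_id in rt.online:
        rt.push(user_id, email_id)


def fetch_on_login(user_id: str, already_seen: set[str]) -> list[str]:
    """Offline users pull anything they have not seen once they reconnect."""
    return [e for e in mailbox_store[user_id] if e not in already_seen]
```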

Step 3 - Design Deep Dive

Metadata database

Database options

Data Model

We can partition databases based on user_id

Primary key has two components -> Partition key and clustering key

Need to support the following queries

![[Pasted image 20240215071650.png]]

Query 2: Display all emails for a specific folder

We can sort the emails by timestamp. Partition them by their user_id and folder_id. The email_id can be a TIMEUUID, which we use for sorting.

This lets us find the unread emails fastest
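A toy in-memory version of this layout, to make the partition key / clustering key split concrete. The table and function names are assumptions, and Python's version-1 UUID (which embeds a timestamp) stands in for a TIMEUUID.

```python
import uuid
from bisect import insort
from collections import defaultdict

# Partition key: (user_id, folder_id). Clustering key: email_id (time-based).
# Rows inside a partition are kept sorted by the UUID's embedded timestamp.
emails_by_folder = defaultdict(list)  # hypothetical table name


def insert_email(user_id: str, folder_id: str, subject: str) -> uuid.UUID:
    email_id = uuid.uuid1()  # version-1 UUIDs embed a timestamp
    row = (-email_id.time, email_id, subject)  # negate so newest sorts first
    insort(emails_by_folder[(user_id, folder_id)], row)
    return email_id


def list_folder(user_id: str, folder_id: str, limit: int = 50):
    """Query 2: one partition lookup, rows already in newest-first order."""
    rows = emails_by_folder[(user_id, folder_id)][:limit]
    return [(email_id, subject) for _, email_id, subject in rows]
```

Because the sort order is maintained at write time, listing a folder never needs a scan or a sort at read time.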

emails

Query 3: Create / Delete / Get an email

![[Pasted image 20240215072244.png]]

Query 4: Fetch all read or unread emails

![[Pasted image 20240215072350.png]]

Xu’s solution said no SQL databases earlier though

NoSQL databases of this kind typically only support queries on the partition key and clustering keys

One option is to fetch the entire folder and filter it ourselves but that’s inefficient

NoSQL can use Denormalization
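A sketch of what denormalization could look like here: keep read and unread emails in two separate tables so Query 4 becomes a plain partition read. The table names and function names are assumptions, and dicts stand in for the tables.

```python
from collections import defaultdict

# Two denormalized tables, both partitioned by (user_id, folder_id).
read_emails = defaultdict(dict)    # hypothetical table names
unread_emails = defaultdict(dict)


def receive(user_id, folder_id, email_id, subject):
    """New emails start out in the unread table."""
    unread_emails[(user_id, folder_id)][email_id] = subject


def mark_read(user_id, folder_id, email_id):
    """Move the row: delete from unread_emails, insert into read_emails."""
    key = (user_id, folder_id)
    subject = unread_emails[key].pop(email_id)
    read_emails[key][email_id] = subject


def fetch_unread(user_id, folder_id):
    """Query 4 becomes a single-partition read with no filtering."""
    return list(unread_emails[(user_id, folder_id)].values())
```

The cost is a more complicated write path (marking an email as read is now a delete plus an insert), traded for cheap reads.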

Consistency Trade Off

Distributed databases that use replication for high availability must make a tradeoff between consistency and availability. Correctness is very important for email systems, so we want a single primary for any given mailbox. The mailbox won’t be accessible to clients during a crash / failover, but we accept that in exchange for consistency.

Email Deliverability

It’s easy to send emails, but hard to keep them from being flagged as spam.

Use dedicated IPs from trusted sources

Classify emails -> Send marketing emails from their own IP address, separate from other mail

Email sender reputation -> Slowly build a good reputation with large receiving providers (e.g., Google)

Search depends on context. Emails have a lot of different attributes that can be searched for.

We’re constantly reindexing when a user sends, receives or deletes emails (What does this mean??)

This means the search for emails has a lot of writes
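One way to read "reindexing": a search index maps each term to the emails containing it, so every send, receive, or delete has to update that mapping. A toy inverted index under that reading (the whitespace tokenizer and all names are simplifications chosen for illustration):

```python
from collections import defaultdict

index = defaultdict(set)  # term -> ids of emails containing it
documents = {}            # email_id -> text, needed to undo indexing on delete


def tokenize(text: str) -> set[str]:
    return set(text.lower().split())


def index_email(email_id: str, text: str) -> None:
    """Called on send or receive: one index write per distinct term."""
    documents[email_id] = text
    for term in tokenize(text):
        index[term].add(email_id)


def remove_email(email_id: str) -> None:
    """Called on delete: every posting for the email must be removed."""
    for term in tokenize(documents.pop(email_id)):
        index[term].discard(email_id)


def search(term: str) -> set[str]:
    return set(index.get(term.lower(), set()))
```

A single email touches one index entry per distinct word, which is why the search store sees far more writes than reads.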

![[Pasted image 20240215073936.png]]

![[Pasted image 20240215074230.png]]

Reindexing can be done on the database, and we can query using a synchronous call to find the mail the user wants

The challenge with Elasticsearch is keeping the primary email store in sync with it

Custom Search Solution

Designing an email search engine is very complicated and out of scope

Xu will talk about the disk I/O Bottleneck that one could face trying to do this

Users could easily have half a million emails. Attachments are at the PB daily level. So many writes.

Disk I/O is the main bottleneck

LSM trees are good for writes
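A toy sketch of why that is: writes only touch an in-memory table, and disk only sees occasional sequential flushes of sorted runs. This is a simplified illustration (no compaction, no write-ahead log, lists standing in for on-disk files), and the class name and flush threshold are made up.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value  # writes touch memory only
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as one sorted run (sequential I/O on disk)."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Reads may have to check several runs, which is the price paid for the cheap writes.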

![[Pasted image 20240215074618.png]]

Scalability and Availability

Scales horizontally and can be made available using replication

Follow ups

Email Security and Compliance

Fault Tolerance