Alex Xu V2 - 6 Ad Click Event Aggregation

TLDR

Online advertising is a multi-billion-dollar industry and the backbone of the most popular social media apps.

Real-Time Bidding (RTB) is a core component of online advertising.

Step 1 - Understand Problem and Establish Design Scope

What’s the form of the data?

What’s the data volume?

Important queries to support?

Do we need to worry about edge cases?

What’s latency requirement?

Functional Requirements

Non-Functional Requirements

Back of Envelope Estimations

Step 2 - Propose High Level Design and Get Buy In

API 1 -> Aggregate the number of clicks for a given ad_id in the last M minutes

GET v1/ads/{:ad_id}/aggregated_count | Return aggregated event count for given ad_id

Request parameters

Response
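
A minimal sketch of calling this API. The query parameter names `from`, `to`, and `filter` are assumptions for illustration:

```python
import requests

# Hypothetical call: aggregated click count for one ad over a time range.
resp = requests.get(
    "https://ads.example.com/v1/ads/ad001/aggregated_count",
    params={"from": 1620000000, "to": 1620000060, "filter": "0012"},
)
print(resp.json())  # e.g. {"ad_id": "ad001", "count": 245}
```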

API 2: Return top N most clicked ad_ids in last M minutes

GET v1/ads/popular_ads

Request Parameters

Response
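
Same idea for the top-N endpoint; `count`, `window`, and `filter` are assumed parameter names:

```python
import requests

# Hypothetical call: top 10 most clicked ads over the last 60 minutes.
resp = requests.get(
    "https://ads.example.com/v1/ads/popular_ads",
    params={"count": 10, "window": 60, "filter": "0012"},
)
print(resp.json())  # e.g. {"ad_ids": ["ad001", "ad093"]}
```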

Data Model

Raw data looks like: ad_id, click_timestamp, user_id, ip, country (one row per click)

Aggregated data

ad_id | timestamp | count

Can also add a filter_id column and a separate filter table, then JOIN to answer filtered queries (e.g., clicks per country for an ad)
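
As a concrete (hypothetical) example, the filtered aggregation could look like this; `filter_id` joins to a separate table that defines each filter, such as a country:

| ad_id | click_minute | filter_id | count |
| ----- | ------------ | --------- | ----- |
| ad001 | 202101010001 | 0012      | 2     |
| ad001 | 202101010001 | 0023      | 3     |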

Should we store raw or aggregated data?

His suggestion -> Do both: keep raw data as a backup (and for recalculation), and keep aggregated data for fast queries

Choosing a database

The system is write-heavy, so a NoSQL store like Cassandra makes sense; SQL is hard to scale at this write volume. An alternative is AWS S3 with a columnar data format like ORC, Parquet, or Avro.

Aggregated data is time-series in nature and is both read- and write-heavy, so the same database choice applies.

High Level design

![[Pasted image 20240213083652.png]]

With services writing directly to the database, it’s too easy for data to get lost and for problems to occur when a component fails

Use message queues to decouple the components and make the pipeline more durable

![[Pasted image 20240213090031.png]]

Aggregation Service

MapReduce framework to aggregate ad click events

Use a DAG (directed acyclic graph) to define the different steps of what to do with the data

![[Pasted image 20240213090339.png]]

Why not just use Kafka?

Aggregate node

This is the MapReduce paradigm, and it is quite popular
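
A minimal single-process sketch of the map → aggregate → reduce flow, assuming click events are plain dicts (node boundaries and names are illustrative, not the book's code):

```python
from collections import Counter

def map_node(events, num_aggregators):
    """Route events by ad_id so the same ad always lands on the same aggregate node."""
    partitions = [[] for _ in range(num_aggregators)]
    for event in events:
        partitions[hash(event["ad_id"]) % num_aggregators].append(event)
    return partitions

def aggregate_node(partition):
    """Count clicks per ad_id within one partition."""
    return Counter(event["ad_id"] for event in partition)

def reduce_node(partials, n):
    """Merge the partial counts and keep the top-N most clicked ads."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total.most_common(n)

events = [{"ad_id": "ad001"}, {"ad_id": "ad002"}, {"ad_id": "ad001"}]
partitions = map_node(events, num_aggregators=2)
print(reduce_node([aggregate_node(p) for p in partitions], n=3))
```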

Design Deep Dive

Streaming vs. Batching

Type of stream processing system

Three kinds of systems: services (online), batch processing (offline), and streaming processing (near real-time)

![[Pasted image 20240213090928.png]]

Data Recalculation

If there are bugs, sometimes we want to recalculate the aggregated data from the raw data.

Timestamp

When should we generate a timestamp? When a click happens (event time), or when it’s processed (processing time)?

Because we’re using message queues and distributed systems, the gap between event time and processing time can be large

![[Pasted image 20240213091417.png]]

If we use event time as the timestamp, we can add a watermark to handle events that arrive late due to delays
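
A rough sketch of the watermark idea, assuming event-time windows that are held open a little longer before being finalized (the 15-second slack is an arbitrary choice):

```python
WINDOW_SECONDS = 60
WATERMARK_SECONDS = 15  # extra slack so late-arriving events are still counted

def can_close_window(window_end, processing_time):
    # The window is only finalized once the watermark has passed,
    # trading a little extra latency for better accuracy.
    return processing_time >= window_end + WATERMARK_SECONDS

print(can_close_window(window_end=1620000060, processing_time=1620000070))  # False
print(can_close_window(window_end=1620000060, processing_time=1620000080))  # True
```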

Aggregation window

DDIA describes 4 types of windows: tumbling (fixed), hopping, sliding, and session

Tumbling and sliding windows will be used in this system
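
A minimal sketch of a tumbling window, assuming one-minute buckets keyed by event time (the event shape is made up):

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def bucket(event_time):
    # Tumbling windows: fixed-size and non-overlapping; every event falls in exactly one.
    return event_time - (event_time % WINDOW_SECONDS)

counts = defaultdict(int)  # (ad_id, window_start) -> clicks
events = [{"ad_id": "ad001", "event_time": 1620000012},
          {"ad_id": "ad001", "event_time": 1620000059},
          {"ad_id": "ad001", "event_time": 1620000061}]
for e in events:
    counts[(e["ad_id"], bucket(e["event_time"]))] += 1
print(dict(counts))  # the first two clicks share a window; the third opens a new one
```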

Delivery Guarantees

Aggregation results are used for billing, so data accuracy and completeness are important

Message Queues

Data deduplication

![[Pasted image 20240213092215.png]]

Use external storage such as HDFS or S3 to record the consumer’s offset

If the consumer crashes before the offset is updated, messages can be lost or processed twice, depending on when the offset is saved relative to the processing

It’s not easy to dedupe data in large-scale systems, and exactly-once processing is quite difficult to achieve
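
A sketch of the consume loop this implies; `queue`, `aggregator`, and `offset_store` are hypothetical stand-ins, and the point is the ordering of the downstream push versus the offset update:

```python
def consume_loop(queue, aggregator, offset_store):
    """Pull -> process -> push downstream -> only then persist the offset.
    Crashing before the offset is saved means reprocessing (duplicates);
    saving the offset before the push would instead risk message loss."""
    while True:
        message, offset = queue.pull()
        result = aggregator.process(message)
        aggregator.push_downstream(result)  # wait for downstream acknowledgement
        offset_store.save(offset)           # e.g., a small file in HDFS or S3
```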

Scale the System

Adding or removing consumers triggers a rebalance that can take a few minutes, so be careful and do it during off-peak hours

Brokers

Scale aggregation service

1 - Allocate events with different ad_ids to different threads, like figure 6.23
2 - Deploy the service on a resource provider such as Apache Hadoop YARN

![[Pasted image 20240213092739.png]]

Option 2 is more widely used, but option 1 is easier to implement; a sketch of option 1 follows
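
A sketch of option 1, assuming a fixed thread pool where each ad_id consistently hashes to one thread:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 4

def count_partition(partition):
    return Counter(e["ad_id"] for e in partition)

def aggregate_multithreaded(events):
    # The same ad_id always hashes to the same thread, so per-ad counts never race.
    partitions = [[] for _ in range(NUM_THREADS)]
    for e in events:
        partitions[hash(e["ad_id"]) % NUM_THREADS].append(e)
    with ThreadPoolExecutor(max_workers=NUM_THREADS) as pool:
        total = Counter()
        for partial in pool.map(count_partition, partitions):
            total.update(partial)
    return total

print(aggregate_multithreaded([{"ad_id": "ad001"}, {"ad_id": "ad002"}]))
```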

Database

Cassandra natively supports horizontal scaling

Hotspot issue

Major companies can spend millions on ads, so their ads will get far more clicks. If an ad is popular, we can allocate more aggregation nodes to it (a sketch follows the figure below)

![[Pasted image 20240213093021.png]]
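
A back-of-envelope sketch of sizing extra nodes for a hot ad_id; the per-node capacity is a made-up number:

```python
NODE_CAPACITY = 100_000  # hypothetical clicks per window that one node can handle

def nodes_needed(clicks_per_window):
    # Hot ads are split across several aggregation nodes;
    # their partial counts are summed in a final merge step.
    return max(1, -(-clicks_per_window // NODE_CAPACITY))  # ceiling division

print(nodes_needed(50_000))   # 1 -> a single node is enough
print(nodes_needed(350_000))  # 4 -> a hot ad spread over extra nodes
```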

Fault Tolerance

The aggregation service keeps its state in memory, so if a node goes down, its in-flight data is lost.

Best to save the consumed offset and replay events from there

Take periodic snapshots of the in-memory state so recovery doesn’t have to replay everything from the beginning
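
A minimal sketch of what a snapshot could hold, using a local JSON file for illustration (a real system would use durable storage):

```python
import json

def take_snapshot(path, offset, partial_counts):
    # Persist the consumed offset together with the in-memory window counts,
    # so the two stay consistent with each other.
    with open(path, "w") as f:
        json.dump({"offset": offset, "counts": partial_counts}, f)

def recover(path):
    # On restart: reload the state, then replay events after the saved offset.
    with open(path) as f:
        snap = json.load(f)
    return snap["offset"], snap["counts"]

take_snapshot("/tmp/agg_snapshot.json", offset=12345, partial_counts={"ad001": 7})
print(recover("/tmp/agg_snapshot.json"))  # (12345, {'ad001': 7})
```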

Data monitoring and Correctness

It’s critical to monitor the system’s health to ensure correctness, because we deal with billing and RTB auctions

Continuous Monitoring

Measure Latency and Message Queue Size

Reconciliation

Unlike in finance, we don’t have external records (e.g., bank statements) to check against.

Instead, at the end of the day a batch job can sort the click events in every partition by event time, re-aggregate them, and compare the result with the real-time aggregated data (sketched after the figure)

![[Pasted image 20240213093435.png]]
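
A sketch of that end-of-day reconciliation job, assuming raw events and real-time counts are available as plain Python objects:

```python
from collections import Counter

def batch_aggregate(raw_events):
    # End-of-day batch job: sort by event time, then recount from the raw data.
    ordered = sorted(raw_events, key=lambda e: e["event_time"])
    return Counter(e["ad_id"] for e in ordered)

def reconcile(raw_events, realtime_counts):
    # Flag every ad whose batch count disagrees with the streaming result.
    batch_counts = batch_aggregate(raw_events)
    return {ad: (batch_counts[ad], realtime_counts.get(ad, 0))
            for ad in set(batch_counts) | set(realtime_counts)
            if batch_counts[ad] != realtime_counts.get(ad, 0)}

raw = [{"ad_id": "ad001", "event_time": 2}, {"ad_id": "ad001", "event_time": 1}]
print(reconcile(raw, {"ad001": 1}))  # {'ad001': (2, 1)} -> mismatch to investigate
```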

Generalist system design interviews don’t require knowing the internals of each piece of specialized software used in a big data pipeline

Discussing thought process and trade-offs is very important.

This chapter focused on the approach that gets the best results in a generalist system design interview


Stream

How do we aggregate over a 24-hour window (or any arbitrary window)?

Scale Consumers by adding / removing nodes based on the flow of data

A resource manager on the nodes to balance load? Threads to balance load?