Alex Xu V2 - 5 Metrics Monitoring and Alerting System

TLDR

Problem

Popular services include

Step 1 - Understand Problem and Establish Scope

Who are we building the system for? Is it in-house? Or is it SaaS?

Which metrics do we want to collect?

What is the scale of the infrastructure we are monitoring with the system?

How long to keep data?

Do we reduce the resolution of the metrics data for long-term storage?

What are supported alert channels?

Do we need to collect logs?

Do we need to support distributed system tracing?

High-level requirements and assumptions

Now you have finished gathering requirements from the interviewer and have a clear scope of the design. The requirements are:

Non-Functional Requirements

Scalability, low latency, reliability, flexibility

Out of scope

Step 2 - Propose High-level Design and Get Buy-in

Fundamentals

Examples of Metrics

1 -> Finding CPU Load of a specific machine

Time series Object
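
A time series can be sketched as a metric name, a set of labels, and a list of (timestamp, value) pairs. The class, field names, and sample values below are illustrative assumptions, not the schema of any particular database.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """One series: a metric name, identifying labels, and timestamped values."""
    name: str
    labels: dict[str, str]
    points: list[tuple[int, float]] = field(default_factory=list)  # (unix_ts, value)

    def append(self, timestamp: int, value: float) -> None:
        self.points.append((timestamp, value))

    def series_key(self) -> str:
        """Name plus sorted labels identifies the series."""
        label_part = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{label_part}}}"


# Hypothetical host and values, for illustration only.
cpu = TimeSeries("cpu.load", {"host": "i631", "env": "prod"})
cpu.append(1613707265, 0.29)
cpu.append(1613707275, 0.31)
print(cpu.series_key())
```

Sorting the labels keeps the key stable regardless of the order in which labels were supplied.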

Data Access Pattern

Very write-heavy: there are a lot of time-series data points

10M operational metrics written per day

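Read load is bursty / spiky. Dashboards and alerting queries hit the database in bursts when they need metrics for alerts or visualizations

Time-series queries are awkward to express in SQL, and a general-purpose relational database is not optimized for them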

Cassandra is great for heavy writes though

There are databases designed for time-series data

You don’t need to know the internals of these databases for the interview; the topic is too niche

Step 3 - Design Deep Dive

Metrics Collection

Occasional data loss isn’t terrible for metrics such as counters or CPU usage

Pull vs. Push models

Pull

Push Model

Pros and Cons to both ![[Pasted image 20240228083211.png]]
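
The difference between the two models comes down to who initiates the transfer. The in-process sketch below is a simplification under stated assumptions: `Collector`, `Service`, and their methods are hypothetical names, and a plain method call stands in for the network request a real system would make.

```python
class Collector:
    """Receives metrics; in the pull model it also initiates the transfer."""

    def __init__(self):
        self.received = []

    def receive(self, source, metrics):
        self.received.append((source, metrics))

    def pull_from(self, targets):
        # Pull: the collector knows the list of targets and asks each one.
        for target in targets:
            self.receive(target.name, target.read_metrics())


class Service:
    """A monitored service that exposes a request counter."""

    def __init__(self, name):
        self.name = name
        self.request_count = 0

    def read_metrics(self):
        # Stands in for an endpoint that a collector would scrape.
        return {"requests": self.request_count}

    def push_to(self, collector):
        # Push: the service (or an agent beside it) initiates the transfer.
        collector.receive(self.name, self.read_metrics())


collector = Collector()
web = Service("web-1")
web.request_count = 5

collector.pull_from([web])  # pull model
web.push_to(collector)      # push model
print(collector.received)
```

In the pull model the collector needs a way to discover targets; in the push model each service needs to know where the collector lives.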

Metrics Transmission

Probably use Kafka for scale

Kafka topics can be further partitioned, for example by metric name
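
Partitioning can be sketched as hashing a key to a partition index, so that every data point for the same key lands on the same partition. Using the metric name as the key, the partition count of 8, and CRC32 as the hash are all assumptions for illustration; a real Kafka producer applies its own partitioner.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_for(metric_name: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a metric name to a partition index.

    CRC32 is a stand-in for the producer's real hash function; it is
    deterministic across runs, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(metric_name.encode("utf-8")) % num_partitions


for name in ["cpu.load", "memory.used", "cpu.load"]:
    print(name, "->", partition_for(name))
```

Because the mapping is deterministic, a consumer reading one partition sees all points for the metrics assigned to it, which makes per-metric aggregation straightforward.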

Alternative to Kafka

Where to Aggregate the data?

Client-side aggregation can only handle simple logic, such as summing a counter before it is sent
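
A minimal sketch of simple client-side aggregation: increments are summed locally and sent as one batch per flush. `CounterAggregator` and the metric names are hypothetical; a real agent would call `flush()` on a timer and send the batch over the network.

```python
from collections import Counter


class CounterAggregator:
    """Sums counter increments locally so only totals are sent."""

    def __init__(self):
        self._pending = Counter()

    def increment(self, metric: str, amount: int = 1) -> None:
        self._pending[metric] += amount

    def flush(self) -> dict:
        """Return the accumulated totals and reset the local state."""
        batch = dict(self._pending)
        self._pending.clear()
        return batch


agg = CounterAggregator()
for _ in range(3):
    agg.increment("http.requests")
agg.increment("http.errors")

batch = agg.flush()  # one message instead of four
print(batch)
```

The trade-off is that anything still pending is lost if the client crashes before the next flush.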

Ingestion Pipeline

Query side

Query Service

Cache Layer

Most popular industrial-scale services already have a query service and cache layer, either built in or through plugins

InfluxDB has its own query language, Flux, as do most other metrics databases

Storage Layer

Compression
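
These notes do not say which compression scheme is meant. One technique commonly used for time-series timestamps is delta-of-delta encoding, sketched here as an assumption: because points usually arrive at a near-constant interval, storing the change in the interval yields mostly zeros and other small numbers, which take few bits to store.

```python
def delta_of_delta_encode(timestamps):
    """Store the first timestamp, then the change in delta between points."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        encoded.append(delta - prev_delta)
        prev_delta = delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Sample timestamps are illustrative: 10-second spacing with one early arrival.
raw = [1610087371, 1610087381, 1610087391, 1610087400]
encoded = delta_of_delta_encode(raw)
print(encoded)  # [1610087371, 10, 0, -1]
assert delta_of_delta_decode(encoded) == raw
```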

Downsampling
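
Downsampling trades resolution for storage by rolling high-resolution points up into coarser buckets. The sketch below averages values per fixed-width bucket; the function name, the choice of mean as the rollup, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Five 10-second-resolution points rolled up to 1-minute resolution.
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 40.0), (70, 60.0)]
rolled = downsample(raw, 60)
print(rolled)  # [(0, 20.0), (60, 50.0)]
```

Other rollups (max, min, sum, percentiles) may fit better depending on the metric; a mean hides short spikes.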

Cold Storage

Buy or Build your own Alert / Visualization

Grafana is a strong off-the-shelf option for visualization