TLDR
- How to Make a Map for a Service
- Quad Tree vs. Geohash vs. Google S2
- How to create API endpoints
- How to design SQL Tables / Using Replicas
- How caching works (101-level)
It’s like Yelp / Google Maps
Clarifying questions
- Will the range expand if nothing is available?
- Will businesses sign themselves up?
- What happens if the user is moving?
Requirements
- User can change their radius
- Businesses can add, delete, update their info
- Don’t worry about users moving
Back of Envelope Estimations for 100M Daily Users and 200M Businesses
- ~100k (10^5) seconds in a day
- 100M daily users = 10^8
- 10^8 queries / 10^5 seconds -> 10^3 -> 1,000 QPS (at one query per user per day)
- Users make ~5 queries every day
- 5 x 1,000 QPS -> 5,000 QPS
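A quick sanity check of the arithmetic, as a sketch (the constants just restate the estimates above):

```python
# Back-of-envelope QPS check using the estimates above.
seconds_per_day = 24 * 60 * 60        # 86,400, roughly 10^5
daily_users = 100_000_000             # 100M = 10^8
queries_per_user_per_day = 5

qps = daily_users * queries_per_user_per_day / seconds_per_day
print(round(qps))  # ~5,800 -> on the order of 5,000 QPS
```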
Obvious heavy-read application with low writes.
We’ll have an API with a GET for search/nearby, and an API that handles CRUD for adding, updating, and deleting businesses:
- GET -> /v1/search/nearby
- GET -> /v1/business
- POST -> /v1/business
- PUT -> /v1/business
- DELETE -> /v1/business
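A minimal sketch of the nearby-search endpoint, assuming FastAPI; the parameter names and stub response are illustrative, not the chapter’s exact schema:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/v1/search/nearby")
def search_nearby(latitude: float, longitude: float, radius: int = 5000):
    # Look up businesses near (latitude, longitude) within `radius` meters.
    # The geohash/quadtree index lookup is stubbed out here.
    return {"total": 0, "businesses": []}
```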
How do we split the earth to index all of these businesses?
- Naive -> Query the raw 2-D lat/long space directly (inefficient at scale)
- Zip-codes / Cities / Even distribution
- Geohashing -> Recursively subdivide the earth’s lat/long grid, adding bits at each level (encoder sketch below)
- 00010100
- QuadTree / R Trees
- Google S2 Geometry Library ![[Pasted image 20240503105401.png]]
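To make geohashing concrete, here is a pure-Python encoder sketch (no external dependencies; the precision and sample coordinates are illustrative):

```python
# Geohash: interleave longitude/latitude bits (longitude first) and emit
# one base32 character per 5 bits; each extra character shrinks the cell.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    result, bits, bit_count = [], 0, 0
    even = True  # even bits refine longitude, odd bits refine latitude
    while len(result) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)

print(geohash_encode(37.7749, -122.4194))  # "9q8yyk" (San Francisco)
```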
Step 3 - Deep Dive
Scale the DB
- Business Table -> Can be sharded by business ID
- Geospatial Index Table
- Better to store one row per (geohash, business_id) pair, duplicating rows for businesses that share a geohash
- Option 1 is to map each geohash to a list of business IDs in a single row (this is slow for updates, adds, and deletes)
- Scaling the geospatial index isn’t necessary because it’s only ~1.71 GB of memory
- Depending on QPS, the server needs more CPU or network bandwidth
- Spread the read load over multiple database servers
- Xu says it’s easier to just use read replicas; sharding gets too complicated with geohash
QuadTree
- Data structure used to partition two-dimensional space by recursively subdividing it into four quadrants until a criterion is met (e.g., no more than ~100 businesses per leaf); see the sketch after the figures
- Use In-memory Tree Structure to answer queries
- Build Tree at Server run-time ![[Pasted image 20240503123546.png]]
- ![[Pasted image 20240503123637.png]]
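A minimal quadtree sketch under the subdivision criterion above (the leaf capacity of 100 is an assumption, and all names are illustrative):

```python
# Quadtree: recursively split a lat/long bounding box into four quadrants
# whenever a leaf exceeds MAX_BUSINESSES points.
from dataclasses import dataclass, field

MAX_BUSINESSES = 100  # assumed leaf capacity

@dataclass
class QuadTreeNode:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float
    businesses: list = field(default_factory=list)  # (lat, lon, business_id)
    children: list = field(default_factory=list)    # empty for leaf nodes

    def insert(self, lat: float, lon: float, business_id: int) -> None:
        if self.children:
            self._child_for(lat, lon).insert(lat, lon, business_id)
            return
        self.businesses.append((lat, lon, business_id))
        if len(self.businesses) > MAX_BUSINESSES:
            self._subdivide()

    def _subdivide(self) -> None:
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        self.children = [
            QuadTreeNode(self.min_lat, self.min_lon, mid_lat, mid_lon),  # SW
            QuadTreeNode(self.min_lat, mid_lon, mid_lat, self.max_lon),  # SE
            QuadTreeNode(mid_lat, self.min_lon, self.max_lat, mid_lon),  # NW
            QuadTreeNode(mid_lat, mid_lon, self.max_lat, self.max_lon),  # NE
        ]
        for lat, lon, bid in self.businesses:
            self._child_for(lat, lon).insert(lat, lon, bid)
        self.businesses = []

    def _child_for(self, lat: float, lon: float) -> "QuadTreeNode":
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        return self.children[(2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)]
```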
But how does it handle adding, removing, and updating nodes?
- Easiest way is to just rebuild the tree on a small subset of servers at a time, working through the entire cluster
- Means some servers have stale data for a bit
- Can set up a business agreement that changes take up to one day to be reflected
- It’s also possible to update the quadtree on the fly, but this complicates the design
- Especially if the tree is read by many threads (locking required) ![[Pasted image 20240503124151.png]]
Geohashing vs Quadtree
Geohashing is easy to update and easy to implement
Quadtree requires traversing the tree to delete nodes
- It’s also more complicated to implement
- And the tree must be built up front
Google S2
- Google S2 Geometry Library is also used in apps like Google Maps and Tinder
- In-memory solution
- Maps the sphere to a 1D index on the “Hilbert curve”
![[Pasted image 20240503110054.png]]
Because it approximately maps physical locations, it can be used to make geofences
- A geofence is a virtual perimeter for a real-world geographic area
- Can send notifications if a user enters an area of interest
- S2 has a region cover algorithm (RegionCoverer)
- Can specify the min level, max level, and max cells to get more granular results (see the sketch after this list)
- More flexible
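A minimal region-cover sketch, assuming the third-party `s2sphere` Python port of S2 (the levels, cell cap, and sample rectangle are illustrative):

```python
import s2sphere

# Cover a small lat/long rectangle with S2 cells, bounded by min/max level
# and a cap on the total cell count.
coverer = s2sphere.RegionCoverer()
coverer.min_level = 8
coverer.max_level = 15
coverer.max_cells = 20

p1 = s2sphere.LatLng.from_degrees(37.77, -122.42)
p2 = s2sphere.LatLng.from_degrees(37.78, -122.41)
covering = coverer.get_covering(s2sphere.LatLngRect.from_point_pair(p1, p2))
print([cell.to_token() for cell in covering])
```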
Xu’s Recommendation
He recommends talking about geohash or quadtree in the interview; S2 is too complicated to explain clearly in an interview
Deep Dive
How to Scale the Database
- Business Table
- Just use sharding
- Geospatial Index Table
- Option 1 -> Map a geohash to a list of business IDs in a single row (use JSON)
- Option 2 -> Use one SQL row per pair, with a geohash column and a business ID column; see the schema sketch below ![[Pasted image 20240503122718.png]]
Option 2 is way safer with ACID: no need to even take locks, thanks to the compound key (geohash, business_id)
Don’t need to shard because the data is so small
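A minimal schema sketch for option 2, using sqlite3 for illustration (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE geospatial_index (
        geohash     TEXT    NOT NULL,
        business_id INTEGER NOT NULL,
        PRIMARY KEY (geohash, business_id)  -- compound key, no locks needed
    )
""")

# One row per (geohash, business_id) pair.
conn.execute("INSERT INTO geospatial_index VALUES (?, ?)", ("9q8yyk", 343))

# Nearby search: prefix-match the geohash at the chosen precision.
rows = conn.execute(
    "SELECT business_id FROM geospatial_index WHERE geohash LIKE ?",
    ("9q8yy%",),
).fetchall()
print(rows)  # [(343,)]
```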
Improvements
Cache
- Not strictly needed; the dataset is small enough to fit in server memory
- Reading it from the database is about as fast as an in-memory cache at this size
- May need to talk about more numbers with interviewer to see if they really want cache
- Key the cache on the geohash, NOT the lat/long; lat/long values change constantly and rarely repeat
- Use Redis (sketch after this list)
- 200M businesses -> 4 supported radii
- Each business maps to one grid per precision
- About 5 GB of memory: 200M business IDs x 8 bytes x 3 precisions ≈ 4.8 GB
- Can cache business ID for their pages to show to client
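A minimal caching sketch with the redis-py client, keyed on geohash as discussed above (the key format and IDs are illustrative; assumes a local Redis server):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_grid(geohash: str, business_ids: list[int]) -> None:
    # One set per geohash grid, e.g. "geohash:9q8yyk" -> {343, 347}
    r.sadd(f"geohash:{geohash}", *business_ids)

def cached_nearby_ids(geohash: str) -> set:
    # Cache hit returns the precomputed IDs for this grid; on a miss the
    # caller falls through to the geospatial index table.
    return r.smembers(f"geohash:{geohash}")

cache_grid("9q8yyk", [343, 347])
print(cached_nearby_ids("9q8yyk"))
```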
Region and Availability Zones
- Make sure to route users to data centers near them
- Keep privacy laws in mind when deciding where data is stored
Followup -> How to filter by time
We can filter after the query returns, because we only show ~100 nearby businesses
- assuming opening and closing times are columns in the table (see the sketch below the figure)
![[Pasted image 20240503123227.png]]
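A post-query filter sketch, assuming open/close times are fields on the returned rows (field names are illustrative; overnight hours would need extra handling):

```python
from datetime import time

def open_now(business: dict, now: time) -> bool:
    # Simple same-day window check; businesses open past midnight would
    # need a wraparound case.
    return business["open_time"] <= now <= business["close_time"]

nearby = [
    {"business_id": 343, "open_time": time(9), "close_time": time(17)},
    {"business_id": 347, "open_time": time(18), "close_time": time(23)},
]
print([b["business_id"] for b in nearby if open_now(b, time(12))])  # [343]
```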