TLDR
- How to Make a Map for a Service
- Quad Tree vs. Geohash vs. Google S2
- How to create API endpoints
- How to design SQL Tables / Using Replicas
- How caching works (101-level)
It’s like Yelp / Google Maps
Clarifying questions
- Will the range expand if nothing is available?
- Will businesses sign themselves up?
- What happens if the user is moving?
Requirements
- User can change their radius
- Businesses can add, delete, update their info
- Don’t worry about users moving
Back of Envelope Estimations for 100M Daily Users and 200M Businesses
- ~100k (10^5) seconds in a day
- 100M daily users = 10^8
- 10^8 queries / 10^5 seconds -> 10^3 -> 1,000 QPS (at one query per user per day)
- Users make ~5 queries every day
- 5 x 1,000 QPS -> 5,000 QPS
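A quick sanity check of the arithmetic, as a sketch (the constants just restate the estimates above):

```python
# Back-of-envelope QPS check using the estimates above.
seconds_per_day = 24 * 60 * 60        # 86,400, roughly 10^5
daily_users = 100_000_000             # 100M = 10^8
queries_per_user_per_day = 5

qps = daily_users * queries_per_user_per_day / seconds_per_day
print(round(qps))  # ~5,800 -> on the order of 5,000 QPS
```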
Obvious heavy-read application with low writes.
We’ll have an API with a GET for search/nearby, and an API that handles CRUD for adding, updating, and deleting businesses:
- GET -> /v1/search/nearby
- GET -> /v1/business
- POST -> /v1/business
- PUT -> /v1/business
- DELETE -> /v1/business
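A minimal sketch of the nearby-search endpoint, assuming FastAPI; the parameter names and stub response are illustrative, not the chapter’s exact schema:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/v1/search/nearby")
def search_nearby(latitude: float, longitude: float, radius: int = 5000):
    # Look up businesses near (latitude, longitude) within `radius` meters.
    # The geohash/quadtree index lookup is stubbed out here.
    return {"total": 0, "businesses": []}
```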
How do we split the earth to index all of these businesses?
- Naive -> Query the raw 2-D lat/long space directly (inefficient at scale)
- Zip-codes / Cities / Even distribution
- Geohashing -> Recursively subdivide the earth’s lat/long grid, adding bits at each level (encoder sketch below)
- 00010100
- QuadTree / R Trees
- Google S2 Geometry Library ![[Pasted image 20240503105401.png]]
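To make geohashing concrete, here is a pure-Python encoder sketch (no external dependencies; the precision and sample coordinates are illustrative):

```python
# Geohash: interleave longitude/latitude bits (longitude first) and emit
# one base32 character per 5 bits; each extra character shrinks the cell.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    result, bits, bit_count = [], 0, 0
    even = True  # even bits refine longitude, odd bits refine latitude
    while len(result) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:
            result.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(result)

print(geohash_encode(37.7749, -122.4194))  # "9q8yyk" (San Francisco)
```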
Step 3 - Deep Dive
Scale the DB
- Business Table -> Can be sharded by business ID
- Geospatial Index Table
- Better to store one row per (geohash, business_id) pair, duplicating rows for businesses that share a geohash
- Option 1 is to map each geohash to a list of business IDs in a single row (this is slow for updates, adds, and deletes)
- Scaling the geospatial index isn’t necessary because it’s only ~1.71 GB of memory
- Depending on QPS, the server needs more CPU or network bandwidth
- Spread the read load over multiple database servers
- Xu says it’s easier to just use read replicas; sharding gets too complicated with geohash
QuadTree
- Data structure used to partition two-dimensional space by recursively subdividing it into four quadrants until a criterion is met (e.g., no more than ~100 businesses per leaf); see the sketch after the figures
- Use In-memory Tree Structure to answer queries
- Build Tree at Server run-time ![[Pasted image 20240503123546.png]]
- ![[Pasted image 20240503123637.png]]
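A minimal quadtree sketch under the subdivision criterion above (the leaf capacity of 100 is an assumption, and all names are illustrative):

```python
# Quadtree: recursively split a lat/long bounding box into four quadrants
# whenever a leaf exceeds MAX_BUSINESSES points.
from dataclasses import dataclass, field

MAX_BUSINESSES = 100  # assumed leaf capacity

@dataclass
class QuadTreeNode:
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float
    businesses: list = field(default_factory=list)  # (lat, lon, business_id)
    children: list = field(default_factory=list)    # empty for leaf nodes

    def insert(self, lat: float, lon: float, business_id: int) -> None:
        if self.children:
            self._child_for(lat, lon).insert(lat, lon, business_id)
            return
        self.businesses.append((lat, lon, business_id))
        if len(self.businesses) > MAX_BUSINESSES:
            self._subdivide()

    def _subdivide(self) -> None:
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        self.children = [
            QuadTreeNode(self.min_lat, self.min_lon, mid_lat, mid_lon),  # SW
            QuadTreeNode(self.min_lat, mid_lon, mid_lat, self.max_lon),  # SE
            QuadTreeNode(mid_lat, self.min_lon, self.max_lat, mid_lon),  # NW
            QuadTreeNode(mid_lat, mid_lon, self.max_lat, self.max_lon),  # NE
        ]
        for lat, lon, bid in self.businesses:
            self._child_for(lat, lon).insert(lat, lon, bid)
        self.businesses = []

    def _child_for(self, lat: float, lon: float) -> "QuadTreeNode":
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lon = (self.min_lon + self.max_lon) / 2
        return self.children[(2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)]
```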
But how does it handle adding, removing, and updating nodes?
- Easiest way is to just rebuild the tree on a small subset of servers at a time, working through the entire cluster
- Means some servers have stale data for a bit
- Can set up a business agreement that changes take up to one day to be reflected
- It’s also possible to update the quadtree on the fly, but this complicates the design
- Especially if the tree is read by many threads (locking required) ![[Pasted image 20240503124151.png]]
Geohashing vs Quadtree
Geohashing is easy to update and easy to implement
Quadtree requires traversing the tree to delete nodes
- It’s also more complicated to implement
- And the tree must be built up front
Google S2
- Google S2 Geometry Library is also used in apps like Google Maps and Tinder
- In-memory solution
- Maps the sphere to a 1D index on the “Hilbert curve”
![[Pasted image 20240503110054.png]]
Because it approximately maps physical locations, it can be used to make geofences
- A geofence is a virtual perimeter for a real-world geographic area
- Can send notifications if a user enters an area of interest
- S2 has a region cover algorithm (RegionCoverer)
- Can specify the min level, max level, and max cells to get more granular results (see the sketch after this list)
- More flexible
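A minimal region-cover sketch, assuming the third-party `s2sphere` Python port of S2 (the levels, cell cap, and sample rectangle are illustrative):

```python
import s2sphere

# Cover a small lat/long rectangle with S2 cells, bounded by min/max level
# and a cap on the total cell count.
coverer = s2sphere.RegionCoverer()
coverer.min_level = 8
coverer.max_level = 15
coverer.max_cells = 20

p1 = s2sphere.LatLng.from_degrees(37.77, -122.42)
p2 = s2sphere.LatLng.from_degrees(37.78, -122.41)
covering = coverer.get_covering(s2sphere.LatLngRect.from_point_pair(p1, p2))
print([cell.to_token() for cell in covering])
```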
Xu’s Recommendation
He recommends talking about geohash or quadtree in the interview; S2 is too complicated to explain clearly in an interview
Deep Dive
How to Scale the Database
- Business Table
- Just use sharding
- Geospatial Index Table
- Option 1 -> Map a geohash to a list of business IDs in a single row (use JSON)
- Option 2 -> Use one SQL row per pair, with a geohash column and a business ID column; see the schema sketch below ![[Pasted image 20240503122718.png]]
Option 2 is way safer with ACID: no need to even take locks, thanks to the compound key (geohash, business_id)
Don’t need to shard because the data is so small
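A minimal schema sketch for option 2, using sqlite3 for illustration (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE geospatial_index (
        geohash     TEXT    NOT NULL,
        business_id INTEGER NOT NULL,
        PRIMARY KEY (geohash, business_id)  -- compound key, no locks needed
    )
""")

# One row per (geohash, business_id) pair.
conn.execute("INSERT INTO geospatial_index VALUES (?, ?)", ("9q8yyk", 343))

# Nearby search: prefix-match the geohash at the chosen precision.
rows = conn.execute(
    "SELECT business_id FROM geospatial_index WHERE geohash LIKE ?",
    ("9q8yy%",),
).fetchall()
print(rows)  # [(343,)]
```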
Improvements
Cache
- Not strictly needed; the dataset is small enough to fit in server memory
- Reading it from the database is about as fast as an in-memory cache at this size
- May need to talk about more numbers with interviewer to see if they really want cache
- Key the cache on the geohash, NOT the lat/long; lat/long values change constantly and rarely repeat
- Use Redis (sketch after this list)
- 200M businesses -> 4 supported radii
- Each business maps to one grid per precision
- About 5 GB of memory: 200M business IDs x 8 bytes x 3 precisions ≈ 4.8 GB
- Can cache business ID for their pages to show to client
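A minimal caching sketch with the redis-py client, keyed on geohash as discussed above (the key format and IDs are illustrative; assumes a local Redis server):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_grid(geohash: str, business_ids: list[int]) -> None:
    # One set per geohash grid, e.g. "geohash:9q8yyk" -> {343, 347}
    r.sadd(f"geohash:{geohash}", *business_ids)

def cached_nearby_ids(geohash: str) -> set:
    # Cache hit returns the precomputed IDs for this grid; on a miss the
    # caller falls through to the geospatial index table.
    return r.smembers(f"geohash:{geohash}")

cache_grid("9q8yyk", [343, 347])
print(cached_nearby_ids("9q8yyk"))
```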
Region and Availability Zones
- Make sure to route users to data centers near them
- Keep privacy laws in mind when deciding where data is stored
Followup -> How to filter by time
We can filter after the query returns, because we only show ~100 nearby businesses
- assuming opening and closing times are columns in the table (see the sketch below the figure)
![[Pasted image 20240503123227.png]]
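A post-query filter sketch, assuming open/close times are fields on the returned rows (field names are illustrative; overnight hours would need extra handling):

```python
from datetime import time

def open_now(business: dict, now: time) -> bool:
    # Simple same-day window check; businesses open past midnight would
    # need a wraparound case.
    return business["open_time"] <= now <= business["close_time"]

nearby = [
    {"business_id": 343, "open_time": time(9), "close_time": time(17)},
    {"business_id": 347, "open_time": time(18), "close_time": time(23)},
]
print([b["business_id"] for b in nearby if open_now(b, time(12))])  # [343]
```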