Scaling From 5‑Second Queries to 15 Million QPS
The Redis + Geohash playbook that turns GPS fire-hoses into single-digit-ms rides 🚀
so, it’s Friday night in downtown San Francisco 🎉
riders smash “Request a ride”
the Lyft app 🚗 screen freezes for 5 seconds.. enough time to delete and reopen TikTok
cause? a single MongoDB collection locked under a tsunami of GPS pings 😵💫
remedy? 1,000+ Redis shards + geohash 👇
in this post we’ll unpack exactly how Lyft’s location service leaped from one 5‑second query to ≈ 15 million requests per second at single‑digit‑millisecond latency.. and what parts you can steal for your own systems 🤝
what will be covered 👇
TL;DR for the impatient
The Friday‑Night Meltdown 😱
Root cause 🔍
The three‑step Redis makeover 🛠️
Deep dive: a ride‑search request, line‑by‑line
Ops & capacity planning 📊
Re‑usable system‑design takeaways 🤝
Interview one‑liners
the tl;dr
⚠️ most of the points here come from this presentation
monolithic Mongo collection → global write lock.
reads and writes stalled for ≈ 5s
geohash splits Earth into tiny squares; nearby squares share prefixes (great for range scans - tiny demo right after this list 👇)
each square → own Redis sorted‑set on its own shard
lookups = O(#cells) time complexity ≈ 9 sorted sets per query
nutcracker (twemproxy) + consistent hashing spread the load across ~1 k shards
lose one → only that slice of the map blinks
result: 15 M QPS @ p99 < 10 ms
.. and headroom to 4× peak
perfect match for soft‑state data - drivers resend GPS every few seconds, so cache loss == no drama
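about that prefix trick: here's a tiny sketch, assuming the open-source ch.hsr.geohash library (the talk doesn't say which encoder Lyft uses) and an illustrative downtown-SF coordinate:

```java
import ch.hsr.geohash.GeoHash;

public class GeohashPrefixDemo {
    public static void main(String[] args) {
        // an illustrative point in downtown SF (not from the talk)
        double lat = 37.7749, lon = -122.4194;

        // finer precision = longer hash; the coarser cell is simply a prefix of the finer one
        String fine = GeoHash.withCharacterPrecision(lat, lon, 7).toBase32();
        String cell = GeoHash.withCharacterPrecision(lat, lon, 5).toBase32();
        System.out.println(fine);                  // 7-character hash (small square)
        System.out.println(cell);                  // the first 5 characters of the line above (bigger square)
        System.out.println(fine.startsWith(cell)); // true - prefixes encode containment

        // a point a block away lands in the same (or an adjacent) 5-character cell
        System.out.println(GeoHash.withCharacterPrecision(37.7752, -122.4190, 5).toBase32());
    }
}
```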
the Friday‑Night meltdown
back in 2014 a rider search looked like this:
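(roughly - the talk doesn't show the actual code, so this is a minimal sketch assuming the MongoDB Java sync driver, with made-up database, collection and field names, not Lyft's real schema)

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.geojson.Point;
import com.mongodb.client.model.geojson.Position;
import org.bson.Document;

public class LegacyRideSearch {
    public static void main(String[] args) {
        MongoClient client = MongoClients.create("mongodb://localhost:27017");
        MongoCollection<Document> drivers =
                client.getDatabase("rides").getCollection("driver_locations");

        // one $near query against a single big, geo-indexed collection -
        // every rider search and every driver GPS write contends on the same hot index region
        Point rider = new Point(new Position(-122.4194, 37.7749)); // lng, lat
        for (Document d : drivers.find(Filters.near("location", rider, 2000.0, 0.0))) {
            System.out.println(d.getString("driver_id"));
        }

        client.close();
    }
}
```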
every write/read hit the same index rows for busy regions.
Friday-night surge? → global lock → 5-second response times
i.e. riders & drivers churned..
what exactly was wrong?
one giant table for driver GPS
sharding by city → hot shards (SF has way more traffic than Madeira 😅)
reads and writes collide on same rows
disk‑backed DB for data that expires in 30 s ⇒ wasted IO
the three‑step Redis makeover
now, let's move to a quick code demo (with some pseudo-code) 👇
the write path - every driver ping:
encodes the driver's latitude/longitude into a 5-character geohash cell
writes the raw coordinate (SET) and adds the driver ID to a time-scored sorted set for that cell (ZADD)
performs an ad-hoc TTL by deleting entries older than 30 seconds (ZREMRANGEBYSCORE)
executes all three commands in one Redis pipeline round-trip (via pipe.sync())
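here's a minimal sketch of that write path, assuming a Jedis client and the ch.hsr.geohash library - the key names and connection details are made up, and in production the client would talk to twemproxy rather than a single Redis:

```java
import ch.hsr.geohash.GeoHash;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

public class DriverLocationWriter {

    // illustrative: a single local Redis; in Lyft's setup the client talks to twemproxy
    private final Jedis jedis = new Jedis("localhost", 6379);

    public void recordPing(String driverId, double lat, double lon) {
        long nowSec = System.currentTimeMillis() / 1000;

        // 5-character geohash = the cell that becomes the Redis key (and the sharding unit)
        String cell = GeoHash.withCharacterPrecision(lat, lon, 5).toBase32();

        Pipeline pipe = jedis.pipelined();
        pipe.set("driver:" + driverId, lat + "," + lon);       // raw coordinate, used later for ranking
        pipe.zadd("cell:" + cell, nowSec, driverId);           // time-scored sorted set per cell
        pipe.zremrangeByScore("cell:" + cell, 0, nowSec - 30); // ad-hoc TTL: drop pings older than 30 s
        pipe.sync();                                           // one round-trip for all three commands
    }
}
```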
the read path - every rider search:
computes the rider's current geohash cell and its eight neighbors
pipelines a ZRANGE for each cell to fetch driver IDs in a single network hop
flattens & de-duplicates the IDs into a Java Set<String> for later ranking (e.g., by ETA)
… only 9 Redis reads - all parallelised by the proxy
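and a matching sketch of the read path, under the same assumptions (plus Jedis 4.x, where zrange returns a List):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import ch.hsr.geohash.GeoHash;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class NearbyDriverFinder {

    // illustrative: real traffic goes through twemproxy, not a single Redis
    private final Jedis jedis = new Jedis("localhost", 6379);

    public Set<String> findNearby(double riderLat, double riderLon) {
        // rider's cell + its 8 neighbors = 9 keys to read
        GeoHash center = GeoHash.withCharacterPrecision(riderLat, riderLon, 5);
        List<String> cells = new ArrayList<>();
        cells.add(center.toBase32());
        for (GeoHash neighbor : center.getAdjacent()) {
            cells.add(neighbor.toBase32());
        }

        // one pipeline = one network hop for all 9 ZRANGE calls
        Pipeline pipe = jedis.pipelined();
        List<Response<List<String>>> replies = new ArrayList<>();
        for (String cell : cells) {
            replies.add(pipe.zrange("cell:" + cell, 0, -1));
        }
        pipe.sync();

        // flatten & de-duplicate driver IDs for later ranking (e.g., by ETA)
        Set<String> driverIds = new HashSet<>();
        for (Response<List<String>> reply : replies) {
            driverIds.addAll(reply.get());
        }
        return driverIds;
    }
}
```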
a ride‑search walkthrough..
driver ping → app → API hashes the location & hits twemproxy
twemproxy hashes the geo_id → picks a shard → SET + ZADD
rider search → backend computes the rider's cell + 8 neighbors
fan-out: 9 ZRANGE calls → twemproxy multiplexes them
fan-in: merge results, filter by ETA & driver state, return < 30 ms end-to-end
ops & capacity planning 📊
1 k+ Redis instances across multiple EC2 ASGs (Auto Scaling Groups)
4× headroom rule ⇒ survive the loss of an AZ (Availability Zone) without throttling
Nutcracker keeps the client connection count low & supports pipeline retries (minimal config sketch after this list)
dashboards: QPS (Queries Per Second) per shard, memory fragmentation, eviction hits
chaos drill: kill 5% of nodes weekly; expect auto‑heal in < 1 min
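for the curious, a nutcracker pool definition looks roughly like this - a minimal sketch; the addresses, names and tuning values below are made up, not Lyft's:

```yaml
geo_pool:
  listen: 0.0.0.0:22121
  hash: fnv1a_64            # how keys are hashed
  distribution: ketama      # consistent hashing ring
  redis: true               # speak the Redis protocol
  auto_eject_hosts: true    # temporarily drop a dead shard from the ring
  server_retry_timeout: 30000
  server_failure_limit: 3
  servers:                  # ip:port:weight name - one entry per Redis shard
    - 10.0.0.1:6379:1 geo-shard-0001
    - 10.0.0.2:6379:1 geo-shard-0002
    - 10.0.0.3:6379:1 geo-shard-0003
```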
re‑usable takeaways 🤝
shard by access pattern, not by obvious domain entity (cities were prone to hot-shards; 1‑km squares weren’t)
soft‑state (GPS pings) → in‑memory cache; recovery = data replay, not backup restore
sorted sets double as index and TTL sweeper
keep capacity = 4 × peak so one failure ≠ outage
pipeline aggressively (MULTI/EXEC or client pipelining) to amortise network RTT
interview one‑liners 🎤
you can use these as one-liners in interviews - speaking from experience, interviewers really love it when you drop such 'flash references'. why?
because it shows you're interested in diving deeper into problems and seeing how the big guys did it! 🤝
geohash everywhere: “We slice Earth into 1 km Geohash cells; each cell lives in its own Redis sorted‑set. A rider query touches 9 cells → constant‑time lookup.”
soft‑state cache: “Driver GPS is ephemeral—if a Redis shard dies, drivers re‑ping within 30 s, so availability trumps persistence.”
resilience by design: “Consistent hashing + 4× headroom lets us pull the plug on a shard and still push 15 M QPS at p99 < 10 ms.”
references & further reading 📚
Daniel Hochman, Geospatial Indexing at Scale – The 15 M QPS Redis Architecture Powering Lyft 👇
Redis GEO commands docs (why Lyft rolled its own - additionally check this one for the product PoV)
a Lyft engineer presented this on Redis's YouTube channel. Top-tier video, definitely worth a watch!
see you next Sunday! ❤️
hope you found this valuable! The video and going deeper with the blog article really helped me master the theory behind scaling real-world systems!
let’s crush it this week!
oh.. here’s a link to the YouTube vid that went over the whole migration!
Drop a ❤️ to help me spread the knowledge & to let me know you’d like more of this!