Modern ranking systems (feeds, search, recommendations) have strict latency budgets, often under 200 ms at p99.
This write-up describes how we designed a production system using a decoupled microservice architecture for serving, a feature + vector store data layer, and an automated MLOps pipeline for training → deployment.
This is less about modeling, more about the infrastructure that keeps it all running.
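To give that a rough sense of scale, here's one illustrative way a 200 ms p99 budget can be split across the serving path; these numbers are hypothetical, not an actual production allocation.

    # Illustrative split of a 200 ms p99 budget across the serving path.
    # The numbers are hypothetical, not a production allocation.
    LATENCY_BUDGET_MS = {
        "gateway_and_auth": 10,
        "feature_fetch": 30,          # online feature store lookups
        "retrieval": 40,              # ANN / inverted-index candidate generation
        "scoring": 80,                # heavy ranking model inference
        "ordering_and_policies": 20,  # diversity, freshness, dedup rules
        "response_assembly": 20,
    }
    assert sum(LATENCY_BUDGET_MS.values()) == 200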
Retrieval is the stage where a ranking system narrows billions of items down to a few hundred candidates, fast enough for real-time use. It’s the least visible but most constrained layer: latency budgets, freshness, and recall all collide here.
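To make that concrete, here's a minimal candidate-generation sketch using FAISS as one common ANN option; the index type, embedding dimension, and candidate count are illustrative, not a recommendation.

    # Minimal ANN candidate-generation sketch (parameters are illustrative).
    import numpy as np
    import faiss

    DIM, N_ITEMS, TOP_K = 128, 1_000_000, 500

    # Offline: build an IVF index over the item embeddings.
    item_vectors = np.random.rand(N_ITEMS, DIM).astype("float32")
    quantizer = faiss.IndexFlatIP(DIM)
    index = faiss.IndexIVFFlat(quantizer, DIM, 1024, faiss.METRIC_INNER_PRODUCT)
    index.train(item_vectors)
    index.add(item_vectors)

    # Online: embed the request and pull back a few hundred candidates.
    index.nprobe = 32  # recall vs. latency knob
    query = np.random.rand(1, DIM).astype("float32")
    scores, candidate_ids = index.search(query, TOP_K)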
A concise explainer of the standard four-stage architecture used in most modern recommendation and ranking systems: retrieval, scoring, ordering, and feedback.
It walks through how these stages connect in production systems like search, feeds, and content recommendations, with diagrams and examples.
Part of a five-part series exploring each stage in more detail this week.
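Since the four stages are the skeleton everything else hangs off, here's a rough schematic of how a single request flows through them; every function body is a toy stand-in, not an actual implementation.

    # Schematic request flow through the four stages (toy stand-ins, not real logic).
    from dataclasses import dataclass
    import random

    @dataclass
    class RankedItem:
        item_id: str
        score: float

    def retrieve(user_id: str, k: int = 500) -> list[str]:
        # Stage 1: narrow the full catalog to a few hundred candidates (ANN or inverted index).
        return [f"item_{i}" for i in range(k)]

    def score(user_id: str, candidates: list[str]) -> list[RankedItem]:
        # Stage 2: a heavier model (GBDT / neural ranker) scores each candidate with richer features.
        return [RankedItem(c, random.random()) for c in candidates]

    def order(scored: list[RankedItem]) -> list[RankedItem]:
        # Stage 3: business rules on top of raw scores (diversity, freshness, dedup).
        return sorted(scored, key=lambda r: r.score, reverse=True)

    def log_feedback(user_id: str, served: list[RankedItem]) -> None:
        # Stage 4: record what was shown so clicks/skips feed the next training run.
        pass

    def rank_request(user_id: str) -> list[RankedItem]:
        served = order(score(user_id, retrieve(user_id)))
        log_feedback(user_id, served)
        return served

    print(rank_request("user_42")[:3])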
Curious how others here are evolving these pipelines. Are you moving toward more unified (retrieval+scoring) models, or keeping stages separate for latency and control?
Yeah exactly. I was really worried about reducing the serendipity that HN provides (it's arguably why I've used it for so long as well), but the configurability means everyone can tweak their level of personalization and land on their own Goldilocks level.
Part 2 – Data Layer (feature store to prevent online/offline skew; vector DB choices and pre- vs post-filtering): https://www.shaped.ai/blog/the-infrastructure-of-modern-rank...
Part 3 – MLOps Backbone (training pipelines, registry, GitOps deployment, monitoring/drift/A-B): https://www.shaped.ai/blog/the-infrastructure-of-modern-rank...
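For a flavor of the monitoring piece in Part 3, here's a toy feature-drift check that compares a recent serving sample against the training distribution using PSI (population stability index); the bucket count and alert threshold are rule-of-thumb placeholders, not a recommendation.

    # Toy feature-drift check via PSI; bucket count and threshold are placeholders.
    import numpy as np

    def psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
        # Bucket edges come from quantiles of the training (expected) distribution.
        edges = np.quantile(expected, np.linspace(0.0, 1.0, buckets + 1))
        # Clip the serving sample into the training range so every value lands in a bucket.
        actual = np.clip(actual, edges[0], edges[-1])
        e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
        a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
        e_frac = np.clip(e_frac, 1e-6, None)  # avoid log(0)
        a_frac = np.clip(a_frac, 1e-6, None)
        return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

    train_feature = np.random.normal(0.0, 1.0, 50_000)  # offline training distribution
    live_feature = np.random.normal(0.3, 1.1, 5_000)    # recent serving sample
    if psi(train_feature, live_feature) > 0.2:          # common rule-of-thumb threshold
        print("feature drift detected: alert / consider retraining")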
Happy to share more detail (autoscaling policies, index swaps, point-in-time joins, GPU batching) if helpful.
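As a quick taste of one of those: the point-in-time join is what keeps the offline training set consistent with the feature values the online store actually served at request time. A minimal pandas sketch, with invented column names and data:

    # Point-in-time join sketch: each training example gets the feature value
    # that was live at impression time, mirroring what online serving saw.
    # Column names and data are made up for illustration.
    import pandas as pd

    impressions = pd.DataFrame({
        "user_id": [1, 1, 2],
        "ts": pd.to_datetime(["2024-05-01 10:00", "2024-05-02 09:00", "2024-05-01 12:00"]),
        "clicked": [1, 0, 1],
    }).sort_values("ts")

    feature_snapshots = pd.DataFrame({
        "user_id": [1, 1, 2],
        "ts": pd.to_datetime(["2024-04-30 23:00", "2024-05-01 23:00", "2024-04-30 23:00"]),
        "ctr_7d": [0.12, 0.15, 0.08],
    }).sort_values("ts")

    # merge_asof picks, per impression, the most recent snapshot at or before its timestamp.
    training_set = pd.merge_asof(
        impressions, feature_snapshots, on="ts", by="user_id", direction="backward"
    )
    print(training_set)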