data-science · English · 9 min
Building a Recommendation System: From Collaborative Filtering to Two-Tower Models
February 9, 2026
From Netflix to Shopee — how recommendation systems work, what the key algorithms are, and which one you should actually build for your use case.
Netflix saves roughly $1 billion per year from its recommendation system — not because it achieves perfect predictions, but because it reduces churn enough to justify the engineering investment many times over. If you work with user data at any scale, you will eventually face this problem.
This article covers the full landscape: collaborative filtering, matrix factorization, neural architectures, cold start handling, and evaluation metrics — enough to understand each trade-off and make the right design decision for your system.
The Core Problem
Given user u and an item catalog I containing thousands (or millions) of products, movies, songs, or posts — rank a personalized subset of I for u such that the items at the top are the ones the user is most likely to interact with.
In practice, the signal is rarely explicit (star ratings). Most production systems work with implicit feedback — clicks, purchases, plays, views. These signals are noisier and weaker than explicit ratings, but far more abundant. The majority of modern recommendation research focuses on implicit feedback.
Three Main Approaches
Collaborative Filtering
The core assumption: users who agreed in the past will agree in the future. No knowledge of item content is needed — only the history of interactions.
User-based CF finds users whose behavior resembles u, then recommends items they liked that u hasn't seen. The problem: computing similarity across all user pairs is O(n²) — unworkable at millions of users.
Item-based CF finds items similar to items u has already liked. Amazon uses a variant of this. Item similarities are more stable over time than user similarities and can be pre-computed offline.
Both variants compute similarity via cosine similarity or Pearson correlation on the sparse user-item matrix. The appeal is simplicity and interpretability. The main weaknesses are cold start (new users and items have no history) and sparsity (99%+ of entries in the matrix are missing).
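The item-based neighborhood logic can be sketched in a few lines of numpy. The toy matrix, function names, and `k` below are illustrative, not a production implementation:

```python
import numpy as np

# Toy implicit-feedback matrix: rows = users, columns = items (1 = interacted).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def item_cosine_similarity(R):
    """Cosine similarity between the item columns of a user-item matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0            # avoid division by zero for unseen items
    unit = R / norms
    return unit.T @ unit

def recommend_for(u, R, sim, k=2):
    """Score items by similarity to the user's history; mask already-seen items."""
    scores = sim @ R[u]                # aggregate similarity to liked items
    scores[R[u] > 0] = -np.inf         # never re-recommend seen items
    return np.argsort(-scores)[:k]

sim = item_cosine_similarity(R)
print(recommend_for(0, R, sim))        # user 0 has seen items 0 and 1
```

In a real system `R` is a sparse matrix with millions of rows, which is exactly why the similarities are pre-computed offline.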
Content-Based Filtering
Instead of looking at other users, content-based filtering builds a user profile from the features of items the user has liked, then retrieves similar items. Spotify uses audio features — tempo, energy, key — to model musical taste.
The clear advantage: no item cold start, as long as metadata exists. The drawbacks are the filter bubble effect (recommendations become increasingly narrow over time) and the requirement for high-quality item metadata, which takes effort to maintain.
Hybrid
Every serious production system is a hybrid in some form. Netflix uses a weighted combination of CF and content scores. The more common industrial pattern is a cascade: a Two-Tower model generates a candidate set at retrieval time, then a neural ranker re-ranks using richer features. Hybrid systems achieve the best accuracy but are the most complex to build and maintain.
Matrix Factorization: The Foundation Worth Implementing
Pure collaborative filtering struggles with sparsity. Matrix factorization solves this by learning compact latent representations.
The interaction matrix R (users × items) is decomposed into two smaller matrices: R ≈ P × Qᵀ, where P (users × k) and Q (items × k) contain k latent factors — typically 32 to 256.
SVD / Funk SVD minimizes RMSE on observed ratings using stochastic gradient descent. Simon Funk's variant was a key ingredient in winning the Netflix Prize in 2009. The surprise Python library implements this well for explicit rating data.
ALS (Alternating Least Squares) alternates between fixing P and solving for Q in closed form, then fixing Q and solving for P. This works far better for implicit feedback because it can weight unobserved interactions via confidence: c_ui = 1 + α × log(1 + frequency_ui). Items interacted with repeatedly get higher confidence, but the absence of interaction is treated as unknown rather than zero.
The implicit Python library implements ALS with GPU support and handles millions of interactions efficiently. This is the right first tool for implicit feedback problems at moderate scale.
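The confidence weighting from the formula above is easy to see in isolation. A minimal numpy sketch (the interaction counts and the value of α are illustrative; α is a tuning knob, with values around 40 suggested in the original implicit-ALS paper by Hu et al.):

```python
import numpy as np

# Interaction counts (users x items): how many times each user touched each item.
freq = np.array([
    [3, 0, 1],
    [0, 5, 0],
], dtype=float)

alpha = 40.0  # confidence scaling hyperparameter; tune per dataset

# c_ui = 1 + alpha * log(1 + freq_ui). Unobserved entries keep confidence 1,
# i.e. "unknown with low confidence" rather than a hard negative.
confidence = 1.0 + alpha * np.log1p(freq)

print(confidence.round(2))
```

Repeated interactions raise confidence logarithmically, so a user who played a song five times is trusted more than one who played it once, but not five times more.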
Neural Collaborative Filtering
NCF (He et al., 2017) replaces the dot product in matrix factorization with a multi-layer perceptron, enabling the model to learn non-linear user-item interactions.
The full NeuMF architecture runs two parallel branches:
- GMF branch: dot product of user and item embeddings (preserving classical MF)
- MLP branch: embeddings concatenated and passed through several fully-connected layers
The outputs are concatenated and passed through a sigmoid to predict interaction probability.
Empirically, NCF outperforms pure matrix factorization on MovieLens and Pinterest. Adding layers helps, with diminishing returns past 3–4 layers. A PyTorch implementation runs under 100 lines — a natural first step from matrix factorization into deep learning for RecSys.
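A minimal PyTorch sketch of the NeuMF architecture described above — two embedding tables per branch, an element-wise GMF product, an MLP over concatenated embeddings, and a fused sigmoid output. Layer sizes and hyperparameters here are illustrative defaults, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class NeuMF(nn.Module):
    """Minimal NeuMF: a GMF branch and an MLP branch fused before a sigmoid."""
    def __init__(self, n_users, n_items, k=16, hidden=(32, 16)):
        super().__init__()
        # Separate embedding tables per branch, as in He et al. (2017).
        self.user_gmf = nn.Embedding(n_users, k)
        self.item_gmf = nn.Embedding(n_items, k)
        self.user_mlp = nn.Embedding(n_users, k)
        self.item_mlp = nn.Embedding(n_items, k)
        layers, in_dim = [], 2 * k
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        self.mlp = nn.Sequential(*layers)
        self.out = nn.Linear(k + in_dim, 1)   # fuse GMF (k dims) + MLP (in_dim dims)

    def forward(self, users, items):
        gmf = self.user_gmf(users) * self.item_gmf(items)   # element-wise MF
        mlp = self.mlp(torch.cat([self.user_mlp(users),
                                  self.item_mlp(items)], dim=-1))
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], dim=-1))).squeeze(-1)

model = NeuMF(n_users=100, n_items=500)
users = torch.tensor([0, 1, 2])
items = torch.tensor([10, 20, 30])
probs = model(users, items)   # interaction probabilities in (0, 1)
```

Training it amounts to binary cross-entropy against observed/negative-sampled pairs, which keeps the full script well under the 100-line mark mentioned above.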
Wide & Deep and Two-Tower: Production-Grade Architectures
Wide & Deep
Wide & Deep (Cheng et al., 2016) was deployed by Google on Google Play for over 1 billion users.
- Wide component: A generalized linear model on raw and cross-product features. Handles memorization — "users who installed puzzle games also install strategy games."
- Deep component: A DNN over learned embeddings. Handles generalization — extrapolating from sparse observed patterns to unseen feature combinations.
Both components are trained jointly. Successor models like DeepFM and xDeepFM replace the wide component with factorization machines to automatically learn feature interactions.
Two-Tower (Dual Encoder)
The dominant architecture for candidate retrieval at scale — used by YouTube, Google, and Pinterest. The structure:
- User tower: Encodes user context (interaction history, demographics) into an embedding vector
- Item tower: Encodes item features (title, category, tags) into an embedding vector
- Similarity: Dot product of the two output vectors
The key insight: separate towers allow all item embeddings to be pre-computed offline. At serving time, only the user embedding needs to be computed for the current request, then approximate nearest neighbor search (FAISS, ScaNN) retrieves the top-K items from billions in milliseconds.
Two-Tower also handles item cold start natively: the item tower operates on raw features, requiring no interaction history.
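The serving-time split described above can be sketched with plain numpy: item embeddings are treated as a precomputed matrix, and brute-force dot-product retrieval stands in for the ANN index (a production system would swap the `top_k` function below for FAISS or ScaNN; the shapes and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend outputs of the two towers: item embeddings are precomputed offline.
item_emb = rng.normal(size=(10_000, 64)).astype(np.float32)   # full catalog
user_emb = rng.normal(size=(64,)).astype(np.float32)          # one live request

def top_k(user_emb, item_emb, k=10):
    """Brute-force dot-product retrieval; production replaces this with an ANN index."""
    scores = item_emb @ user_emb
    idx = np.argpartition(-scores, k)[:k]        # unordered top-k candidates
    return idx[np.argsort(-scores[idx])]         # sort just those k

candidates = top_k(user_emb, item_emb)
```

The point of the architecture is that only `user_emb` is computed per request; everything on the item side is a lookup.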
VAE-Based Recommendation
Mult-VAE (Liang et al., 2018) frames recommendation as a generative modeling problem. A user's entire interaction history is treated as a bag of items. An encoder maps this interaction vector into a Gaussian latent space; a decoder reconstructs a probability distribution over the full catalog.
Multinomial likelihood (rather than Gaussian) fits implicit feedback more naturally. KL annealing (the β-VAE approach) prevents posterior collapse during training.
RecVAE (2020) improves on Mult-VAE with a composite prior and alternating encoder-decoder training, reaching state-of-the-art on several implicit feedback benchmarks at the time of publication.
Both models work at the user level: given a user's interaction history, encode it, sample from the latent distribution, decode, and rank items by reconstruction probability. They are particularly strong for top-N recommendation with dense interaction data — theoretically richer than MF and often competitive with heavier neural approaches.
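The two loss terms that make up the Mult-VAE objective — multinomial log-likelihood over the catalog plus an annealed KL penalty — can be written out directly. A toy numpy sketch (the logits, latent dimension, and β value are illustrative):

```python
import numpy as np

def multinomial_nll(logits, x):
    """Multinomial negative log-likelihood over the catalog (Mult-VAE objective)."""
    z = logits - logits.max()                # numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log-softmax over all items
    return -np.sum(x * log_probs)

def gaussian_kl(mu, logvar):
    """KL( q(z|x) || N(0, I) ) for a diagonal Gaussian posterior."""
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

# Toy user: interacted with items 0 and 2 of a 4-item catalog.
x = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, 1.5, -0.5])  # decoder output over the catalog
mu, logvar = np.zeros(8), np.zeros(8)      # latent code at the prior mean

beta = 0.2  # annealed KL weight, ramped up from 0 during training
loss = multinomial_nll(logits, x) + beta * gaussian_kl(mu, logvar)
```

Ranking at inference time is just sorting the decoder's log-probabilities over unseen items; the β schedule only matters during training.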
Session-Based Recommendation: When Users Are Anonymous
Many e-commerce and media platforms have a large fraction of anonymous users — no login, no history. The system must recommend based only on the current session's click sequence.
GRU4Rec (Hidasi et al., 2016) was the first to apply RNNs (specifically GRUs) to this setting. It models item click sequences as a time series and uses ranking loss (BPR or TOP1) rather than cross-entropy — a critical choice for recommendation tasks. It significantly outperformed item-to-item methods at publication.
SASRec (Kang & McAuley, 2018) applies a one-directional transformer (similar to GPT) to item sequences, using only the last N interactions with a left-to-right attention mask. It is faster than GRU4Rec and achieves state-of-the-art on multiple benchmarks. Interestingly, a single attention head consistently outperforms multi-head in practice.
BERT4Rec (Sun et al., 2019) uses a bidirectional transformer with a Cloze task — randomly masking items in a sequence and predicting them. It outperforms SASRec on some datasets but requires significantly more compute.
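The one-directional constraint that distinguishes SASRec from BERT4Rec is just an attention mask. A small numpy sketch of a causal (left-to-right) mask applied before the softmax — sequence length and scores are illustrative:

```python
import numpy as np

N = 6  # model only the user's last N interactions

# Causal mask: True marks future positions that must be hidden.
blocked = np.triu(np.ones((N, N), dtype=bool), k=1)

def causal_softmax(scores, blocked):
    """Softmax over attention scores with future positions masked to -inf."""
    scores = np.where(blocked, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

raw = np.random.default_rng(0).normal(size=(N, N))
weights = causal_softmax(raw, blocked)
# Row i now attends only to positions 0..i; all future weights are exactly 0.
```

BERT4Rec drops this mask and instead hides information by masking input items, which is why it sees both directions but needs the Cloze training objective.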
Cold Start: The First Production Problem
Cold start occurs at three levels:
User cold start: A new user, no history. Solutions:
- Onboarding questions ("Pick 3 genres you enjoy")
- Demographic or contextual features as a proxy
- Populate-then-personalize: serve popular items first, shift to personalized once enough signal accumulates
- Content-based filtering until sufficient interactions exist
Item cold start: A new item, no interactions. Solutions:
- Use item content features to find similar existing items and inherit their embeddings
- Two-Tower: the item tower processes raw features, so new items immediately receive an embedding
- Bandit-based explore-exploit: force some exposure for new items
System cold start: A brand new deployment, no data at all. Solutions:
- Pretrain on public datasets (MovieLens, Amazon Reviews)
- Transfer learning from a related domain
- Rule-based fallback (popular, new, curated lists)
Matrix factorization and NCF struggle most with cold start. Content-based filtering and Two-Tower have a structural advantage here.
Evaluation: Getting It Right
Recommendation is an information retrieval problem. The model returns a ranked list of K items; ground truth is the items the user actually interacted with in the holdout set.
Precision@K: Of the K recommended items, what fraction did the user actually interact with?
Precision@K = |recommended ∩ relevant| / K
Recall@K: Of all items the user interacted with, what fraction appeared in the top-K?
Recall@K = |recommended ∩ relevant| / |relevant|
NDCG@K (Normalized Discounted Cumulative Gain): A position-aware metric. Items ranked higher contribute more to the score. DCG@K = Σ relevance_i / log₂(i+1). NDCG@K normalizes to [0,1] against the ideal ranking. This is the most widely cited metric in research and industry.
Hit Rate@K: Binary — did at least one relevant item appear in the top-K? Averaged across users. Simple and intuitive.
MRR (Mean Reciprocal Rank): mean(1 / rank_of_first_relevant_item). Most useful in leave-one-out evaluation where only one ground-truth item exists per user.
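The four set-based metrics above fit in a few lines each. A binary-relevance sketch (the example lists are illustrative; graded relevance would replace the 1.0 gains in NDCG):

```python
import numpy as np

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommendations the user actually interacted with."""
    return len(set(recommended[:k]) & set(relevant)) / k

def recall_at_k(recommended, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k."""
    return len(set(recommended[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k, normalized against the ideal ranking."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2)            # rank 1 -> log2(2), matching the formula
              for i, item in enumerate(recommended[:k]) if item in rel)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal

recommended = [7, 3, 9, 1, 5]   # model's ranked top-5
relevant = [3, 5, 8]            # holdout interactions
print(precision_at_k(recommended, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(recommended, relevant, 5))     # 2/3
```

In practice each metric is computed per user and averaged over the test set.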
Evaluation protocol matters: Temporal split (train on interactions before time T, test on after T) is more realistic than leave-one-out. Never use random split — it leaks future interactions into training.
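A temporal split is a one-liner once interactions carry timestamps. A toy pandas sketch — the column names and the 80% split point are illustrative:

```python
import pandas as pd

# Toy interaction log: (user, item, timestamp).
log = pd.DataFrame({
    "user": [1, 1, 2, 2, 3],
    "item": [10, 11, 10, 12, 13],
    "ts":   pd.to_datetime(["2025-01-05", "2025-02-01", "2025-01-20",
                            "2025-03-10", "2025-03-15"]),
})

T = log["ts"].quantile(0.8)       # split point: first 80% of history for training
train = log[log["ts"] <= T]
test  = log[log["ts"] > T]

# Everything in test happens strictly after everything the model trains on.
assert train["ts"].max() <= test["ts"].min()
```

A random split would scatter the March interactions into training, letting the model "remember" the future it is later evaluated on.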
Beyond accuracy: production systems also track diversity (avoiding clusters of near-identical items), novelty (recommending items users haven't encountered), catalog coverage (what fraction of items ever gets recommended), and serendipity (surprising but genuinely relevant). High NDCG alongside low coverage is a signal of popularity bias — the model is defaulting to well-known items rather than exploring the catalog.
When to Use Which Approach
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Item-based CF | Simple, interpretable, fast | Cold start, poor scale | Small catalogs, explainability required |
| ALS (MF) | Scalable, implicit feedback, battle-tested | Cold start, no features | Medium-to-large scale, implicit data |
| NCF | Better accuracy than MF, captures non-linearity | Slower training, harder to tune | When MF has plateaued |
| Wide & Deep | Memorization + generalization | Complex feature engineering | Large-scale with rich features, ranking stage |
| Two-Tower | Billion-scale retrieval, handles cold start | Needs ANN infrastructure | Candidate generation at scale |
| Mult-VAE/RecVAE | Strong top-N, principled Bayesian | User-level inference, slow for new users | Dense interaction data |
| GRU4Rec | Sequential, session-aware | RNN slow, needs sequence data | Anonymous sessions, e-commerce clicks |
| SASRec | Faster than RNN, strong benchmarks | Needs sufficient sequence length | Sequential modeling at medium scale |
| Hybrid | Best accuracy | Complex to maintain, adds latency | Production systems where accuracy is critical |
Practical scale heuristic:
- Under 10K users/items: item-based CF or SVD
- 10K–1M: ALS via implicit, NCF in PyTorch
- Over 1M: Two-Tower + FAISS, or managed services (Vertex AI Recommendations, AWS Personalize)
Where to Start
The full landscape above is the foundation. Practically: start with ALS on MovieLens 1M (freely available at grouplens.org), measure NDCG@10 and Recall@20, then try NCF and observe the delta. This is a workflow every ML engineer should run at least once — not to ship a perfect model, but to build genuine intuition about what each algorithm actually trades off.
What's actively emerging in 2024–2026: LLM-based recommendation (prompting language models to predict next items), graph neural networks via LightGCN for sparse datasets, and multi-modal item embeddings integrating image and audio through models like CLIP. Which of these will become the next dominant paradigm in production is still an open question.
Common Mistakes
- Using random splits instead of temporal splits. In recommendation, random splits allow future interactions to leak into training — the model effectively looks into the future. Always split by time: train on interactions before time T, test on after T.
- Optimizing offline metrics without tying them to business KPIs. Strong NDCG on a validation set doesn't guarantee CTR or revenue improvement in production. Offline metrics are a proxy — A/B tests are the only way to confirm that model improvements translate to actual user behavior change.
- Ignoring popularity bias. Without tracking catalog coverage, most models default to recommending the same popular items to nearly everyone. Add diversity, novelty, and coverage to your evaluation pipeline; high NDCG with low coverage is a reliability warning sign.
- Not designing for cold start from the beginning. Building a matrix factorization or NCF system without cold start handling means retrofitting a solution later — significantly more expensive. Cold start is a first-class design requirement, not an afterthought.
- Over-engineering too early. Many teams skip ALS and jump straight to Two-Tower, only to discover that item-based CF would have been sufficient at their scale. Always build the simplest baseline first; increase complexity only when you have clear evidence it's necessary.
Key takeaways:
- Three main paradigms: Collaborative Filtering (from user interaction history), Content-Based (from item features), Hybrid (combining both) — every serious production system is a hybrid in some form
- Choose by scale: under 10K → item-based CF or SVD; 10K–1M → ALS via implicit; over 1M → Two-Tower + FAISS
- Cold start must be a first-class design concern: Two-Tower and content-based have structural advantages; matrix factorization and NCF struggle most
- Evaluate with temporal splits; NDCG@K is the standard metric but needs A/B testing to confirm real business impact
- Always start with a simple ALS baseline — move to complex architectures only when ALS has become the performance ceiling
Sources
- Neural Collaborative Filtering — He et al., 2017 (arXiv:1708.05031)
- Wide & Deep Learning for Recommender Systems — Cheng et al., 2016 (arXiv:1606.07792)
- Variational Autoencoders for Collaborative Filtering — Liang et al., 2018 (arXiv:1802.05814)
- Session-based Recommendations with Recurrent Neural Networks — Hidasi et al., 2016 (arXiv:1511.06939)
- Self-Attentive Sequential Recommendation — Kang & McAuley, 2018 (arXiv:1808.09781)
- BERT4Rec — Sun et al., 2019 (arXiv:1904.06690)