data-science · English · 15 min
Evaluating ML Models the Right Way: From Accuracy to NDCG
January 26, 2026
A fraud detection model hits 99% accuracy while catching zero fraudulent transactions. This guide walks through the full taxonomy of ML evaluation metrics — classification, regression, ranking, clustering, and cross-validation — and gives you a decision framework for choosing the right metric for any business problem.
A fraud detection model scores 99% accuracy. Impressive — until you realize that 99% of all transactions are legitimate, and the model simply predicts "not fraud" every single time. Zero fraud caught. Accuracy: 99%. Practical value: zero.
This is the core problem with evaluation metrics: they are proxies for the truth, not the truth itself. Every metric compresses information, and every compression makes trade-offs. The choice of metric shapes what your model learns to optimize — and if that choice is wrong, the consequences show up in production, not in your notebook.
This guide covers the full taxonomy: classification, regression, ranking, and clustering metrics, followed by cross-validation best practices and a decision framework you can apply to any new project.
The Confusion Matrix: Where Everything Starts
Every binary classification metric derives from four counts. Given a model that predicts positive or negative, there are exactly four outcomes:
- True Positive (TP): predicted positive, actually positive
- True Negative (TN): predicted negative, actually negative
- False Positive (FP) / Type I Error: predicted positive, actually negative — a false alarm
- False Negative (FN) / Type II Error: predicted negative, actually positive — a missed case
Concrete example: 1,000 patients, 100 of whom actually have the disease. Your model flags 120 as positive: 80 of them are genuinely sick (TP), 40 are healthy but flagged (FP). Of the 100 true cases, 20 were missed (FN).
As Bishop's foundational textbook establishes, every classification metric is a function of these four cells. Understand the confusion matrix and every downstream metric clicks into place immediately.
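As a quick check, here is the 1,000-patient example reproduced with scikit-learn's confusion_matrix; the arrays are synthetic, constructed purely to match those counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1,000 patients: 100 truly sick (1), 900 healthy (0)
y_true = np.array([1] * 100 + [0] * 900)
# Model: 80 sick caught, 20 missed; 40 healthy flagged, 860 cleared
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 40 + [0] * 860)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=80 FP=40 FN=20 TN=860
```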
Classification Metrics
Accuracy — and When It Betrays You
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The most intuitive metric: the fraction of predictions that are correct. Also the most commonly misapplied. Accuracy is only meaningful when classes are roughly balanced. On a 99%/1% dataset, a model that always predicts the majority class achieves 99% accuracy while having learned nothing useful.
Precision and Recall — Two Sides of a Trade-off
Precision = TP / (TP + FP)
Of all the samples your model calls positive, what fraction actually are? Precision asks: when I say yes, how often am I right?
Recall (Sensitivity) = TP / (TP + FN)
Of all samples that are genuinely positive, what fraction does your model find? Recall asks: out of all real positives, how many did I catch?
The fundamental tension between these two is inescapable. Raising the classification threshold pushes precision up and recall down. Lowering it does the opposite. As Davis & Goadrich demonstrated at ICML 2006, this trade-off is structurally inherent to binary classification — you cannot maximize both simultaneously.
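A minimal sketch of that tension on toy data: sweeping the threshold over fixed scores pushes precision and recall in opposite directions (the labels and scores are illustrative, not from a real model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_scores >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
# Raising the threshold: precision 0.50 -> 0.60 -> 0.67, recall 1.00 -> 0.75 -> 0.50
```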
Which to prioritize depends entirely on the cost structure of your problem:
| Use Case | Prioritize | Reason |
|---|---|---|
| Cancer screening | Recall | A missed case (FN) can be fatal |
| Spam filter | Precision | Losing a legitimate email (FP) is costly |
| Fraud detection | Recall first, then Precision | Missed fraud is expensive; too many false alerts erode user trust |
| Legal document review | Recall | Cannot afford to miss relevant documents |
| Product recommendations | Precision | Irrelevant suggestions destroy the user experience |
F1-Score and F-Beta
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean of precision and recall. It penalizes extreme imbalance between the two: a model with precision 1.0 and recall 0.1 scores F1 = 0.18, not the 0.55 an arithmetic mean would suggest. Use F1 when you cannot afford to sacrifice either metric.
F-beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
β > 1 weights recall more (medical screening). β < 1 weights precision more (spam filtering).
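A small sketch of the formula in action, using hypothetical precision/recall pairs to show how the harmonic mean punishes imbalance and how β shifts the weighting:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from the formula above; beta > 1 favors recall."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

print(f_beta(1.0, 0.1))             # 0.18: harmonic mean punishes the imbalance
print(f_beta(0.5, 0.9, beta=2.0))   # ~0.78: pulled toward recall (0.9)
print(f_beta(0.5, 0.9, beta=0.5))   # ~0.55: pulled toward precision (0.5)
```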
ROC-AUC — Threshold-Agnostic Discrimination
The ROC curve plots True Positive Rate (Recall) against False Positive Rate (1 - Specificity) across every possible classification threshold. Fawcett (2006) offers the most useful intuition:
- A diagonal line (AUC = 0.5) means the model is no better than random guessing
- AUC = 1.0 is a perfect classifier
- Probabilistic reading: AUC = 0.85 means that if you pick a random positive and a random negative example, the model ranks the positive higher 85% of the time
ROC-AUC is threshold-agnostic and relatively robust to moderate class imbalance.
When to use PR-AUC instead: When the positive class is very rare (fraud, rare disease), ROC-AUC can be misleadingly optimistic — the large number of true negatives dominates the FPR denominator. The area under the Precision-Recall curve (PR-AUC) gives a more honest picture in these cases.
Practical rule: Use ROC-AUC when the negative class matters. Use PR-AUC when only the rare positive class matters.
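A sketch of the divergence on a synthetic imbalanced problem, using scikit-learn's average_precision_score as the usual estimator of PR-AUC (the dataset and model here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# ~1% positive class, mimicking fraud-style imbalance
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))            # typically looks strong
print("PR-AUC :", round(average_precision_score(y_te, proba), 3))  # usually much lower
```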
Log Loss — Evaluating Probability Estimates
Log Loss = -1/N × Σ [y × log(p) + (1 - y) × log(1 - p)]
Log loss penalizes confident wrong predictions severely. A model that says 95% positive and is wrong incurs far more penalty than one that says 60%. Log loss = 0 is perfect; it has no upper bound (higher is worse).
Use log loss when predicted probabilities matter — risk scoring, medical decisions, credit scoring. It is also the native training loss for logistic regression and neural networks, so optimizing it directly aligns training and evaluation objectives.
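A two-line illustration of that asymmetric penalty, using the per-sample loss for a negative example, which reduces to -log(1 - p):

```python
import numpy as np

# Truth is negative; the model predicts "positive" with probability p.
for p in (0.60, 0.95):
    print(f"said {p:.0%} positive, was negative -> loss {-np.log(1 - p):.2f}")
# 60% -> 0.92, 95% -> 3.00: the confident mistake costs over 3x more
```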
Regression Metrics
MAE — Interpretable and Robust
MAE = 1/N × Σ |y_i - ŷ_i|
Mean Absolute Error is the average absolute deviation, in the same units as the target. It is straightforward to explain to stakeholders: "on average, the model's prediction is off by X." Because it uses absolute value rather than squaring, MAE is robust to outliers.
Use MAE when outliers in the data do not reflect costs you care about, and you want equal penalty for all errors.
RMSE — Sensitive to Large Errors
RMSE = √(1/N × Σ (y_i - ŷ_i)²)
Root Mean Squared Error squares errors before averaging, making large mistakes disproportionately costly. As Willmott & Matsuura (2005) analyzed, RMSE is more sensitive to outliers than MAE and is appropriate when large errors have serious consequences — inventory shortfalls, structural failures, financial losses.
RMSE is in the same units as the target (unlike raw MSE), so it remains interpretable.
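A toy illustration of the difference: one wild prediction barely moves MAE but inflates RMSE (values are made up, in the target's own units):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 102.0,  98.0, 101.0,  99.0, 100.0])
y_pred = np.array([101.0, 101.0,  99.0, 100.0, 100.0, 160.0])  # one wild miss

print("MAE :", mean_absolute_error(y_true, y_pred))           # ~10.8
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))   # ~24.5
```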
MAPE — Scale-Free Comparison
MAPE = 100% × 1/N × Σ |y_i - ŷ_i| / |y_i|
MAPE is unit-free, making it useful when comparing model performance across products, markets, or scales. It is the common language of business forecasting teams.
Key limitations: MAPE is undefined when any true value is zero, and it is asymmetric: because each error is divided by the actual value, under-forecasts are capped at 100% error while over-forecasts are unbounded, so MAPE systematically penalizes over-forecasting more heavily. Hyndman & Koehler (2006) recommend MASE (Mean Absolute Scaled Error) for time series, which scales errors relative to a naive baseline rather than the true value.
R² — Fraction of Variance Explained
R² = 1 - SS_res / SS_tot
R² = 1 is a perfect fit. R² = 0 means the model performs no better than predicting the mean. R² < 0 means it performs worse.
Important caveat: a high R² does not guarantee small prediction errors. When data variance is large, R² can look good even when absolute errors are substantial. Always pair R² with RMSE or MAE for scale context.
Adjusted R² = 1 - (1 - R²) × (N-1)/(N-k-1) penalizes adding features that do not genuinely help — use it when comparing models with different feature counts.
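A sketch of the caveat above: on a high-variance target (hypothetical house prices), R² can look excellent while the average miss is still large:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(0)
y_true = rng.uniform(100_000, 1_000_000, size=500)  # wide-ranging target
y_pred = y_true + rng.normal(0, 50_000, size=500)   # misses by ~40k on average

print("R² :", round(r2_score(y_true, y_pred), 3))          # ~0.96, looks excellent
print("MAE:", round(mean_absolute_error(y_true, y_pred)))  # ~40,000, still painful
```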
Ranking Metrics
Search engines and recommendation systems don't just predict a binary label — they rank results. The question shifts from "is this prediction correct?" to "is this ordering good?"
MRR — Mean Reciprocal Rank
For each query, find the rank of the first relevant result. Score = 1/rank. Average across all queries.
Example: the correct answer appears at rank 3 → score = 1/3 = 0.333.
Use MRR when only the first relevant result matters: Q&A systems, entity lookup, single-answer retrieval.
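A minimal MRR implementation under the definition above; the rank list is hypothetical, with None standing in for queries where nothing relevant was found:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-indexed; None means no relevant result found (counts as 0)."""
    return sum(1 / r for r in first_relevant_ranks if r) / len(first_relevant_ranks)

print(mean_reciprocal_rank([1, 3, 2, None]))  # (1 + 1/3 + 1/2 + 0) / 4 ≈ 0.458
```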
MAP — Mean Average Precision
For each query with multiple relevant documents, compute AP = the average of precision values at each position where a relevant document appears. MAP is the average AP across all queries.
Use MAP when there are multiple relevant results per query and relevance is binary (relevant or not).
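A sketch of AP and MAP from that definition; the binary relevance lists are illustrative (1 = relevant at that rank), and this version divides by the number of relevant documents that appear in the list:

```python
def average_precision(relevance):
    """relevance: binary list ordered by rank, e.g. [1, 0, 1, 0]."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i   # precision@i, sampled at each relevant position
    return total / hits if hits else 0.0

queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(sum(average_precision(q) for q in queries) / len(queries))  # MAP ≈ 0.667
```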
NDCG — Normalized Discounted Cumulative Gain
DCG = Σ (2^rel_i - 1) / log₂(i + 1), where rel_i is the relevance score at position i.
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (perfect) ranking.
Järvelin & Kekäläinen (2002) designed NDCG to handle graded relevance (not just binary 0/1) and to discount the value of lower positions — a relevant result at rank 1 contributes far more than at rank 10. NDCG@10 is the standard for search and recommendation systems in industry.
| Metric | Relevance Type | Multiple Relevant? | Position-Sensitive? | Best For |
|---|---|---|---|---|
| MRR | Binary | No | First result only | Q&A, entity lookup |
| MAP | Binary | Yes | Yes | Multi-doc retrieval |
| NDCG | Graded | Yes | All positions | Recommenders, search |
Clustering Metrics
Clustering evaluation differs from supervised learning because you often lack ground truth labels. Two families of metrics handle this.
Internal Metrics (no labels required)
Silhouette Score: For each point i, compute a(i) = mean distance to points in the same cluster, b(i) = mean distance to the nearest other cluster. s(i) = (b(i) - a(i)) / max(a(i), b(i)).
Values range from -1 to 1. Close to 1: the point is well-matched to its cluster. Near 0: on the cluster boundary. Negative: likely misclustered. Rousseeuw (1987) introduced this to aid K selection and algorithm comparison.
Davies-Bouldin Index: Lower is better. Measures the average ratio of within-cluster scatter to between-cluster distance. Computationally cheaper than Silhouette on large datasets.
External Metrics (require ground truth labels)
Adjusted Rand Index (ARI): Corrects the Rand Index for chance. Range [-1, 1]; 1 = perfect agreement, 0 = random. Always prefer ARI over the raw Rand Index, which is biased toward 1 at high K.
Normalized Mutual Information (NMI): Measures shared information between predicted clusters and true labels, normalized to [0, 1]. Robust to differences in the number of clusters.
In practice: Use Silhouette and Davies-Bouldin for hyperparameter tuning (selecting K). Use ARI and NMI when you have a validation label set — for example, document categories or customer segments with known ground truth on a subset.
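Both families on one synthetic dataset, a sketch assuming K-Means with the true K (make_blobs provides ground-truth labels for the external metrics):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal: only the data and the assignments
print("Silhouette    :", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))   # lower is better
# External: compare against the known generating labels
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```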
Cross-Validation: Honest Generalization Estimates
A single train/test split is fast but fragile — the result depends heavily on which samples ended up in which partition. Cross-validation averages over multiple splits to produce a more reliable estimate.
K-Fold Cross-Validation
Split data into K folds. Train on K-1, test on the remaining one. Rotate until every fold has been used as the test set once. Average the K scores.
K = 5 or 10 is standard. Kohavi (1995) showed K = 10 generally gives the best bias-variance trade-off. Use when data is large enough (1,000+ rows) and has no temporal or group structure.
Stratified K-Fold
Each fold preserves the original class ratio. This is mandatory for imbalanced datasets — a fold with no positive examples produces a meaningless recall score. Always use stratified K-fold for classification problems.
Time-Series Split (Walk-Forward Validation)
Never shuffle time-series data. Future data must not appear in the training set for any past fold. Walk-forward validation trains on [0, t] and tests on [t+1, t+w], expanding the training window each iteration.
scikit-learn's TimeSeriesSplit implements this correctly. Many practitioners still use standard K-fold on time-series data, inadvertently leaking future information into training and producing over-optimistic results.
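A quick demonstration of the expanding window on 12 time-ordered rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in time order
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train={train_idx.tolist()}  test={test_idx.tolist()}")
# train=[0, 1, 2]            test=[3, 4, 5]
# train=[0, 1, 2, 3, 4, 5]   test=[6, 7, 8]
# train=[0, 1, ..., 8]       test=[9, 10, 11]
```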
Group K-Fold
Prevents data from the same entity — same user, same hospital, same subject — from appearing in both train and test. Critical for medical datasets and user-level models where the same entity contributes multiple rows.
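A sketch with hypothetical user_ids showing that GroupKFold never splits an entity across train and test:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
user_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # two rows per user

for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=user_ids):
    print("test users:", sorted(set(user_ids[test_idx])))
# Each user's rows land entirely in train or entirely in test, never both.
```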
Six Common Evaluation Mistakes
Mistake 1: Accuracy on imbalanced data
A model that always predicts the majority class on a 99%/1% dataset achieves 99% accuracy while catching zero positive cases. Fix: use F1, PR-AUC, or recall at a fixed precision threshold.
Mistake 2: Leaky preprocessing in cross-validation
Fitting StandardScaler on the full dataset before splitting introduces leakage — the test fold has already influenced the scaling parameters. The fix is to fit the scaler inside each training fold only. Use scikit-learn's Pipeline to enforce this automatically:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling happens inside each CV training fold, never on the full dataset
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# X, y: your feature matrix and labels
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
```
Mistake 3: Target leakage
A feature is a proxy for the label — for example, diagnosis_code in a model predicting disease_present. The model learns the answer from the feature rather than from genuinely predictive patterns.
Warning signs: AUC above 0.95 on a hard problem, or feature importance concentrated in a single feature. Kaufman et al. (2012) and Kapoor & Narayanan (2023) document this as one of the primary drivers of the ML reproducibility crisis.
Mistake 4: Evaluating on the wrong distribution
Training on 2019–2021 data and testing on 2022 looks fine, but deploying in 2023 after a distribution shift reveals the model fails in production despite strong CV scores. Fix: hold out a temporal test set. Monitor production metrics continuously.
Mistake 5: Single-metric fixation
Reporting only AUC without threshold-specific metrics. In deployment you operate at one threshold, not across all thresholds. Fix: always report AUC alongside precision and recall at your operating threshold, plus the confusion matrix.
Mistake 6: Not testing the metric itself
Log loss computed on mislabeled samples, NDCG calculated with incorrectly constructed relevance judgments. Sanity-check the metric on known edge cases before training.
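A few cheap edge-case checks of the kind that catch broken evaluation code, using inputs whose correct outputs are known in advance:

```python
from sklearn.metrics import log_loss, ndcg_score, roc_auc_score

# Perfectly separated scores must give AUC of exactly 1.0
assert roc_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]) == 1.0
# A near-certain correct probability must give near-zero log loss
assert log_loss([1], [[0.0001, 0.9999]], labels=[0, 1]) < 0.01
# Scoring the ideal ordering against itself must give NDCG of 1.0
assert ndcg_score([[3, 2, 1]], [[3, 2, 1]]) == 1.0
print("metric sanity checks passed")
```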
How to Choose the Right Metric
A five-step decision framework you can apply to any project:
Step 1 — Identify the task type:
- Binary classification → F1, ROC-AUC, PR-AUC, Log Loss
- Multi-class classification → macro/weighted F1, Accuracy (if balanced)
- Regression → RMSE (outlier-sensitive), MAE (robust), MAPE (scale-free), R²
- Ranking/retrieval → NDCG@K, MAP, MRR
- Clustering → Silhouette (no labels), ARI/NMI (with labels)
Step 2 — Identify error asymmetry:
- FP more costly than FN → prioritize Precision
- FN more costly than FP → prioritize Recall
- Symmetric costs → F1, Accuracy (if balanced), RMSE
Step 3 — Check class balance:
- Balanced → Accuracy + macro F1
- Imbalanced → PR-AUC + weighted F1 or recall at fixed precision
Step 4 — Do probabilities matter?
- Yes (risk scoring, medical) → Log Loss + calibration plot
- No (only the label matters) → Accuracy, F1
Step 5 — Connect to business KPI:
| Problem | Primary Metric | Maps to Business KPI |
|---|---|---|
| COVID screening | Recall ≥ 95% | Case detection rate |
| Credit card fraud | Recall at Precision ≥ 50% | Investigation cost per missed fraud |
| E-commerce recommender | NDCG@10 | Click-through rate (A/B validated) |
| House price prediction | RMSE + MAPE | Valuation error cost |
| Customer churn | F1 + ROC-AUC at 70% recall | Retention team capacity |
Metrics are proxies, not ground truth. AUC can be high while production performance is poor. Accuracy can look strong while the model is completely useless for its actual purpose. The question that cuts through all of this is always the same: does this number reflect what actually matters to the end user and the business?
Once you can answer that cleanly — and trace a line from the metric on your evaluation report to a real consequence in the world — you are evaluating models the right way.
Key takeaways:
- Every metric compresses information with trade-offs — choosing the right metric must start from the business problem, not the algorithm
- Accuracy is only meaningful with balanced classes; for imbalanced data use F1, PR-AUC, or recall at fixed precision
- For ranking/recommendation: NDCG@K is the standard; for regression: combine RMSE (penalizes large errors) with MAE (robust to outliers)
- Data leakage and target leakage are the most common reasons metrics look great in validation but fail in production
- Always map your technical metric to a specific business KPI — if you can't make that connection, the metric has no actionable meaning
Sources
- Pattern Recognition and Machine Learning — Bishop (2006)
- Introduction to Information Retrieval — Manning, Raghavan & Schütze (2008)
- The Relationship Between Precision-Recall and ROC Curves — Davis & Goadrich (ICML 2006)
- An Introduction to ROC Analysis — Fawcett (2006)
- Another look at measures of forecast accuracy — Hyndman & Koehler (2006)
- Leakage in data mining — Kaufman et al. (ACM TKDD, 2012)
- Leakage and the Reproducibility Crisis in ML-based Science — Kapoor & Narayanan (Patterns, 2023)
- Cumulated gain-based evaluation of IR techniques (NDCG) — Järvelin & Kekäläinen (ACM TOIS, 2002)
- A study of cross-validation and bootstrap — Kohavi (IJCAI, 1995)
- Data Science for Business — Provost & Fawcett (O'Reilly, 2013)
- Silhouettes — Rousseeuw (1987)
- scikit-learn User Guide — Model evaluation