data-science · English · 15 min
Evaluating ML Models the Right Way: From Accuracy to NDCG
January 26, 2026
A fraud detection model hits 99% accuracy while catching zero fraudulent transactions. This guide walks through the full taxonomy of ML evaluation metrics — classification, regression, ranking, clustering, and cross-validation — and gives you a decision framework for choosing the right metric for any business problem.
A fraud detection model scores 99% accuracy. Impressive — until you realize that 99% of all transactions are legitimate, and the model simply predicts "not fraud" every single time. Zero fraud caught. Accuracy: 99%. Practical value: zero.
This is the core problem with evaluation metrics: they are proxies for the truth, not the truth itself. Every metric compresses information, and every compression makes trade-offs. The choice of metric shapes what your model learns to optimize — and if that choice is wrong, the consequences show up in production, not in your notebook.
This guide covers the full taxonomy: classification, regression, ranking, and clustering metrics, followed by cross-validation best practices and a decision framework you can apply to any new project.
The Confusion Matrix: Where Everything Starts
Every binary classification metric derives from four counts. Given a model that predicts positive or negative, there are exactly four outcomes:
- True Positive (TP): predicted positive, actually positive
- True Negative (TN): predicted negative, actually negative
- False Positive (FP) / Type I Error: predicted positive, actually negative — a false alarm
- False Negative (FN) / Type II Error: predicted negative, actually positive — a missed case
Concrete example: 1,000 patients, 100 of whom actually have the disease. Your model flags 120 as positive: 80 of them are genuinely sick (TP), 40 are healthy but flagged (FP). Of the 100 true cases, 20 were missed (FN).
As Bishop's foundational textbook establishes, every classification metric is a function of these four cells. Understand the confusion matrix and every downstream metric clicks into place immediately.
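As a quick check, here is the 1,000-patient example reproduced with scikit-learn's confusion_matrix; the arrays are synthetic, constructed purely to match those counts:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1,000 patients: 100 truly sick (1), 900 healthy (0)
y_true = np.array([1] * 100 + [0] * 900)
# Model: 80 sick caught, 20 missed; 40 healthy flagged, 860 cleared
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 40 + [0] * 860)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=80 FP=40 FN=20 TN=860
```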
Classification Metrics
Accuracy — and When It Betrays You
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The most intuitive metric: the fraction of predictions that are correct. Also the most commonly misapplied. Accuracy is only meaningful when classes are roughly balanced. On a 99%/1% dataset, a model that always predicts the majority class achieves 99% accuracy while having learned nothing useful.
Precision and Recall — Two Sides of a Trade-off
Precision = TP / (TP + FP)
Of all the samples your model calls positive, what fraction actually are? Precision asks: when I say yes, how often am I right?
Recall (Sensitivity) = TP / (TP + FN)
Of all samples that are genuinely positive, what fraction does your model find? Recall asks: out of all real positives, how many did I catch?
The fundamental tension between these two is inescapable. Raising the classification threshold pushes precision up and recall down. Lowering it does the opposite. As Davis & Goadrich demonstrated at ICML 2006, this trade-off is structurally inherent to binary classification — you cannot maximize both simultaneously.
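A minimal sketch of that tension on toy data: sweeping the threshold over fixed scores pushes precision and recall in opposite directions (the labels and scores are illustrative, not from a real model):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])
y_scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_scores >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_pred):.2f}  "
          f"recall={recall_score(y_true, y_pred):.2f}")
# Raising the threshold: precision 0.50 -> 0.60 -> 0.67, recall 1.00 -> 0.75 -> 0.50
```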
Which to prioritize depends entirely on the cost structure of your problem:
| Use Case | Prioritize | Reason |
|---|---|---|
| Cancer screening | Recall | A missed case (FN) can be fatal |
| Spam filter | Precision | Losing a legitimate email (FP) is costly |
| Fraud detection | Recall first, then Precision | Missed fraud is expensive; too many false alerts erode user trust |
| Legal document review | Recall | Cannot afford to miss relevant documents |
| Product recommendations | Precision | Irrelevant suggestions destroy the user experience |
F1-Score and F-Beta
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean of precision and recall. It penalizes extreme imbalance between the two: a model with precision 1.0 and recall 0.1 scores F1 = 0.18, not the 0.55 an arithmetic mean would suggest. Use F1 when you cannot afford to sacrifice either metric.
F-beta = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
β > 1 weights recall more (medical screening). β < 1 weights precision more (spam filtering).
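A small sketch of the formula in action, using hypothetical precision/recall pairs to show how the harmonic mean punishes imbalance and how β shifts the weighting:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from the formula above; beta > 1 favors recall."""
    denom = beta**2 * precision + recall
    return (1 + beta**2) * precision * recall / denom if denom else 0.0

print(f_beta(1.0, 0.1))             # 0.18: harmonic mean punishes the imbalance
print(f_beta(0.5, 0.9, beta=2.0))   # ~0.78: pulled toward recall (0.9)
print(f_beta(0.5, 0.9, beta=0.5))   # ~0.55: pulled toward precision (0.5)
```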
ROC-AUC — Threshold-Agnostic Discrimination
The ROC curve plots True Positive Rate (Recall) against False Positive Rate (1 - Specificity) across every possible classification threshold. Fawcett (2006) offers the most useful intuition:
- A diagonal line (AUC = 0.5) means the model is no better than random guessing
- AUC = 1.0 is a perfect classifier
- Probabilistic reading: AUC = 0.85 means that if you pick a random positive and a random negative example, the model ranks the positive higher 85% of the time
ROC-AUC is threshold-agnostic and relatively robust to moderate class imbalance.
When to use PR-AUC instead: When the positive class is very rare (fraud, rare disease), ROC-AUC can be misleadingly optimistic — the large number of true negatives dominates the FPR denominator. The area under the Precision-Recall curve (PR-AUC) gives a more honest picture in these cases.
Practical rule: Use ROC-AUC when the negative class matters. Use PR-AUC when only the rare positive class matters.
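A sketch of the divergence on a synthetic imbalanced problem, using scikit-learn's average_precision_score as the usual estimator of PR-AUC (the dataset and model here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# ~1% positive class, mimicking fraud-style imbalance
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))            # typically looks strong
print("PR-AUC :", round(average_precision_score(y_te, proba), 3))  # usually much lower
```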
Log Loss — Evaluating Probability Estimates
Log Loss = -1/N × Σ [y × log(p) + (1 - y) × log(1 - p)]
Log loss penalizes confident wrong predictions severely. A model that says 95% positive and is wrong incurs far more penalty than one that says 60%. Log loss = 0 is perfect; it has no upper bound (higher is worse).
Use log loss when predicted probabilities matter — risk scoring, medical decisions, credit scoring. It is also the native training loss for logistic regression and neural networks, so optimizing it directly aligns training and evaluation objectives.
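A two-line illustration of that asymmetric penalty, using the per-sample loss for a negative example, which reduces to -log(1 - p):

```python
import numpy as np

# Truth is negative; the model predicts "positive" with probability p.
for p in (0.60, 0.95):
    print(f"said {p:.0%} positive, was negative -> loss {-np.log(1 - p):.2f}")
# 60% -> 0.92, 95% -> 3.00: the confident mistake costs over 3x more
```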
Regression Metrics
MAE — Interpretable and Robust
MAE = 1/N × Σ |y_i - ŷ_i|
Mean Absolute Error is the average absolute deviation, in the same units as the target. It is straightforward to explain to stakeholders: "on average, the model's prediction is off by X." Because it uses absolute value rather than squaring, MAE is robust to outliers.
Use MAE when outliers in the data do not reflect costs you care about, and you want equal penalty for all errors.
RMSE — Sensitive to Large Errors
RMSE = √(1/N × Σ (y_i - ŷ_i)²)
Root Mean Squared Error squares errors before averaging, making large mistakes disproportionately costly. As Willmott & Matsuura (2005) analyzed, RMSE is more sensitive to outliers than MAE and is appropriate when large errors have serious consequences — inventory shortfalls, structural failures, financial losses.
RMSE is in the same units as the target (unlike raw MSE), so it remains interpretable.
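A toy illustration of the difference: one wild prediction barely moves MAE but inflates RMSE (values are made up, in the target's own units):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 102.0,  98.0, 101.0,  99.0, 100.0])
y_pred = np.array([101.0, 101.0,  99.0, 100.0, 100.0, 160.0])  # one wild miss

print("MAE :", mean_absolute_error(y_true, y_pred))           # ~10.8
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))   # ~24.5
```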
MAPE — Scale-Free Comparison
MAPE = 100% × 1/N × Σ |y_i - ŷ_i| / |y_i|
MAPE is unit-free, making it useful when comparing model performance across products, markets, or scales. It is the common language of business forecasting teams.
Key limitations: MAPE is undefined when any true value is zero, and it is asymmetric: because each error is divided by the actual value, under-forecasts are capped at 100% error while over-forecasts are unbounded, so MAPE systematically penalizes over-forecasting more heavily. Hyndman & Koehler (2006) recommend MASE (Mean Absolute Scaled Error) for time series, which scales errors relative to a naive baseline rather than the true value.
R² — Fraction of Variance Explained
R² = 1 - SS_res / SS_tot
R² = 1 is a perfect fit. R² = 0 means the model performs no better than predicting the mean. R² < 0 means it performs worse.
Important caveat: a high R² does not guarantee small prediction errors. When data variance is large, R² can look good even when absolute errors are substantial. Always pair R² with RMSE or MAE for scale context.
Adjusted R² = 1 - (1 - R²) × (N-1)/(N-k-1) penalizes adding features that do not genuinely help — use it when comparing models with different feature counts.
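A sketch of the caveat above: on a high-variance target (hypothetical house prices), R² can look excellent while the average miss is still large:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

rng = np.random.default_rng(0)
y_true = rng.uniform(100_000, 1_000_000, size=500)  # wide-ranging target
y_pred = y_true + rng.normal(0, 50_000, size=500)   # misses by ~40k on average

print("R² :", round(r2_score(y_true, y_pred), 3))          # ~0.96, looks excellent
print("MAE:", round(mean_absolute_error(y_true, y_pred)))  # ~40,000, still painful
```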
Ranking Metrics
Search engines and recommendation systems don't just predict a binary label — they rank results. The question shifts from "is this prediction correct?" to "is this ordering good?"
MRR — Mean Reciprocal Rank
For each query, find the rank of the first relevant result. Score = 1/rank. Average across all queries.
Example: the correct answer appears at rank 3 → score = 1/3 = 0.333.
Use MRR when only the first relevant result matters: Q&A systems, entity lookup, single-answer retrieval.
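A minimal MRR implementation under the definition above; the rank list is hypothetical, with None standing in for queries where nothing relevant was found:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-indexed; None means no relevant result found (counts as 0)."""
    return sum(1 / r for r in first_relevant_ranks if r) / len(first_relevant_ranks)

print(mean_reciprocal_rank([1, 3, 2, None]))  # (1 + 1/3 + 1/2 + 0) / 4 ≈ 0.458
```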
MAP — Mean Average Precision
For each query with multiple relevant documents, compute AP = the average of precision values at each position where a relevant document appears. MAP is the average AP across all queries.
Use MAP when there are multiple relevant results per query and relevance is binary (relevant or not).
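A sketch of AP and MAP from that definition; the binary relevance lists are illustrative (1 = relevant at that rank), and this version divides by the number of relevant documents that appear in the list:

```python
def average_precision(relevance):
    """relevance: binary list ordered by rank, e.g. [1, 0, 1, 0]."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i   # precision@i, sampled at each relevant position
    return total / hits if hits else 0.0

queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
print(sum(average_precision(q) for q in queries) / len(queries))  # MAP ≈ 0.667
```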
NDCG — Normalized Discounted Cumulative Gain
DCG = Σ (2^rel_i - 1) / log₂(i + 1), where rel_i is the relevance score at position i.
NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (perfect) ranking.
Järvelin & Kekäläinen (2002) designed NDCG to handle graded relevance (not just binary 0/1) and to discount the value of lower positions — a relevant result at rank 1 contributes far more than at rank 10. NDCG@10 is the standard for search and recommendation systems in industry.
| Metric | Relevance Type | Multiple Relevant? | Position-Sensitive? | Best For |
|---|---|---|---|---|
| MRR | Binary | No | First result only | Q&A, entity lookup |
| MAP | Binary | Yes | Yes | Multi-doc retrieval |
| NDCG | Graded | Yes | All positions | Recommenders, search |
Clustering Metrics
Clustering evaluation differs from supervised learning because you often lack ground truth labels. Two families of metrics handle this.
Internal Metrics (no labels required)
Silhouette Score: For each point i, compute a(i) = mean distance to points in the same cluster, b(i) = mean distance to the nearest other cluster. s(i) = (b(i) - a(i)) / max(a(i), b(i)).
Values range from -1 to 1. Close to 1: the point is well-matched to its cluster. Near 0: on the cluster boundary. Negative: likely misclustered. Rousseeuw (1987) introduced this to aid K selection and algorithm comparison.
Davies-Bouldin Index: Lower is better. Measures the average ratio of within-cluster scatter to between-cluster distance. Computationally cheaper than Silhouette on large datasets.
External Metrics (require ground truth labels)
Adjusted Rand Index (ARI): Corrects the Rand Index for chance. Range [-1, 1]; 1 = perfect agreement, 0 = random. Always prefer ARI over the raw Rand Index, which is biased toward 1 at high K.
Normalized Mutual Information (NMI): Measures shared information between predicted clusters and true labels, normalized to [0, 1]. Robust to differences in the number of clusters.
In practice: Use Silhouette and Davies-Bouldin for hyperparameter tuning (selecting K). Use ARI and NMI when you have a validation label set — for example, document categories or customer segments with known ground truth on a subset.
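Both families on one synthetic dataset, a sketch assuming K-Means with the true K (make_blobs provides ground-truth labels for the external metrics):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

X, y_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal: only the data and the assignments
print("Silhouette    :", silhouette_score(X, labels))
print("Davies-Bouldin:", davies_bouldin_score(X, labels))   # lower is better
# External: compare against the known generating labels
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```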
Cross-Validation: Honest Generalization Estimates
A single train/test split is fast but fragile — the result depends heavily on which samples ended up in which partition. Cross-validation averages over multiple splits to produce a more reliable estimate.
K-Fold Cross-Validation
Split data into K folds. Train on K-1, test on the remaining one. Rotate until every fold has been used as the test set once. Average the K scores.
K = 5 or 10 is standard. Kohavi (1995) showed K = 10 generally gives the best bias-variance trade-off. Use when data is large enough (1,000+ rows) and has no temporal or group structure.
Stratified K-Fold
Each fold preserves the original class ratio. This is mandatory for imbalanced datasets — a fold with no positive examples produces a meaningless recall score. Always use stratified K-fold for classification problems.
Time-Series Split (Walk-Forward Validation)
Never shuffle time-series data. Future data must not appear in the training set for any past fold. Walk-forward validation trains on [0, t] and tests on [t+1, t+w], expanding the training window each iteration.
scikit-learn's TimeSeriesSplit implements this correctly. Many practitioners still use standard K-fold on time-series data, inadvertently leaking future information into training and producing over-optimistic results.
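A quick demonstration of the expanding window on 12 time-ordered rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations in time order
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train={train_idx.tolist()}  test={test_idx.tolist()}")
# train=[0, 1, 2]            test=[3, 4, 5]
# train=[0, 1, 2, 3, 4, 5]   test=[6, 7, 8]
# train=[0, 1, ..., 8]       test=[9, 10, 11]
```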
Group K-Fold
Prevents data from the same entity — same user, same hospital, same subject — from appearing in both train and test. Critical for medical datasets and user-level models where the same entity contributes multiple rows.
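A sketch with hypothetical user_ids showing that GroupKFold never splits an entity across train and test:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
user_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # two rows per user

for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=user_ids):
    print("test users:", sorted(set(user_ids[test_idx])))
# Each user's rows land entirely in train or entirely in test, never both.
```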
Six Common Evaluation Mistakes
Mistake 1: Accuracy on imbalanced data
A model that always predicts the majority class on a 99%/1% dataset achieves 99% accuracy while catching zero positive cases. Fix: use F1, PR-AUC, or recall at a fixed precision threshold.
Mistake 2: Leaky preprocessing in cross-validation
Fitting StandardScaler on the full dataset before splitting introduces leakage — the test fold has already influenced the scaling parameters. The fix is to fit the scaler inside each training fold only. Use scikit-learn's Pipeline to enforce this automatically:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Scaling happens inside each CV training fold, never on the full dataset
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])

# X, y: your feature matrix and labels
scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1')
```
Mistake 3: Target leakage
A feature is a proxy for the label — for example, diagnosis_code in a model predicting disease_present. The model learns the answer from the feature rather than from genuinely predictive patterns.
Warning signs: AUC above 0.95 on a hard problem, or feature importance concentrated in a single feature. Kaufman et al. (2012) and Kapoor & Narayanan (2023) document this as one of the primary drivers of the ML reproducibility crisis.
Mistake 4: Evaluating on the wrong distribution
Training on 2019–2021 data and testing on 2022 looks fine, but deploying in 2023 after a distribution shift reveals the model fails in production despite strong CV scores. Fix: hold out a temporal test set. Monitor production metrics continuously.
Mistake 5: Single-metric fixation
Reporting only AUC without threshold-specific metrics. In deployment you operate at one threshold, not across all thresholds. Fix: always report AUC alongside precision and recall at your operating threshold, plus the confusion matrix.
Mistake 6: Not testing the metric itself
Log loss computed on mislabeled samples, NDCG calculated with incorrectly constructed relevance judgments. Sanity-check the metric on known edge cases before training.
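A few cheap edge-case checks of the kind that catch broken evaluation code, using inputs whose correct outputs are known in advance:

```python
from sklearn.metrics import log_loss, ndcg_score, roc_auc_score

# Perfectly separated scores must give AUC of exactly 1.0
assert roc_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]) == 1.0
# A near-certain correct probability must give near-zero log loss
assert log_loss([1], [[0.0001, 0.9999]], labels=[0, 1]) < 0.01
# Scoring the ideal ordering against itself must give NDCG of 1.0
assert ndcg_score([[3, 2, 1]], [[3, 2, 1]]) == 1.0
print("metric sanity checks passed")
```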
How to Choose the Right Metric
A five-step decision framework you can apply to any project:
Step 1 — Identify the task type:
- Binary classification → F1, ROC-AUC, PR-AUC, Log Loss
- Multi-class classification → macro/weighted F1, Accuracy (if balanced)
- Regression → RMSE (outlier-sensitive), MAE (robust), MAPE (scale-free), R²
- Ranking/retrieval → NDCG@K, MAP, MRR
- Clustering → Silhouette (no labels), ARI/NMI (with labels)
Step 2 — Identify error asymmetry:
- FP more costly than FN → prioritize Precision
- FN more costly than FP → prioritize Recall
- Symmetric costs → F1, Accuracy (if balanced), RMSE
Step 3 — Check class balance:
- Balanced → Accuracy + macro F1
- Imbalanced → PR-AUC + weighted F1 or recall at fixed precision
Step 4 — Do probabilities matter?
- Yes (risk scoring, medical) → Log Loss + calibration plot
- No (only the label matters) → Accuracy, F1
Step 5 — Connect to business KPI:
| Problem | Primary Metric | Maps to Business KPI |
|---|---|---|
| COVID screening | Recall ≥ 95% | Case detection rate |
| Credit card fraud | Recall at Precision ≥ 50% | Investigation cost per missed fraud |
| E-commerce recommender | NDCG@10 | Click-through rate (A/B validated) |
| House price prediction | RMSE + MAPE | Valuation error cost |
| Customer churn | F1 + ROC-AUC at 70% recall | Retention team capacity |
Metrics are proxies, not ground truth. AUC can be high while production performance is poor. Accuracy can look strong while the model is completely useless for its actual purpose. The question that cuts through all of this is always the same: does this number reflect what actually matters to the end user and the business?
Once you can answer that cleanly — and trace a line from the metric on your evaluation report to a real consequence in the world — you are evaluating models the right way.
Key takeaways:
- Every metric compresses information with trade-offs — choosing the right metric must start from the business problem, not the algorithm
- Accuracy is only meaningful with balanced classes; for imbalanced data use F1, PR-AUC, or recall at fixed precision
- For ranking/recommendation: NDCG@K is the standard; for regression: combine RMSE (penalizes large errors) with MAE (robust to outliers)
- Data leakage and target leakage are the most common reasons metrics look great in validation but fail in production
- Always map your technical metric to a specific business KPI — if you can't make that connection, the metric has no actionable meaning
Sources
- Pattern Recognition and Machine Learning — Bishop (2006)
- Introduction to Information Retrieval — Manning, Raghavan & Schütze (2008)
- The Relationship Between Precision-Recall and ROC Curves — Davis & Goadrich (ICML 2006)
- An Introduction to ROC Analysis — Fawcett (2006)
- Another look at measures of forecast accuracy — Hyndman & Koehler (2006)
- Leakage in data mining — Kaufman et al. (ACM TKDD, 2012)
- Leakage and the Reproducibility Crisis in ML-based Science — Kapoor & Narayanan (Patterns, 2023)
- Cumulated gain-based evaluation of IR techniques (NDCG) — Järvelin & Kekäläinen (ACM TOIS, 2002)
- A study of cross-validation and bootstrap — Kohavi (IJCAI, 1995)
- Data Science for Business — Provost & Fawcett (O'Reilly, 2013)
- Silhouettes — Rousseeuw (1987)
- scikit-learn User Guide — Model evaluation