

SHAP Is Not Just for Explanation — It's for Debugging Your Model

April 16, 2026

In this sales forecasting project, SHAP wasn't just a tool for explaining model decisions to stakeholders — it was a microscope for finding specific model failures. Weather features contributed only 0.4% of predictive power. Store-level features sometimes failed to rein in overprediction for strong items in weak stores. SHAP caught both.

Have you ever added a group of features to a model, watched the accuracy tick slightly upward, and called it done? Take weather data in a retail sales forecasting pipeline: the intuition is sound — rain affects foot traffic, temperature affects purchasing behavior — so including it feels responsible. Most practitioners use SHAP the same way: generate a summary bar chart, put it in the stakeholder deck, and move on. But SHAP can do something much harder to replicate with any other tool — it can show you precisely where your model is getting things wrong. In my sales_forecasting_xai project, SHAP surfaced two specific failure modes that aggregate prediction metrics never would have caught.


What the Project Does

Some context before we get into the SHAP findings.

This is a proof-of-concept system for store-level, item-specific daily sales forecasting. The dataset spans two years of retail sales (2016–2017) across multiple provinces and stores, enriched with weather signals. The technical stack: LightGBM as the core model, Optuna for hyperparameter tuning, SHAP TreeExplainer for explainability, and a Streamlit interface so non-technical stakeholders can interact with forecasts directly.

The project's structure reflects a deliberate philosophy: interpretability is not an afterthought. The docs/ directory holds two structured Markdown reports — one defining the PoC scope and methodology, one summarizing SHAP findings with quantified numbers. Alongside five sequentially-numbered notebooks covering the full analytical pipeline, there's a src/ package with separated modules (data_loader, ui_builder, ui_predictor) — because notebooks alone don't scale to handoff.


Feature Engineering: Multiple Time Horizons

To understand the SHAP output, you first need to see how features were constructed.

I engineered lag features across five temporal resolutions: 1, 7, 14, 21, and 28 days. Each window captures a different demand signal:

Lag window      Signal captured
1 day           Short-term volatility, sudden spikes
7 days          Weekly cycles (weekends vs. weekdays)
14–21 days      Mid-term trends
28 days         Monthly periodicity, longer trend signals

This multi-scale design was a deliberate decision, documented in the PoC specification — not an ad hoc addition discovered mid-notebook. Rolling averages computed over these windows, combined with item-level and store-level features, form the core feature space. Weather features (temperature, precipitation, etc.) were added as a separate group to test the hypothesis that meteorological signals matter at store-day granularity.
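
To make that concrete, here is a minimal pandas sketch of multi-horizon lag and rolling-mean features. The schema (a daily frame df with store_id, item_id, date, and sales columns) is an assumption for illustration, not the project's actual code:

import pandas as pd

# Hypothetical schema: one row per (store_id, item_id, date) with daily sales in `sales`
def add_lag_features(df: pd.DataFrame, lags=(1, 7, 14, 21, 28)) -> pd.DataFrame:
    df = df.sort_values(["store_id", "item_id", "date"]).copy()
    grp = df.groupby(["store_id", "item_id"])["sales"]
    for lag in lags:
        # Sales exactly `lag` days earlier
        df[f"sales_lag_{lag}"] = grp.shift(lag)
        # Trailing rolling mean, shifted by one day so the current target never leaks in
        df[f"sales_roll_mean_{lag}"] = grp.transform(
            lambda s, w=lag: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return df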

The primary evaluation metric is WAPE (Weighted Absolute Percentage Error), alongside MAE and RMSE. WAPE is more robust than MAPE for retail datasets because it doesn't break when sales values are zero or near-zero — a domain-aware choice that matters in practice.
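
For reference, the two metrics differ only in where the division happens: WAPE normalizes the summed absolute error by the summed actual sales, so zero-sales rows can't blow it up. A minimal sketch of both definitions:

import numpy as np

def wape(y_true, y_pred):
    # Weighted Absolute Percentage Error: sum(|error|) / sum(|actual|)
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()

def mape(y_true, y_pred):
    # MAPE divides row by row, so any zero actual produces a division by zero
    return np.mean(np.abs((y_true - y_pred) / y_true))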


SHAP as Microscope, Not Slide Deck

Here is where the project gets interesting.

Finding 1: Weather Contributes 0.4%

After running SHAP TreeExplainer and examining the global feature importance rankings, one number stood out: weather features collectively accounted for approximately 0.4% of total predictive power. Item-level rolling averages — historical mean sales per product — captured roughly 50% of total feature importance.

If you only look at aggregate WAPE, you won't see this. The model performs well. Weather features don't actively harm accuracy, but SHAP makes plain that the model has essentially learned nothing from them. This is the critical distinction between "a feature is in your dataset" and "a feature carries information the model can use."

The practical consequence: the weather data integration pipeline — API calls, join logic, imputation for missing values — carries ongoing engineering complexity that contributes essentially nothing to forecast quality. Without SHAP, this fact could sit quietly in the pipeline indefinitely, because no prediction metric would flag it.

import shap
 
explainer = shap.TreeExplainer(lgb_model)
shap_values = explainer.shap_values(X_test)
 
# Global summary — immediately see which features are pulling weight
shap.summary_plot(shap_values, X_test, plot_type="bar")

The summary bar chart ranks all features by mean absolute SHAP value. Weather features will appear near the bottom — and that's the moment you start asking whether their pipeline complexity is justified.
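
To turn that ranking into a number like the 0.4% figure, you can aggregate mean absolute SHAP values by feature group. A sketch, assuming the weather columns share recognizable prefixes (the project's actual grouping logic may differ):

import numpy as np

# Global importance per feature: mean |SHAP value| across the test set
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Hypothetical prefix matching; adapt to however your weather columns are named
weather_cols = [c for c in X_test.columns if c.startswith(("temp", "precip", "weather"))]
weather_idx = [X_test.columns.get_loc(c) for c in weather_cols]

weather_share = mean_abs_shap[weather_idx].sum() / mean_abs_shap.sum()
print(f"Weather group share of total importance: {weather_share:.1%}")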

Finding 2: Strong-Item / Weak-Store Misalignment

SHAP also revealed a subtler problem at the local level: store-level features sometimes fail to counteract overprediction, specifically in cases where a high-performing item (strong item) sits in a low-performing store (weak store).

When examining SHAP waterfall plots for individual predictions in this segment, the pattern is clear: item-level rolling average features push the prediction toward high values, while store-level features don't pull back hard enough to compensate. The result is systematic overprediction for a specific slice of the data.

This is exactly the kind of failure that no aggregate metric will catch unless you deliberately slice your error analysis by that segment. SHAP local explanations surface it naturally — you look at one prediction, see the contribution of each feature, and recognize the pattern.

# Local explanation for a specific prediction
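# i: row index of one strong-item/weak-store prediction in X_test, chosen upstream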
shap.waterfall_plot(shap.Explanation(
    values=shap_values[i],
    base_values=explainer.expected_value,
    data=X_test.iloc[i],
    feature_names=X_test.columns.tolist()
))

The waterfall plot shows each feature either pushing the prediction up or pulling it down, from the base value to the final output. In strong-item/weak-store cases, you'll see item rolling average features driving a large positive shift while store features apply only a weak corrective force.
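
To check that this is a pattern rather than a single anecdote, one option is to compare the average SHAP contribution of item-level and store-level feature groups across the whole segment. A sketch; the boolean seg_mask and the prefix-based grouping are assumptions for illustration:

# seg_mask: boolean array marking the strong-item/weak-store rows (defined upstream)
item_idx = [X_test.columns.get_loc(c) for c in X_test.columns if c.startswith("item_")]
store_idx = [X_test.columns.get_loc(c) for c in X_test.columns if c.startswith("store_")]

item_push = shap_values[seg_mask][:, item_idx].mean(axis=0).sum()
store_push = shap_values[seg_mask][:, store_idx].mean(axis=0).sum()

print(f"Avg item-feature contribution:  {item_push:+.2f}")
print(f"Avg store-feature contribution: {store_push:+.2f}")
# A large positive item_push with a store_push that barely offsets it is the
# systematic-overprediction signature for this segment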


What These Findings Actually Mean

Neither of these findings is a project failure — they are the project's most valuable output. A good PoC doesn't just demonstrate that a problem is solvable. It maps the risk terrain and identifies exactly what questions a Phase 2 effort would need to answer.

The SHAP report in docs/ does this explicitly: it records specific numbers (0.4% weather importance, ~50% item rolling features), flags feature-weight imbalances as an open problem, and raises the question of whether weather pipeline complexity is justified. That is the kind of documentation an engineering team needs to decide whether to proceed to production — and if so, what to fix first.

The Streamlit app serves a different purpose: it gives the business side a way to interact with forecasts and SHAP explanations for specific stores and items. Non-technical stakeholders can ask "why is this store being forecast so high?" and get an answer grounded in feature contributions. In retail forecasting, this kind of per-prediction accountability is often more valuable than an improved WAPE score.


Common Mistakes When Using SHAP

1. Using SHAP only for presentations, not for debugging

The global summary plot is useful for communication, but the real diagnostic value is in local explanations and dependency plots. If you generate a summary bar chart and stop there, you're skipping the part that actually improves your model.

2. Confusing "low SHAP value" with "irrelevant feature in the real world"

Weather features having near-zero SHAP values doesn't necessarily mean weather doesn't influence retail sales. It means this model, with this feature engineering, failed to learn that signal. The problem might be feature representation — the wrong aggregation window, missing interaction terms — not the feature's fundamental relevance.

3. Trusting aggregate metrics without segment-level error analysis

Strong-item/weak-store misalignment is invisible in WAPE averaged across the entire dataset. Always slice error analysis by key categorical dimensions (store tier, item category, season) before concluding a model is working correctly.
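
As an illustration, a sketch of that slicing with a WAPE helper; the evaluation frame df_eval and its columns (y_true, y_pred, store_tier, item_category) are hypothetical:

import numpy as np

def wape_by_segment(df_eval, by):
    # WAPE computed independently for every combination of the slicing columns
    return (
        df_eval.groupby(by)
        .apply(lambda g: np.abs(g["y_true"] - g["y_pred"]).sum() / np.abs(g["y_true"]).sum())
        .sort_values(ascending=False)
    )

print(wape_by_segment(df_eval, by=["store_tier", "item_category"]))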

4. Leaving SHAP analysis in notebooks instead of structured docs

SHAP findings are business-critical information. If they only exist as notebook output, they will be forgotten. The docs/ pattern in this project — a separate SHAP report with specific numbers — is what makes findings actionable beyond the analysis phase.

5. Running SHAP on the full test set and drawing immediate conclusions

SHAP values can be dominated by high-frequency samples. Consider sampling strategically — prioritize edge cases, outliers, and the segments you care most about — rather than running on a flat random sample.

6. Ignoring SHAP interaction values

In retail forecasting, item and store features frequently interact (a specific item performs well at a specific store, not universally). shap.dependence_plot with interaction_index can reveal these interactions, which summary plots miss entirely.
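
For example, with hypothetical feature names standing in for the project's actual columns:

# Color each point by a store-level feature to expose item-by-store interaction effects
shap.dependence_plot(
    "item_rolling_mean_28",
    shap_values,
    X_test,
    interaction_index="store_rolling_mean_28",
)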


Key takeaways:

  • SHAP TreeExplainer produces both global feature rankings and local per-prediction decompositions — you need both, not one or the other
  • In this project, item-level rolling averages drove ~50% of feature importance while weather features contributed just 0.4% — SHAP turned those numbers into evidence for a pipeline decision
  • Local SHAP explanations caught a strong-item/weak-store misalignment that aggregate WAPE metrics could not see
  • A good PoC documents failure modes as clearly as it demonstrates successes — that documentation is what enables a Phase 2 decision
  • A docs/ folder with a dedicated SHAP report, separate from notebook output, is what keeps findings alive after the analysis sprint ends
Tags: shap · lightgbm · xai · sales-forecasting · optuna · feature-engineering

Sources

  1. nguyenhads/sales_forecasting_xai — GitHub