PBO

Indicator concept
Probability of Backtest Overfitting
Sweep-level metric: how likely it is that the best backtest candidate won by luck alone and flops out-of-sample. It is computed over all C(S, S/2) in-sample/out-of-sample splits of a [slices × candidates] performance matrix (CSCV, Bailey & López de Prado). ≈0 = selection generalizes; ≈0.5 = pure chance; >0.5 = anti-predictive.

What this is about

PBO answers a question that a single backtest never asks: Did we pick the winner out of N candidates purely by chance? It is a sweep-level metric (one value per selection), not a per-strategy metric. It was introduced by Bailey, Borwein, López de Prado and Zhu and is estimated with the CSCV method (Combinatorially Symmetric Cross-Validation).

When a mega-sweep tests thousands of parameter combinations, almost always one of them looks great in the backtest — through pure luck. PBO estimates the probability that the in-sample best candidate lands below the median out-of-sample, i.e. fails to generalize.
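The "lucky winner" effect is easy to reproduce. The following is a minimal sketch, not part of Botty: the function name best_of_noise, the 252-day year, and the 1% daily volatility are illustrative assumptions. Every candidate is pure noise with zero true edge, yet the best backtest Sharpe grows as more candidates are tried.

```python
import numpy as np

def best_of_noise(n_candidates, n_days=252, seed=0):
    """Return the best annualized Sharpe among candidates that are pure noise."""
    rng = np.random.default_rng(seed)
    # Daily returns with zero true edge: every candidate is worthless by construction.
    returns = rng.normal(loc=0.0, scale=0.01, size=(n_candidates, n_days))
    sharpe = returns.mean(axis=1) / returns.std(axis=1, ddof=1) * np.sqrt(252)
    return float(sharpe.max())

# The "winner" looks better the more candidates the sweep tries.
for n in (1, 10, 100, 1000):
    print(n, round(best_of_noise(n), 2))
```

None of these winners has any edge, so a good backtest number alone cannot separate skill from search size. That gap is what PBO is meant to quantify.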

The problem it solves

Walk-Forward and Out-of-Sample ask: Does THIS one winner hold up across multiple time windows? That misses the selection bias: with enough candidates, you will always find one that also gets through all windows by chance. PBO asks the bigger question: Was the SELECTION of this winner out of N candidates reliable at all — or data snooping?

How it is computed (CSCV)

  1. Build a performance matrix M of shape [S time slices × N candidates]: each candidate (a Phase-2 finalist with fixed parameters) is backtested on each of S equally long, contiguous slices of the history.
  2. For every partition of the S slices into an in-sample half (IS) and its complementary OOS half (C(S, S/2) splits in total):
       a. rank the candidates by mean IS performance and take the IS best,
       b. determine its relative rank ω in the OOS half (ω ∈ (0,1)),
       c. compute the logit λ = ln(ω / (1 − ω)); λ ≤ 0 means the IS best lands in the lower half OOS.
  3. PBO = fraction of splits with λ ≤ 0.
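The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Botty's cscv_pbo(): the name cscv_pbo_sketch is hypothetical, higher performance is assumed to be better, ties go to the first candidate, and the relative rank is taken as rank / (N + 1) so that ω stays strictly inside (0, 1).

```python
from itertools import combinations
import numpy as np

def cscv_pbo_sketch(M):
    """Estimate PBO from a [S slices x N candidates] performance matrix.

    Hypothetical helper for illustration; returns (pbo, median_logit).
    """
    M = np.asarray(M, dtype=float)
    S, N = M.shape
    if S < 2 or S % 2 or N < 2:
        raise ValueError("need an even number of slices and at least 2 candidates")
    all_slices = set(range(S))
    logits = []
    for is_idx in combinations(range(S), S // 2):
        oos_idx = sorted(all_slices - set(is_idx))   # complementary half
        is_perf = M[list(is_idx)].mean(axis=0)       # mean IS performance per candidate
        oos_perf = M[oos_idx].mean(axis=0)           # mean OOS performance per candidate
        best = int(np.argmax(is_perf))               # IS best (first one on ties)
        # OOS rank of the IS best: 1 = worst, N = best.
        rank = 1 + int(np.sum(oos_perf < oos_perf[best]))
        omega = rank / (N + 1)                       # assumed convention, keeps omega in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    return float(np.mean(logits <= 0.0)), float(np.median(logits))

rng = np.random.default_rng(0)
noise = rng.normal(size=(8, 20))   # 8 slices, 20 candidates with no real edge
pbo, median_logit = cscv_pbo_sketch(noise)
print(f"PBO on pure noise: {pbo:.2f}, median logit: {median_logit:.2f}")
```

A matrix in which one candidate dominates every slice yields PBO = 0, while a matrix whose IS ranking is reversed OOS yields PBO = 1.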

Botty uses the fixed-params variant (candidates keep their swept parameters, no re-optimization per split). Cost: ~N × S backtests — comparable to a Phase-3 run.
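To make the cost concrete, here is a small arithmetic sketch. The helper name cscv_cost and the example of 20 finalists are assumptions for illustration; 16 slices is the default mentioned under "How Botty uses it". The backtests fill the matrix once, and the splits are then evaluated on that matrix without further backtests.

```python
from math import comb

def cscv_cost(n_candidates, n_slices=16):
    """Return (backtests, splits) for one fixed-params PBO computation."""
    backtests = n_candidates * n_slices      # one backtest per matrix cell
    splits = comb(n_slices, n_slices // 2)   # IS/OOS partitions evaluated on the finished matrix
    return backtests, splits

print(cscv_cost(20))   # (320, 12870)
```

With 20 finalists this is 320 backtests but 12,870 splits, so the expensive part scales with N × S while the combinatorial part is cheap array arithmetic.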

Interpretation

PBO      Meaning
≈ 0.0    Picking the IS winner reliably also picks an OOS winner; the search is not overfit.
≈ 0.5    The selection is no better than a coin flip: pure overfit / data snooping. The backtest says nothing about the future.
> 0.5    The search is anti-predictive: the IS winner is actually more likely to be worse than the median OOS.

How Botty uses it

  • backtesting/cpcv.py: cscv_pbo() (pure math) + compute_sweep_pbo() (builds the matrix via backtests). Default n_slices = 16, candidates = the sweep's Phase-2 finalists.
  • The result lives as pbo.json in the sweep directory; additional outputs: median_logit and oos_degradation_mean (mean OOS performance of the IS best minus the OOS median).
  • In the mega-sweep archive list (backtesting/ui.py) there is a PBO button per sweep; the result appears as a colored banner: < 0.2 green ("winner holds up in new periods"), < 0.5 yellow ("only partly reliable"), ≥ 0.5 red ("winner was probably luck").
  • Computed on demand and processed serially in a queue (a single PBO computation is itself N × 16 backtests). Requires ≥ 2 finalists and a completed Phase 3.
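The banner thresholds can be written as a small mapping. This is a sketch of the rule as described in the list above, not the code in backtesting/ui.py; the function name pbo_banner and the return shape are hypothetical.

```python
def pbo_banner(pbo):
    """Map a PBO value to the (color, message) pair of the banner."""
    if not 0.0 <= pbo <= 1.0:
        raise ValueError("PBO is a fraction of splits and must lie in [0, 1]")
    if pbo < 0.2:
        return "green", "winner holds up in new periods"
    if pbo < 0.5:
        return "yellow", "only partly reliable"
    return "red", "winner was probably luck"

print(pbo_banner(0.12))
print(pbo_banner(0.50))
```

Note that the boundaries are half-open: exactly 0.2 is already yellow and exactly 0.5 is already red.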

How it differs from MC and walk-forward

  • Walk-forward / OOS: test one candidate across regime windows. PBO tests the selection from many.
  • Monte Carlo (block bootstrap of trade PnLs): describes the risk distribution of a fixed trade set (drawdown, ruin). PBO describes the overfit risk of the selection. They answer different questions and do not substitute for each other.
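For contrast, a block bootstrap of trade PnLs might look like the sketch below. It is not Botty's Monte Carlo implementation: the function name, the block length of 5 trades, and the max-drawdown statistic are assumptions chosen for illustration. It takes one fixed trade set as input and never compares candidates, which is why it cannot say anything about selection bias.

```python
import numpy as np

def block_bootstrap_drawdowns(pnls, block=5, n_paths=1000, seed=0):
    """Resample trade PnLs in contiguous blocks; return the max drawdown of each path."""
    pnls = np.asarray(pnls, dtype=float)
    n = len(pnls)
    if n < block:
        raise ValueError("need at least one full block of trades")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block)   # ceiling division
    drawdowns = np.empty(n_paths)
    for p in range(n_paths):
        # Draw block start positions with replacement, then trim to the original length.
        starts = rng.integers(0, n - block + 1, size=n_blocks)
        path = np.concatenate([pnls[s:s + block] for s in starts])[:n]
        equity = np.concatenate([[0.0], np.cumsum(path)])
        # Max drawdown: largest drop from a running equity peak.
        drawdowns[p] = np.max(np.maximum.accumulate(equity) - equity)
    return drawdowns

trades = np.random.default_rng(1).normal(loc=0.2, scale=1.0, size=200)
dd = block_bootstrap_drawdowns(trades)
print("95th percentile max drawdown:", round(float(np.percentile(dd, 95)), 2))
```

The output is a risk distribution for one strategy's trades, whereas PBO is a single number describing the whole sweep.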

Trade-offs

✅ The only metric that directly measures the selection bias of a sweep, which is exactly the risk that Phase 3 does not cover.
✅ Robust against lucky winners: uses all C(S, S/2) splits instead of a single holdout.

❌ Meaningful only with enough candidates (N ≥ ~10) and enough slices (S even and ≥ 4); unstable with few finalists.
❌ The fixed-params variant does not re-optimize per split; it measures overfit of the selection, not of the parameter fitting itself.
❌ Computationally expensive (N × S backtests per sweep).