What this is about
PBO (Probability of Backtest Overfitting) answers a question that a single backtest never asks: did we pick the winner out of N candidates purely by chance? It is a sweep-level metric (one value per selection), not a per-strategy one. It was introduced by Bailey, Borwein, López de Prado, and Zhu and is estimated with the CSCV method (combinatorially symmetric cross-validation).
When a mega-sweep tests thousands of parameter combinations, one of them will almost always look great in the backtest through pure luck. PBO estimates the probability that the in-sample best candidate lands below the median out-of-sample, i.e. fails to generalize.
The problem it solves
Walk-forward and out-of-sample (OOS) tests ask: does THIS one winner hold up across multiple time windows? That misses selection bias: with enough candidates, one of them will very likely pass every window by chance. PBO asks the bigger question: was the SELECTION of this winner out of N candidates reliable at all, or was it data snooping?
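The selection effect described above is easy to reproduce with pure noise. The following is a minimal sketch, not part of Botty: the function name `best_of_n_by_luck`, the Gaussian daily returns, and all parameter values are assumptions chosen for illustration. Every simulated candidate has zero true edge, yet the in-sample winner still looks profitable.

```python
import random
import statistics


def best_of_n_by_luck(n_candidates=1000, n_days=250, seed=0):
    """Simulate candidates with zero true edge and pick the in-sample best."""
    rng = random.Random(seed)
    best_is, best_oos = float("-inf"), None
    for _ in range(n_candidates):
        # Each candidate is pure noise: daily returns with mean 0 (assumed model).
        is_returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        oos_returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        is_mean = statistics.fmean(is_returns)
        if is_mean > best_is:
            # Remember the OOS result of whichever candidate currently leads IS.
            best_is, best_oos = is_mean, statistics.fmean(oos_returns)
    return best_is, best_oos


best_is, best_oos = best_of_n_by_luck()
print(f"IS mean return of the winner:  {best_is:+.5f}")
print(f"OOS mean return of the winner: {best_oos:+.5f}")
```

The in-sample figure is the maximum of 1000 noisy estimates, so it sits several standard errors above zero, while the out-of-sample figure of the same candidate is just one more draw around zero.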
How it is computed (CSCV)
- Build a performance matrix M of shape [S time slices × N candidates]: each candidate (a Phase-2 finalist with fixed parameters) is backtested on each of S equally long, contiguous slices of the history.
- For every partition of the S slices into an in-sample (IS) half and its complementary out-of-sample (OOS) half, which gives C(S, S/2) splits:
  - rank the candidates by mean IS performance and take the IS best,
  - determine its relative rank ω in the OOS half (ω ∈ (0, 1), higher is better),
  - compute the logit λ = ln(ω / (1 − ω)); λ ≤ 0 means the IS best lands in the lower half OOS.
- PBO = fraction of splits with λ ≤ 0.
Botty uses the fixed-params variant (candidates keep their swept parameters, no re-optimization per split). Cost: ~N × S backtests — comparable to a Phase-3 run.
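The CSCV steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Botty's `cscv_pbo()`: the name `cscv_pbo_sketch`, the rank convention ω = rank / (N + 1) with ties counted in the winner's favor, and the exact definitions of the two extra outputs are assumptions made for the example.

```python
import math
import statistics
from itertools import combinations


def cscv_pbo_sketch(M):
    """Estimate PBO from a performance matrix M[s][n] (S slices x N candidates)."""
    S, N = len(M), len(M[0])
    if S < 2 or S % 2 or N < 2:
        raise ValueError("need an even number of slices and at least 2 candidates")

    logits, degradations = [], []
    # Choosing the IS half fixes the OOS half, so this visits all C(S, S/2) splits.
    for is_slices in combinations(range(S), S // 2):
        oos_slices = [s for s in range(S) if s not in is_slices]
        is_mean = [statistics.fmean(M[s][n] for s in is_slices) for n in range(N)]
        oos_mean = [statistics.fmean(M[s][n] for s in oos_slices) for n in range(N)]

        best = max(range(N), key=is_mean.__getitem__)  # IS winner (first on ties)
        # Assumed rank convention: 1 = worst OOS, N = best OOS, so omega is in (0, 1).
        rank = sum(1 for v in oos_mean if v <= oos_mean[best])
        omega = rank / (N + 1)
        logits.append(math.log(omega / (1 - omega)))
        degradations.append(oos_mean[best] - statistics.median(oos_mean))

    return {
        "pbo": sum(1 for lam in logits if lam <= 0) / len(logits),
        "median_logit": statistics.median(logits),
        "oos_degradation_mean": statistics.fmean(degradations),
    }


stable = [[3.0, 2.0, 1.0]] * 4                 # candidate 0 wins in every slice
flipped = [[3.0, 2.0, 1.0], [1.0, 2.0, 3.0]]   # the ranking reverses between slices
print(cscv_pbo_sketch(stable)["pbo"])    # 0.0
print(cscv_pbo_sketch(flipped)["pbo"])   # 1.0
```

The two toy matrices mark the extremes: a candidate that wins everywhere gives PBO 0, and a ranking that flips between halves gives PBO 1. The backtests that fill M are the expensive part; the loop over splits is only arithmetic on the finished matrix.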
Interpretation
| PBO | Meaning |
|---|---|
| ≈ 0.0 | Picking the IS winner reliably also picks an OOS winner — the search is not overfit. |
| ≈ 0.5 | The selection is no better than a coin flip — pure overfit / data snooping. The backtest says nothing about the future. |
| > 0.5 | The search is anti-predictive: the IS winner is actually more likely to be worse than the median OOS. |
How Botty uses it
- `backtesting/cpcv.py` → `cscv_pbo()` (pure math) + `compute_sweep_pbo()` (builds the matrix via backtests). Default `n_slices = 16`, candidates = the sweep's Phase-2 finalists.
- The result lives as `pbo.json` in the sweep directory; additional outputs: `median_logit` and `oos_degradation_mean` (mean OOS performance of the IS best minus the OOS median).
- In the mega-sweep archive list (`backtesting/ui.py`) there is a PBO button per sweep; the result appears as a colored banner: < 0.2 green ("winner holds up in new periods"), < 0.5 yellow ("only partly reliable"), ≥ 0.5 red ("winner was probably luck").
- Computed on demand and processed serially in a queue (a single PBO computation is itself N × 16 backtests). Requires ≥ 2 finalists and a completed Phase 3.
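
The banner thresholds above amount to a three-way classification. The sketch below only restates them in code; the function name `pbo_banner` and the tuple return shape are hypothetical and do not describe what `backtesting/ui.py` actually contains.

```python
def pbo_banner(pbo):
    """Map a PBO value to the banner color and message listed above."""
    if not 0.0 <= pbo <= 1.0:
        raise ValueError("PBO is a fraction of splits and must lie in [0, 1]")
    if pbo < 0.2:
        return "green", "winner holds up in new periods"
    if pbo < 0.5:
        return "yellow", "only partly reliable"
    return "red", "winner was probably luck"


for value in (0.05, 0.2, 0.5):
    print(value, pbo_banner(value))
```

Note the boundary behavior that follows from the stated thresholds: exactly 0.2 is already yellow, and exactly 0.5 is already red.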
How it differs from MC and walk-forward
- Walk-forward / OOS: test one candidate across regime windows. PBO tests the selection from many.
- Monte Carlo (block bootstrap of trade PnLs): describes the risk distribution of a fixed trade set (drawdown, ruin). PBO describes the overfit risk of the selection. They answer different questions and do not substitute for each other.
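
For contrast, a block bootstrap of trade PnLs might look like the sketch below. It is a generic illustration, not Botty's Monte Carlo module: the function names, the block length, and the choice of maximum drawdown as the risk statistic are assumptions. Note that its input is the trade list of one fixed strategy, so nothing in it can say whether that strategy was selected by luck.

```python
import random


def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative PnL curve (starting at 0)."""
    equity = peak = worst = 0.0
    for pnl in pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def block_bootstrap_drawdowns(pnls, block_len=5, n_paths=1000, seed=0):
    """Resample contiguous blocks of trades and return the sorted max drawdowns."""
    n = len(pnls)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and the number of trades")
    rng = random.Random(seed)
    drawdowns = []
    for _ in range(n_paths):
        path = []
        while len(path) < n:
            # Draw a random contiguous block to preserve short-range dependence.
            start = rng.randrange(n - block_len + 1)
            path.extend(pnls[start:start + block_len])
        drawdowns.append(max_drawdown(path[:n]))
    return sorted(drawdowns)


trades = [1.0, -2.0, 1.5, -0.5, 2.0, -3.0, 1.0, 0.5, -1.0, 2.5]
dd = block_bootstrap_drawdowns(trades, block_len=3, n_paths=500)
print("median max drawdown:", dd[len(dd) // 2])
print("95th percentile:    ", dd[int(0.95 * len(dd))])
```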
Trade-offs
- ✅ The only metric that directly measures the selection bias of a sweep, which is exactly the risk that Phase 3 does not cover.
- ✅ Robust against lucky winners: uses all C(S, S/2) splits instead of a single holdout.
- ❌ Meaningful only with enough candidates (N ≥ ~10) and enough slices (S even and ≥ 4); unstable with few finalists.
- ❌ The fixed-params variant does not re-optimize per split: it measures overfit of the selection, not of the parameter fitting itself.
- ❌ Computationally expensive (N × S backtests per sweep).
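
To put the cost and the slice requirement in numbers, the short sketch below tabulates the backtest count N × S next to the split count C(S, S/2). The helper name `pbo_cost` and the example value N = 20 are assumptions for illustration.

```python
import math


def pbo_cost(n_candidates, n_slices):
    """Return the backtest count and the number of CSCV splits for one PBO run."""
    if n_slices < 2 or n_slices % 2:
        raise ValueError("n_slices must be even and at least 2")
    return {
        "backtests": n_candidates * n_slices,
        "splits": math.comb(n_slices, n_slices // 2),
    }


for s in (4, 8, 16):
    print(s, pbo_cost(20, s))
# 4 {'backtests': 80, 'splits': 6}
# 8 {'backtests': 160, 'splits': 70}
# 16 {'backtests': 320, 'splits': 12870}
```

Because PBO is a fraction of splits, its resolution is 1 / C(S, S/2): with S = 4 it can only move in steps of 1/6, while S = 16 yields 12870 splits. The backtest count grows only linearly in S, so the extra slices buy resolution cheaply as long as each slice remains long enough to backtest meaningfully.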