What this is about
The most common self-deception in backtesting: taking a pretty equity curve from one run at face value. A backtest is only one path of history — it can shine through luck or data mining. "Promising" therefore does not mean one good number, but passing a pipeline of stages, each of which rules out a different way of lying.
The five stages
Stage 0 — Is the sample big enough?
Below ~30 trades every metric is noise. Botty's mega-sweep sets a hard cut: trades ≥ 30 (training), otherwise score = 0.
Stage 1 — Single-backtest metrics (first look)
Glance over them quickly, but believe none of them on their own:
- Return %: total percentage profit over the period
- Profit Factor: gross profit ÷ gross loss; Botty gate ≥ 1.1, "good" from ~1.3
- Max Drawdown: deepest peak-to-trough decline; the single-backtest DD often underestimates the real one by a factor of 2–3
- Calmar / MAR: return ÷ drawdown; risk efficiency
- Sharpe, Win Rate: context, not a verdict
⚠️ One backtest = one path of history. On its own it proves nothing.
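The first-look metrics can be computed from a list of per-trade profit-and-loss values roughly as follows. This is a minimal sketch, not Botty's code: the function names, the per-trade P&L input, and the fixed starting equity are assumptions made for illustration.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss (inf if there are no losing trades)."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    return gross_profit / gross_loss if gross_loss > 0 else float("inf")


def total_return_pct(trade_pnls, start_equity):
    """Net profit over the period as a percentage of the starting equity."""
    return sum(trade_pnls) / start_equity * 100.0


def max_drawdown_pct(trade_pnls, start_equity):
    """Deepest peak-to-trough decline of the equity curve, in percent."""
    equity = peak = start_equity
    worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak * 100.0)
    return worst


def calmar(return_pct, drawdown_pct):
    """Return divided by drawdown (inf if the path never drew down)."""
    return return_pct / drawdown_pct if drawdown_pct > 0 else float("inf")


# Hypothetical trade list, for illustration only.
pnls = [100, -50, 200, -100, 50]
ret = total_return_pct(pnls, 1000)
dd = max_drawdown_pct(pnls, 1000)
print(profit_factor(pnls), ret, dd, calmar(ret, dd))
```

Five trades are far below the Stage 0 cut, so the numbers printed here are exactly the kind of noise the pipeline is meant to filter out.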
Stage 2 — Mega-sweep score (training ranking)
Botty's compute_score (backtesting/megasweep.py) condenses this into a single number with hard gates — below them the score is 0:
Gates: trades ≥ 30 · return_pct ≥ 1.0% · profit_factor ≥ 1.1
Score = return_pct × MAR × pf_bonus
MAR = min(return_pct / max(DD, 1%), 5) # DD floored at 1%, MAR capped at 5
pf_bonus = min(pf / 2, 1.5)
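The formula above can be sketched in Python as follows. This is a reconstruction from the stated gates and formula only, not a copy of backtesting/megasweep.py; the function name `compute_score_sketch`, its signature, and the convention that drawdown is passed in percent are assumptions.

```python
def compute_score_sketch(trades, return_pct, max_dd_pct, profit_factor):
    """Training score following the gates and formula stated above.

    return_pct and max_dd_pct are in percent (assumed units).
    """
    # Hard gates: failing any one of them zeroes the score.
    if trades < 30 or return_pct < 1.0 or profit_factor < 1.1:
        return 0.0
    # MAR: return per unit of drawdown, DD floored at 1 %, ratio capped at 5.
    mar = min(return_pct / max(max_dd_pct, 1.0), 5.0)
    # Profit-factor bonus, capped at 1.5 (reached at a profit factor of 3).
    pf_bonus = min(profit_factor / 2.0, 1.5)
    return return_pct * mar * pf_bonus


# Hypothetical inputs: 50 trades, +12 % return, 4 % drawdown, PF 1.6.
print(compute_score_sketch(50, 12.0, 4.0, 1.6))
# Same result quality, but only 20 trades: gated to zero.
print(compute_score_sketch(20, 12.0, 4.0, 1.6))
```

The floor and the two caps keep a single extreme input, such as a near-zero drawdown or a huge profit factor from a handful of winners, from dominating the ranking.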
A high score = good in the training period. But it still says nothing about unseen data — this is where overfitting lurks.
Stage 3 — Walk-forward (phase 3): the most important gate
Does the strategy also win in several unseen time windows (6 months of training → 2 months of test, rolling)? Botty ranks candidates "all-profitable-first", by their worst-case window. Goal: positive in the majority of windows (the live strategy BB_EXTREME managed 7/9, for instance). Negative in the majority → out, no matter how pretty the training score was.
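A rolling 6-month/2-month split and the two checks described here can be sketched as follows. This is an illustration under assumptions, not Botty's implementation: the helper names are hypothetical, windows are assumed to start on the first of a month and to roll forward by one test block, and "all-profitable-first by the worst-case window" is read as sorting by the minimum window return, which puts every candidate that is positive in all windows ahead of the rest.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a number of months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def walk_forward_windows(start, end, train_months=6, test_months=2):
    """Return (train_start, test_start, test_end) tuples; end bounds are exclusive."""
    windows = []
    train_start = start
    while True:
        test_start = add_months(train_start, train_months)
        test_end = add_months(test_start, test_months)
        if test_end > end:
            break
        windows.append((train_start, test_start, test_end))
        train_start = add_months(train_start, test_months)  # roll by one test block
    return windows


def passes_majority(window_returns):
    """True if strictly more than half of the test windows are positive."""
    positive = sum(1 for r in window_returns if r > 0)
    return positive > len(window_returns) / 2


def rank_by_worst_window(candidates):
    """Order candidate names by their worst test-window return, best first."""
    return sorted(candidates, key=lambda name: min(candidates[name]), reverse=True)


windows = walk_forward_windows(date(2023, 1, 1), date(2025, 1, 1))
print(len(windows))  # 24 months of data give 9 rolling windows

# Hypothetical per-window test returns in percent, for illustration only.
candidates = {"a": [1.2, -0.5, 2.0], "b": [0.3, 0.4, 0.2], "c": [5.0, 4.0, -3.0]}
print(rank_by_worst_window(candidates))
```

Ranking by the worst window deliberately favors the unspectacular but consistent candidate "b" over "c", whose average is higher but which has one badly losing window.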
Stage 4 — PBO: was it just luck?
If you pick the best out of many candidates, one of them will almost always win by chance alone. PBO (Probability of Backtest Overfitting, computed via CSCV after Bailey & López de Prado, backtesting/cpcv.py) measures exactly that:
- PBO < 0.5 → the selection generalizes (good)
- < 0.3 → strong
- ≥ 0.5 → the search is pure overfit / data snooping → discard
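The idea behind CSCV can be sketched in a few lines: split the history into blocks, use every half of the blocks once as in-sample, pick the in-sample winner, and check where that winner ranks out-of-sample. The sketch below is a simplified illustration, not the contents of backtesting/cpcv.py. Its assumptions: Sharpe ratio as the performance measure, 8 blocks, per-period returns as input, and PBO counted as the share of splits in which the in-sample winner lands at or below the out-of-sample median (logit ≤ 0).

```python
import itertools
import math
import random
import statistics


def sharpe(returns):
    """Mean over standard deviation of per-period returns (0 if flat)."""
    sd = statistics.pstdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_strategy, n_blocks=8):
    """returns_by_strategy: equal-length per-period return lists, one per candidate."""
    n_strategies = len(returns_by_strategy)
    n_periods = len(returns_by_strategy[0])
    block_len = n_periods // n_blocks  # a trailing remainder is dropped
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    logits = []
    for is_blocks in itertools.combinations(range(n_blocks), n_blocks // 2):
        is_idx = [i for b in is_blocks for i in blocks[b]]
        oos_idx = [i for b in range(n_blocks) if b not in is_blocks for i in blocks[b]]
        is_perf = [sharpe([r[i] for i in is_idx]) for r in returns_by_strategy]
        oos_perf = [sharpe([r[i] for i in oos_idx]) for r in returns_by_strategy]
        best = max(range(n_strategies), key=is_perf.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_strategies + 1)
        logits.append(math.log(omega / (1 - omega)))
    return sum(1 for lam in logits if lam <= 0) / len(logits)


# Twenty candidates made of pure noise: there is no edge to find.
rng = random.Random(0)
noise = [[rng.gauss(0.0, 0.01) for _ in range(240)] for _ in range(20)]
print(pbo_cscv(noise))
```

With 8 blocks there are 70 in-sample/out-of-sample splits, so the estimate rests on many recombinations of the same history rather than on a single holdout.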
Stage 5 — Monte Carlo: does it survive the worst case?
The single backtest shows one drawdown. Monte Carlo (trade reshuffle/bootstrap) shows the distribution of possible paths:
- Risk of Ruin ≤ 5 % (industry rule of thumb)
- Position sizing computed backward from the 95th-percentile drawdown, not from the (lucky) single-backtest DD
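A trade bootstrap along these lines can be sketched as follows. It is a minimal illustration under stated assumptions, not Botty's Monte Carlo: trades are given as fractional returns on equity and compounded, "ruin" is defined here as a drawdown of at least 50 % (the threshold is an assumption), and the sizing rule scales linearly with drawdown, which is only an approximation.

```python
import random


def path_max_drawdown(trade_returns):
    """Max peak-to-trough decline (as a fraction) of a compounded equity path."""
    equity = peak = 1.0
    worst = 0.0
    for r in trade_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def monte_carlo(trade_returns, n_paths=2000, ruin_dd=0.5, seed=42):
    """Resample trades with replacement; return (p95 drawdown, risk of ruin)."""
    rng = random.Random(seed)
    n = len(trade_returns)
    dds = sorted(
        path_max_drawdown(rng.choices(trade_returns, k=n)) for _ in range(n_paths)
    )
    p95_dd = dds[min(n_paths - 1, int(0.95 * n_paths))]
    risk_of_ruin = sum(1 for dd in dds if dd >= ruin_dd) / n_paths
    return p95_dd, risk_of_ruin


def size_multiplier(p95_dd, max_tolerable_dd):
    """Scale position size down so the p95 drawdown fits the drawdown budget."""
    return min(1.0, max_tolerable_dd / p95_dd) if p95_dd > 0 else 1.0


# Hypothetical trade list: 40 winners of +2 %, 60 losers of -1 %.
trades = [0.02] * 40 + [-0.01] * 60
p95_dd, ror = monte_carlo(trades)
print(p95_dd, ror, size_multiplier(p95_dd, max_tolerable_dd=0.10))
```

Sizing from the 95th-percentile drawdown means the position is dimensioned for a bad but plausible ordering of the same trades, rather than for the one ordering history happened to produce.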
The traffic light (as it appears in the regime table)
| Maturity | Meaning |
|---|---|
| 🟢 validated | all five stages passed → live-ready (e.g. donchian_breakout, bb_extreme) |
| 🟡 under review | good in-sample, but walk-forward / PBO / MC still open or borderline |
| 🔴 discarded | walk-forward negative in the majority or PBO ≥ 0.5 (e.g. macd_crossover) |
| ⚪ untested | no systematic sweep + walk-forward has been run |
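The table can be read as a small decision rule. The sketch below is one possible encoding, not how the regime table is actually generated: the function name and its parameters are hypothetical, a window counts as negative whenever it is not positive, and the Stage 0–2 gates are assumed to have been passed already.

```python
def maturity(wf_positive=None, wf_total=None, pbo=None, mc_passed=None):
    """Map validation results to a traffic-light label; None means 'not run yet'."""
    if not wf_total:
        return "untested"
    negative = wf_total - wf_positive
    if negative > wf_total / 2 or (pbo is not None and pbo >= 0.5):
        return "discarded"
    if wf_positive > wf_total / 2 and pbo is not None and mc_passed:
        return "validated"
    return "under review"


print(maturity())                                # nothing run yet
print(maturity(wf_positive=1, wf_total=11))      # negative in 10 of 11 windows
print(maturity(wf_positive=7, wf_total=9))       # PBO and Monte Carlo still open
# The PBO value below is hypothetical, chosen only to exercise the rule.
print(maturity(wf_positive=7, wf_total=9, pbo=0.2, mc_passed=True))
```

A tie such as 4 of 8 positive windows is neither a majority for nor against, so it falls through to "under review", matching the "borderline" case in the table.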
Quick reference
| Metric | Botty gate | Rule of thumb "good" |
|---|---|---|
| Trades | ≥ 30 | ≥ 100 |
| Return % (training) | ≥ 1.0 % | context-dependent |
| Profit Factor | ≥ 1.1 | ≥ 1.3 |
| Calmar / MAR | — | ≥ 2 |
| Sharpe | — | ≥ 1.5 (backtest) |
| WF windows positive | majority | ≥ 7/9 |
| PBO | < 0.5 | < 0.3 |
| Risk of Ruin | ≤ 5 % | ≤ 2 % |
| Max DD (real) | — | derive from the MC 95th percentile |
The lesson
Most of the "winners" from stages 1–2 die in stages 3–4. Botty's own history is full of examples: the RSI/vol divergences (debunked three times in walk-forward), vol_regime_transitions (DEAD), macd_crossover (negative in 10/11 windows). That is exactly what the pipeline is for: it separates substance from luck before real money is at stake.
The negative flip side — how to spot an unfit idea early — is covered under What doesn't work in retail trading (and why).