Botty · Live-Readiness

When is a backtest fit for live trading? (Evaluation pipeline)

"Promising" is not a single pretty number but passing a sample-size check (stage 0) and then five stages in sequence: enough trades → solid individual metrics → high mega-sweep score → positive walk-forward → low PBO → survived Monte Carlo worst case. Only then is an entry/filter/exit fit for live trading.

What this is about

The most common self-deception in backtesting: taking a pretty equity curve from one run at face value. A backtest is only one path of history — it can shine through luck or data mining. "Promising" therefore does not mean one good number, but passing a pipeline of stages, each of which rules out a different way of lying.

The five stages

Stage 0 — Is the sample big enough?

Below ~30 trades every metric is noise. Botty's mega-sweep sets a hard cut: trades ≥ 30 (training), otherwise score = 0.

Stage 1 — Single-backtest metrics (first look)

Glance over them quickly, but believe none of them on their own:
- Return %: absolute profit over the period
- Profit Factor: gross profit ÷ gross loss; Botty gate ≥ 1.1, "good" from ~1.3
- Max Drawdown: deepest decline; the single-backtest DD often underestimates the real one by a factor of 2–3
- Calmar / MAR: return ÷ drawdown; risk efficiency
- Sharpe, Win Rate: context, not a verdict
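As a rough illustration of the first-look metrics, the sketch below computes them from a plain list of per-trade profits and losses. The function names, the `start_equity` default, and the fixed-size (non-compounding) equity model are assumptions made for this example, not Botty's actual implementation.

```python
def profit_factor(trade_pnls):
    """Gross profit divided by gross loss; inf if there are no losing trades."""
    gross_profit = sum(p for p in trade_pnls if p > 0)
    gross_loss = -sum(p for p in trade_pnls if p < 0)
    if gross_loss == 0:
        return float("inf") if gross_profit > 0 else 0.0
    return gross_profit / gross_loss


def max_drawdown_pct(trade_pnls, start_equity):
    """Deepest peak-to-trough decline of the equity curve, in percent of the peak."""
    equity = peak = start_equity
    worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak * 100.0)
    return worst


def summarize(trade_pnls, start_equity=10_000.0):
    """First-look metrics for one backtest (hypothetical helper)."""
    return_pct = sum(trade_pnls) / start_equity * 100.0
    dd = max_drawdown_pct(trade_pnls, start_equity)
    wins = sum(1 for p in trade_pnls if p > 0)
    return {
        "trades": len(trade_pnls),
        "return_pct": return_pct,
        "profit_factor": profit_factor(trade_pnls),
        "max_dd_pct": dd,
        "mar": return_pct / dd if dd > 0 else float("inf"),
        "win_rate": wins / len(trade_pnls) if trade_pnls else 0.0,
    }


stats = summarize([100.0, -50.0, 200.0, -100.0, 50.0], start_equity=1_000.0)
```

Every value in `stats` describes this one path only, which is why the later stages re-test the same idea on unseen windows and resampled trade orders.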

⚠️ One backtest = one path of history. On its own it proves nothing.

Stage 2 — Mega-sweep score (training ranking)

Botty's compute_score (backtesting/megasweep.py) condenses this into a single number with hard gates — below them the score is 0:

Gates:  trades ≥ 30   ·   return_pct ≥ 1.0%   ·   profit_factor ≥ 1.1
Score = return_pct × MAR × pf_bonus
        MAR      = min(return / max(DD, 1%), 5)     # DD floored, capped
        pf_bonus = min(pf / 2, 1.5)

A high score = good in the training period. But it still says nothing about unseen data — this is where overfitting lurks.
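The formula above can be transcribed almost literally. The following is a minimal sketch reconstructed from the gates and terms listed here, not a copy of `compute_score` in backtesting/megasweep.py; the function name `compute_score_sketch`, the parameter names, and the convention that return and drawdown are passed as percentages are assumptions.

```python
def compute_score_sketch(trades, return_pct, max_dd_pct, profit_factor):
    """Training score with hard gates; returns 0.0 when any gate fails."""
    # Hard gates: sample size, minimum return, minimum profit factor.
    if trades < 30 or return_pct < 1.0 or profit_factor < 1.1:
        return 0.0
    # MAR: return over drawdown, with the drawdown floored at 1 % and the ratio capped at 5.
    mar = min(return_pct / max(max_dd_pct, 1.0), 5.0)
    # Profit-factor bonus, capped at 1.5.
    pf_bonus = min(profit_factor / 2.0, 1.5)
    return return_pct * mar * pf_bonus


# 24 % return, 8 % drawdown, PF 1.6: MAR = 3.0, pf_bonus = 0.8
score = compute_score_sketch(trades=120, return_pct=24.0, max_dd_pct=8.0, profit_factor=1.6)
```

The drawdown floor keeps a near-zero drawdown from inflating the ratio, and the two caps limit how much a single extreme metric can dominate the ranking.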

Stage 3 — Walk-forward (phase 3): the most important gate

Does the structure also win in several unseen time windows (6 months training → 2 months test, rolling)? Botty ranks "all-profitable-first" by the worst-case window. Goal: positive in the majority of windows (the live strategy BB_EXTREME managed 7/9, for instance). Negative in the majority → out, no matter how pretty the training score was.
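To make the rolling scheme concrete, here is a small sketch that generates the windows and applies the majority rule. It assumes that windows start on the first of a month, that the roll step equals the test length, and that "all-profitable-first by the worst-case window" means sorting fully profitable candidates ahead of the rest and then by their weakest window; all function names are hypothetical.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def walk_forward_windows(start, end, train_months=6, test_months=2):
    """Return (train_start, train_end, test_end) tuples as half-open intervals.

    The test window is [train_end, test_end). Assumption: each roll advances
    by test_months, so test windows do not overlap.
    """
    windows = []
    train_start = start
    while True:
        train_end = add_months(train_start, train_months)
        test_end = add_months(train_end, test_months)
        if test_end > end:
            break
        windows.append((train_start, train_end, test_end))
        train_start = add_months(train_start, test_months)
    return windows


def majority_positive(test_returns):
    """True if more than half of the test windows closed with a profit."""
    positive = sum(1 for r in test_returns if r > 0)
    return positive * 2 > len(test_returns)


def wf_rank_key(test_returns):
    """Sort key: all-profitable candidates first, then the best worst-case window."""
    worst = min(test_returns)
    return (worst <= 0, -worst)


# Two years of data give nine 6+2 windows under these assumptions.
windows = walk_forward_windows(date(2023, 1, 1), date(2025, 1, 1))
```

Sorting a dictionary of candidates with `sorted(candidates, key=lambda name: wf_rank_key(candidates[name]))` would put the most robust structures at the top under this reading of the ranking rule.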

Stage 4 — PBO: was it just luck?

If you pick the best out of many candidates, almost always one of them wins by chance. PBO (Probability of Backtest Overfitting, CSCV after Bailey & López de Prado, backtesting/cpcv.py) measures exactly that:
- PBO < 0.5 → the selection generalizes (good)
- PBO < 0.3 → strong
- PBO ≥ 0.5 → the in-sample winner is more likely than not to land in the bottom half out of sample; the search is overfit / data snooping → discard
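The CSCV idea can be sketched in a few lines: split the history into blocks, use every half of the blocks once as in-sample data, pick the in-sample winner, and check where it ranks on the other half. The version below is a simplified illustration, not the code in backtesting/cpcv.py: it ranks candidates by summed return instead of a risk-adjusted metric, drops trailing rows that do not fill a block, and counts a logit of exactly zero as overfit. The name `pbo_cscv` and the input layout are assumptions.

```python
from itertools import combinations
from math import log


def pbo_cscv(returns, n_blocks=8):
    """Estimate PBO from a matrix of returns.

    returns: list of rows (one per time step), each row holding one return
    per candidate. n_blocks must be even.
    """
    n_rows, n_cands = len(returns), len(returns[0])
    block_len = n_rows // n_blocks  # trailing rows beyond n_blocks * block_len are ignored
    # Summed return of every candidate inside every block.
    blocks = [
        [sum(returns[t][c] for t in range(b * block_len, (b + 1) * block_len))
         for c in range(n_cands)]
        for b in range(n_blocks)
    ]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_perf = [sum(blocks[b][c] for b in is_ids) for c in range(n_cands)]
        oos_perf = [sum(blocks[b][c] for b in oos_ids) for c in range(n_cands)]
        # Candidate that would have been selected on the in-sample half.
        best = max(range(n_cands), key=is_perf.__getitem__)
        # Relative out-of-sample rank of that candidate, strictly inside (0, 1).
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_cands + 1)
        logits.append(log(omega / (1 - omega)))
    # Share of splits where the in-sample winner is at or below the OOS median.
    return sum(1 for value in logits if value <= 0) / len(logits)
```

With 8 blocks this evaluates 70 train/test splits, so the estimate rests on many recombinations of the same history rather than on one split.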

Stage 5 — Monte Carlo: does it survive the worst case?

The single backtest shows one drawdown. Monte Carlo (trade reshuffle/bootstrap) shows the distribution of possible paths:
- Risk of Ruin ≤ 5 % (industry rule of thumb)
- Position sizing computed backward from the 95th-percentile drawdown, not from the (lucky) single-backtest DD
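A minimal bootstrap sketch of this stage follows. It resamples the trade list with replacement, records the maximum drawdown of each synthetic path, and reads off the 95th percentile and a risk-of-ruin estimate. The definition of "ruin" as a drawdown of at least 50 %, the linear sizing rule, and all names and defaults are assumptions made for the example, not Botty's actual settings.

```python
import random


def max_drawdown_pct(trade_pnls, start_equity):
    """Deepest peak-to-trough decline of the equity curve, in percent of the peak."""
    equity = peak = start_equity
    worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak * 100.0)
    return worst


def monte_carlo(trade_pnls, start_equity=10_000.0, n_paths=2000,
                ruin_dd_pct=50.0, seed=42):
    """Bootstrap the trade sequence and summarize the drawdown distribution."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    drawdowns = []
    for _ in range(n_paths):
        # Sample trades with replacement to build one alternative history.
        path = rng.choices(trade_pnls, k=len(trade_pnls))
        drawdowns.append(max_drawdown_pct(path, start_equity))
    drawdowns.sort()
    dd_p95 = drawdowns[int(0.95 * (n_paths - 1))]
    # Assumed ruin definition: a path whose drawdown reaches ruin_dd_pct.
    risk_of_ruin = sum(1 for d in drawdowns if d >= ruin_dd_pct) / n_paths
    return {"dd_p95_pct": dd_p95, "risk_of_ruin": risk_of_ruin}


def size_scale(dd_p95_pct, max_tolerated_dd_pct):
    """First-order scale factor so the 95th-percentile DD fits the tolerance."""
    if dd_p95_pct <= 0:
        return 1.0
    return min(1.0, max_tolerated_dd_pct / dd_p95_pct)


result = monte_carlo([100.0, -80.0, 50.0, -120.0, 200.0, -60.0] * 5)
scale = size_scale(result["dd_p95_pct"], max_tolerated_dd_pct=20.0)
```

The scale factor treats drawdown as roughly proportional to position size, which is only an approximation; it is meant to show the direction of the calculation (from the tail of the distribution back to the size), not a production sizing rule.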

The traffic light (as it appears in the regime table)

Maturity	Meaning
🟢 validated	all five stages passed → live-ready (e.g. donchian_breakout, bb_extreme)
🟡 under review	good in-sample, but walk-forward / PBO / MC still open or borderline
🔴 discarded	walk-forward negative in the majority or PBO ≥ 0.5 (e.g. macd_crossover)
⚪ untested	no systematic sweep + walk-forward has been run
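One possible way to read the table as a decision rule is sketched below. The function name, the use of `None` for "not run yet", and the handling of edge cases (a tie between positive and negative windows, or a risk of ruin above 5 %, both left at "under review") are assumptions for illustration; the table itself does not specify them.

```python
def maturity(sweep_score=None, wf_positive=None, wf_total=None,
             pbo=None, risk_of_ruin=None):
    """Map pipeline results to a maturity label; None means the stage has not run."""
    if sweep_score is None:
        return "untested"
    wf_done = wf_positive is not None and wf_total is not None and wf_total > 0
    # Windows that are not positive count as negative here.
    if wf_done and (wf_total - wf_positive) * 2 > wf_total:
        return "discarded"
    if pbo is not None and pbo >= 0.5:
        return "discarded"
    all_passed = (
        wf_done
        and wf_positive * 2 > wf_total
        and pbo is not None
        and risk_of_ruin is not None
        and risk_of_ruin <= 0.05
    )
    return "validated" if all_passed else "under review"


label = maturity(sweep_score=57.6, wf_positive=7, wf_total=9, pbo=0.2, risk_of_ruin=0.02)
```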

Quick reference

Metric	Botty gate	Rule of thumb "good"
Trades	≥ 30	≥ 100
Return % (training)	≥ 1.0 %	context-dependent
Profit Factor	≥ 1.1	≥ 1.3
Calmar / MAR	—	≥ 2
Sharpe	—	≥ 1.5 (backtest)
WF windows positive	majority	≥ 7/9
PBO	< 0.5	< 0.3
Risk of Ruin	≤ 5 %	≤ 2 %
Max DD (real)	—	derive from the MC 95th percentile

The lesson

Most of the "winners" from stages 1–2 die in stages 3–4. Botty's own history is full of examples: the RSI/vol divergences (debunked three times in walk-forward), vol_regime_transitions (DEAD), macd_crossover (negative in 10/11 windows). That is exactly what the pipeline is for: it separates substance from luck before real money flows.

The negative flip side — how to spot an unfit idea early — is covered under What doesn't work in retail trading (and why).