| Rung | Threshold | What it checks |
|---|---|---|
| Rung 1 — Significance | 95% CI above zero · t ≥ 2 · p_adj < 0.05 · ML: ≥80% of windows positive | Is the effect real or noise? ONLY the multiple-testing-corrected p-value (Šidák, p_adj) counts — the raw p is worthless after a parameter grid. Example catch: divergences (2,900+ cells, best p=0.31). |
| Rung 2 — Economics (ceiling) | bps net/trade × trades/year ≥ ~5%/year unlevered · costs ~9 bps round-trip | Significant ≠ big. A real but tiny edge dies to slippage and stop noise. Example catch: OID@4h — p_adj 0.027, but +10.85 bps × 28 signals/year = 0.3%/year → as a strategy, done_no_winners. |
| Rung 3 — Parameter plateau | plateau_pct ≥ ~0.7, contiguous | Does a broad area of the parameter grid work, or just an isolated spike next to red cells? Spike = overfit, plateau = structure. |
| Rung 4 — Time stability | ≥2/3 of OOS windows positive · check wf_min · NO single window carries >50% of the return | Read the distribution across the time folds, never the mean. Example catch: archetypes A/B — sfp harvested only BULL_2021H1, bb_extreme only Y2021. min_window_wins=2 in training is mandatory. |
| Rung 5 — Selection honesty | PBO < 0.2 · the family survives ≥2 independently configured sweeps | PBO measures whether the SELECTION of the winner generalizes (0.5 = coin flip, >0.5 = counterproductive) — per sweep, not per config. Example catch: ema@1d — wf_avg 8.69 in the first sweep, 0.53 in the reproduction sweep (PBO 0.77). |
| Rung 6 — Monte Carlo (risk) | prob_profit ≥ 0.9 · RoR ~0 · know MC-p5 · halve max_safe_leverage | AFTER the gates, never as validation. MC answers 'how does the risk feel', not 'is the edge real'. EV-CI above zero = a distinguishable edge. |
| Rung 7 — Shuffle test (filters only) | z ≥ 1.5 vs. shuffled data source | A real re-backtest with a shuffled source, NOT post-hoc on the trades (invalid for re-routing). Filter benefit is conditional: bocpd helps bb_extreme (+17%), hurts donchian (−15%); vrp loses its contribution with an adaptive exit. |
Pros
- A fixed order ends the 'but it looks green' debate: it stops at the first red rung
- Every rung has a documented catch as proof of existence — the ladder is empirical, not academic
- The verdict vocabulary becomes uniform: confirmed / pursue / inconclusive / refuted / downgraded have defined conditions
Cons
- The ladder does not protect against regime change: 'confirmed' means 'survived the past in every respect', not 'will surely earn in future'
- The reproduction criterion (rung 5) costs real compute time — two sweeps per family
- Applied strictly, little remains (currently: 1 directional family) — that is a feature, but it feels sparse
The problem
By now we have four evaluation tools, each with its own metrics: the Indicator Lab (edge_bps, t, p_adj, plateau, folds), the megasweep funnel (phase scores, walk-forward, PBO), Monte Carlo (prob_profit, RoR, percentiles), and the shuffle/IC harness in the ML module. Without a reading guide, everything looks equally "green" — and that is exactly how ema@1d and OID briefly slipped through as candidates. This article is the reading guide.
First: WHAT is being claimed? Four claim types, four tools
| Claim | Example | Tool | Core metrics |
|---|---|---|---|
| "Signal predicts movement" | OID pattern → next bar | Indicator Lab | edge_bps, CI, p_adj, folds |
| "Strategy makes money" | donchian+gate+exit @4h | Megasweep funnel | wf windows, PBO, then MC |
| "Filter improves strategy" | vrp_filter on Donchian | Shuffle test | z ≥ 1.5 vs. shuffled source |
| "Model predicts size" | vol forecast | Walk-forward IC | IC + window stability |
The most common reading error is to take evidence of one type as proof for another. OID is the object lesson: a confirmed signal (type 1) is far from a confirmed strategy (type 2).
The evaluation ladder — in this order, stop on red
Rung 1 — Is the effect real? (significance)
Where: Indicator Lab scorecard; ML reports. Read: Does the 95% confidence interval lie entirely above zero? (The single most important number — if the CI includes zero, everything else is decoration.) Then the t-stat (roughly ≥ 2) and p_adj (< 0.05) — only the Šidák-corrected p-value counts; whoever tried 30 grid cells drew 30 lottery tickets, and the raw p ignores that. For ML models: IC plus window stability (21/21 is an argument, 11/21 is not). And always check n — t=2 on 20 samples is an anecdote.
Died here: divergences (3 walk-forwards, 2,900+ cells, best p=0.31), RSI failure swings, directional prediction in general (best IC +0.02).
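The lottery-ticket point can be made concrete with the Šidák formula, p_adj = 1 − (1 − p_raw)^n. The following is a minimal sketch, not the Indicator Lab's actual code: the function names, the normal-approximation CI (factor 1.96) and the `min_n=30` floor are assumptions for illustration.

```python
import math
from statistics import mean, stdev


def sidak_adjust(p_raw: float, n_tests: int) -> float:
    """Šidák-adjusted p-value: chance of seeing p_raw or better at least
    once in n_tests independent tries when there is no effect at all."""
    return 1.0 - (1.0 - p_raw) ** n_tests


def ci95_and_t(edges_bps):
    """Mean, 95% CI and t-stat of per-signal edges (normal approximation).
    Needs at least two non-identical observations."""
    n = len(edges_bps)
    m = mean(edges_bps)
    se = stdev(edges_bps) / math.sqrt(n)
    return {"n": n, "mean": m, "ci_low": m - 1.96 * se,
            "ci_high": m + 1.96 * se, "t": m / se}


def rung1_green(stats, p_adj, min_n=30):
    # min_n=30 is an assumed floor, borrowed from the min_trades red flag.
    return (stats["n"] >= min_n and stats["ci_low"] > 0
            and stats["t"] >= 2 and p_adj < 0.05)
```

A raw p of 0.002 looks excellent on its own; as the best of 30 grid cells, `sidak_adjust(0.002, 30)` comes out at roughly 0.058 and the rung is red.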
Rung 2 — Is it BIG enough? (economics)
Calculation: net bps/trade × trades/year = gross ceiling, measured against ~5%/year unlevered as a rule of thumb. Cost hurdle: the ~9 bps round-trip cost is already deducted wherever net_bps is shown; otherwise subtract it yourself.
Died here (and specifically AFTER passing rung 1): OID@4h — p_adj 0.027, 6/6 folds, all real. But +10.85 bps × ~28 signals/year ≈ +0.3%/year. The validation sweep confirmed it brutally: 2/96 variants positive, best +0.10% over two years. "Robust" answers significance, not size.
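The ceiling calculation as a sketch. It assumes the full capital is deployed on every trade and ignores compounding; sizing conventions in the real tooling may differ. The function names and the `costs_already_deducted` flag are hypothetical; the 9 bps cost and the 5%/year hurdle are the rule-of-thumb values from above.

```python
def annual_ceiling_pct(net_bps_per_trade: float, trades_per_year: float) -> float:
    """Net bps per trade times trade frequency, expressed in %/year."""
    return net_bps_per_trade * trades_per_year / 100.0  # 100 bps = 1%


def rung2_big_enough(edge_bps, trades_per_year, cost_bps=9.0,
                     hurdle_pct=5.0, costs_already_deducted=False):
    """Return (ceiling in %/year, passed?) for a per-trade edge in bps."""
    # Subtract the round-trip cost unless the input is already net_bps.
    net = edge_bps if costs_already_deducted else edge_bps - cost_bps
    ceiling = annual_ceiling_pct(net, trades_per_year)
    return ceiling, ceiling >= hurdle_pct
```

With made-up inputs: a 20 bps gross edge at 100 trades/year leaves 11 bps net and clears the hurdle, while a 12 bps gross edge at 40 trades/year leaves 3 bps net and does not.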
Rung 3 — Plateau or lucky hit? (parameter robustness)
Where: Stage-2 heatmap, plateau_pct. Read: A real edge forms a contiguous positive area (≥ ~70% of the cells); overfitting shows up as an isolated spike next to red cells. If you only earn at exactly BB period 17 and std 1.9, you have found nothing.
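A sketch of how plateau and spike can be told apart on a two-parameter grid. It assumes plateau_pct is the share of positive cells; the contiguity measure (share of positive cells in the largest 4-connected region) is an illustrative choice, not necessarily what the Stage-2 heatmap computes.

```python
from collections import deque


def plateau_stats(grid):
    """grid[r][c] = edge of one parameter cell.
    Returns (plateau_pct, share of positive cells in the largest
    4-connected positive region)."""
    rows, cols = len(grid), len(grid[0])
    positive = {(r, c) for r in range(rows) for c in range(cols)
                if grid[r][c] > 0}
    plateau_pct = len(positive) / (rows * cols)

    seen, largest = set(), 0
    for start in positive:
        if start in seen:
            continue
        # Breadth-first flood fill over neighbouring positive cells.
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            r, c = queue.popleft()
            size += 1
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in positive and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)

    largest_share = largest / len(positive) if positive else 0.0
    return plateau_pct, largest_share
```

A grid passes in this sketch when plateau_pct is at least about 0.7 and nearly all positive cells sit in one region; a single green cell surrounded by red yields a plateau_pct close to zero.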
Rung 4 — Does it hold over time? (walk-forward)
Where: Indicator Lab folds_positive/folds_total; megasweep phase 3 (wf_avg, wf_min, worst window). Read: The distribution, never the mean. Three questions: How many windows are positive? How bad is the worst? And the most important: Does ONE window carry the entire return? If yes → one-window wonder, no matter how good the average looks.
Died here: the archetypes A/B — sfp pulled almost everything from BULL_2021H1, bb_extreme only from Y2021; the tight 0.5% fixed-stop Donchian won only in 2024 and lost in 2025. That is why min_window_wins=2 (profitable in two different regime years) is mandatory in every sweep today.
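The three questions can be read off the per-window returns directly. A minimal sketch: the function and key names are hypothetical, and "share carried by one window" is taken here as best window divided by the summed return, which is one possible definition.

```python
def window_report(window_returns: dict) -> dict:
    """window_returns maps a window label to its OOS return in %."""
    values = list(window_returns.values())
    total = sum(values)
    positive_frac = sum(v > 0 for v in values) / len(values)
    top_label, top_value = max(window_returns.items(), key=lambda kv: kv[1])
    # If the total is not positive there is nothing to share out.
    top_share = top_value / total if total > 0 else float("inf")
    return {
        "positive_frac": positive_frac,   # how many windows are positive?
        "wf_min": min(values),            # how bad is the worst?
        "top_window": top_label,          # does ONE window carry it all?
        "top_share": top_share,
        "passed": positive_frac >= 2 / 3 and total > 0 and top_share <= 0.5,
    }
```

With made-up numbers, `{"w1": 30.0, "w2": -2.0, "w3": 1.0, "w4": 1.0}` has three of four windows positive and a healthy mean, yet `top_share` is 1.0 and the report fails: a one-window wonder.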
Rung 5 — Did we fool ourselves in the selection? (PBO + reproduction)
Where: pbo.json per sweep. Read: PBO is the probability that the selection of the in-sample winner does not generalize. < 0.2 trustworthy, ~0.5 coin flip, > 0.5 actively counterproductive. Two traps: (a) PBO applies to the sweep (the selection), not to the individual config; (b) a low PBO means "honest", not "earns a lot" — a strategy that is honestly flat also has a nice PBO.
The tightening from the ema lesson: A family only counts as a candidate once it survives two independently configured sweeps. ema@1d had wf_avg 8.69 / prob_profit 0.974 in the first sweep — and 0.53 / 0.77 in the reproduction sweep with a larger exit search space (PBO 0.77). A sweep is one sample of the selection, not a proof. Donchian@4h is so far the only family that has passed this (PBO 0.33 → 0.164).
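For intuition about what a PBO number measures, here is a compact sketch of combinatorially symmetric cross-validation (CSCV), one common way to estimate it. Whether pbo.json is produced this way is an assumption; the sketch also ranks configs by mean return, where the published method typically uses a Sharpe-like metric.

```python
import math
from itertools import combinations
from statistics import mean


def pbo_cscv(returns, n_blocks=4):
    """returns[t][c] = return of config c in period t.
    n_blocks must be even; trailing rows that do not fill a block are dropped."""
    n_cfg = len(returns[0])
    block_len = len(returns) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    splits = list(combinations(range(n_blocks), n_blocks // 2))

    failures = 0
    for is_blocks in splits:
        is_rows = [t for b in is_blocks for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_blocks
                    for t in blocks[b]]
        perf_is = [mean(returns[t][c] for t in is_rows) for c in range(n_cfg)]
        perf_oos = [mean(returns[t][c] for t in oos_rows) for c in range(n_cfg)]
        winner = max(range(n_cfg), key=lambda c: perf_is[c])
        # Rank of the in-sample winner out of sample (1 = worst, N = best).
        rank = sum(perf_oos[c] <= perf_oos[winner] for c in range(n_cfg))
        omega = rank / (n_cfg + 1)
        logit = math.log(omega / (1 - omega))
        if logit <= 0:  # winner landed at or below the OOS median
            failures += 1
    return failures / len(splits)
```

The result is a property of the whole config matrix, which is why PBO belongs to the sweep and not to any single config.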
Rung 6 — How does the risk feel? (Monte Carlo)
Where: montecarlo.json / final_rankings mc. Only AFTER rungs 1-5 — MC is risk characterization, never validation (it resamples the existing trades; if those are overfit, the MC result is pretty and worthless). Read: prob_profit (≥ 0.9 strong), p5/p50/p95 of the return (p5 = "this is how bad a normal year can look"), DD distribution, risk_of_ruin (~0), max_safe_leverage — halve it live (window concentration makes MC optimistic).
Passed: donchian@4h P2_14239 — pp 0.933, p50 +8.7%, p5 −0.8%, RoR 0.
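Why MC cannot validate anything is easiest to see in code: it only reshuffles the trades it is handed. The sketch below is a plain bootstrap over per-trade returns, not the megasweep's implementation; the function name, the path count and the 50% drawdown used as the "ruin" threshold are assumptions.

```python
import random


def monte_carlo(trade_returns_pct, n_paths=2000, ruin_dd_pct=50.0, seed=0):
    """Resample the existing trades with replacement and compound them."""
    rng = random.Random(seed)
    k = len(trade_returns_pct)
    finals, ruined = [], 0
    for _ in range(n_paths):
        equity, peak, max_dd = 1.0, 1.0, 0.0
        for r in rng.choices(trade_returns_pct, k=k):
            equity *= 1 + r / 100
            peak = max(peak, equity)
            max_dd = max(max_dd, 1 - equity / peak)
        finals.append((equity - 1) * 100)
        ruined += max_dd * 100 >= ruin_dd_pct
    finals.sort()

    def pct(q):
        # Simple order-statistic percentile of the final returns.
        return finals[min(len(finals) - 1, int(q * len(finals)))]

    return {
        "prob_profit": sum(f > 0 for f in finals) / n_paths,
        "p5": pct(0.05), "p50": pct(0.50), "p95": pct(0.95),
        "risk_of_ruin": ruined / n_paths,
    }
```

Every number in the result is a function of `trade_returns_pct` alone. If those trades came out of an overfit selection, the percentiles describe the overfit, which is the reason this rung comes after rungs 1-5.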
Rung 7 — Special path for filters: the shuffle test
Filter claims ("X improves Y") need the shuffle test: a real re-backtest with a shuffled data source, z ≥ 1.5. Do not shuffle post-hoc on the finished trades — for re-routing strategies that is invalid. And the result is conditional, never absolute: bocpd helps bb_extreme (+17%) and hurts donchian (−15%); vrp helped donchian with a tight fixed stop (+37.7%) and lost its entire OOS contribution with an adaptive ATR exit. A validated filter is "validated for this entry with this exit" — nothing beyond that.
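The shape of a valid shuffle test, as a sketch: the filter's data source is shuffled and the full backtest is rerun each time, so entries and exits can re-route. `backtest` stands for whatever callable runs the real strategy; `toy_backtest`, the shuffle count and the return convention are assumptions for illustration only.

```python
import random
from statistics import mean, stdev


def shuffle_z(backtest, prices, filter_source, n_shuffles=200, seed=0):
    """z-score of the real result against re-backtests on a shuffled source."""
    rng = random.Random(seed)
    real = backtest(prices, filter_source)
    null = []
    for _ in range(n_shuffles):
        shuffled = list(filter_source)
        rng.shuffle(shuffled)                    # destroy the timing information
        null.append(backtest(prices, shuffled))  # full re-run, not post-hoc
    sd = stdev(null)
    return (real - mean(null)) / sd if sd > 0 else 0.0


def toy_backtest(bar_returns, gate):
    """Stand-in strategy: take a bar's return only when the gate is open."""
    return sum(r for r, g in zip(bar_returns, gate) if g)
```

A gate that is open exactly on the winning bars scores far above 1.5 here, while a gate that is always open scores 0, because shuffling it changes nothing.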
Red flags — distrust immediately, no matter what the numbers say
- In-sample score as an argument. The 5m hurst+macd trap: phase-2 score 112, MC prob_profit 0.0015.
- Single-window training. Faked a PBO of 0.043 on MS_0429; the counter-check tore it apart.
- min_trades < 30 / a winner with 3 lucky trades (+21.7% on 3 trades, MS_0601).
- Timeframe ≤ 15m. Reliably overfits (15m: 0/144 profitable).
- MC presented before the overfit gates.
- Smoothed models (hmmlearn predict_proba uses the future; use only causal forward filters).
- Retrospectively stamped data sources without a residual check (ETF flow: −70% of the signal was pre-event momentum).
- More than ~3 stacked filters (every knob enlarges the overfit surface; visible in the 0.66-PBO sweeps).
- "0 results" not counter-checked with a sanity sample.
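On the smoothed-models flag: hmmlearn's predict_proba returns posteriors from the forward-backward pass over the whole sequence, so the value at bar t depends on later bars. A causal alternative is the plain forward filter, sketched here in pure Python. The argument names mirror hmmlearn's startprob_ and transmat_ attributes; computing the per-bar emission likelihoods is left to the caller and is an assumption of this sketch.

```python
def forward_filter(startprob, transmat, emission_lik):
    """Causal state probabilities P(state_t | x_1..x_t).

    emission_lik[t][k] = p(x_t | state k); each row needs at least one
    non-zero entry, otherwise the normalisation divides by zero.
    """
    n_states = len(startprob)
    prior = list(startprob)
    filtered = []
    for lik in emission_lik:
        # Update: weight the predicted state distribution by the evidence.
        joint = [prior[k] * lik[k] for k in range(n_states)]
        norm = sum(joint)
        post = [j / norm for j in joint]
        filtered.append(post)
        # Predict: push the posterior one step through the transition matrix.
        prior = [sum(post[i] * transmat[i][k] for i in range(n_states))
                 for k in range(n_states)]
    return filtered
```

The output at bar t is unchanged by anything that happens after t, which is the property a tradable regime feature needs.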
Uniform verdict vocabulary (from now on, everywhere)
| Verdict | Condition | Botty examples |
|---|---|---|
| CONFIRMED / PROMOTED | all relevant rungs green; strategies additionally need the reproduction sweep from rung 5 | vol forecast (IC +0.84, 21/21), BOCPD, donchian@4h family |
| PURSUE | rung 1 green, rest open or partly redundant | hmm_regime (IC +0.39, redundant with rv_4h?) |
| INCONCLUSIVE | neither confirmed nor cleanly refuted | regime_clustering (spread 32 bps, t −0.68) |
| REFUTED / DEAD | one rung definitively red, with sufficient n | divergences (rung 1), OID-as-strategy (rung 2), regime_switch (rung 4) |
| DOWNGRADED | was green, flipped under reproduction/re-evaluation | ema@1d (rung 5), archetypes A/B (rung 4), ETF flow (residual) |
Where the numbers are
- Indicator Lab: /backtests → Indicator Lab, scorecard per indicator × TF (rungs 1-4 for signals)
- Megasweep: /backtests → archive, with final_rankings, pbo.json, montecarlo.json per sweep (rungs 4-6)
- ML experiments: /ml, reports with IC + window stability + verdict
- Syntheses: What Demonstrably Works in Botty — Evidence Ranking Across All Sweeps, Tests & ML Experiments (what remains); Megasweep PBO synthesis: the edge sits in the regime gate, not the entry trigger (the PBO landscape); What provably does NOT work in retail trading (and why) (external negative list)
The ladder at a glance
1 real? (CI > 0, p_adj < 0.05) → 2 big enough? (bps × frequency ≥ 5%/yr) → 3 plateau? (≥ 70%) → 4 time-stable? (window distribution, no one-window wonder) → 5 selection honest? (PBO < 0.2 and a 2nd sweep) → 6 risk ok? (MC: pp, p5, RoR) → 7 filter: shuffle z ≥ 1.5. At the first red rung: stop, write the verdict, move on.
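The stop-on-red rule as a sketch. The thresholds are the ones from the line above; the evidence keys and rung labels are hypothetical names, and rung 7 is left out because it applies to filter claims only.

```python
def walk_ladder(rungs, evidence):
    """Check rungs in order; return (first red rung or None, rungs passed)."""
    passed = []
    for name, is_green in rungs:
        if not is_green(evidence):
            return name, passed  # stop here, later rungs are never read
        passed.append(name)
    return None, passed


RUNGS = [
    ("1 significance", lambda e: e["ci_low_bps"] > 0 and e["p_adj"] < 0.05),
    ("2 economics", lambda e: e["net_bps"] * e["trades_per_year"] / 100 >= 5.0),
    ("3 plateau", lambda e: e["plateau_pct"] >= 0.7),
    ("4 time stability", lambda e: e["windows_positive_frac"] >= 2 / 3
                                   and e["top_window_share"] <= 0.5),
    ("5 selection", lambda e: e["pbo"] < 0.2 and e["sweeps_survived"] >= 2),
    ("6 monte carlo", lambda e: e["prob_profit"] >= 0.9),
]
```

Because evaluation stops at the first red rung, a candidate with made-up evidence such as `{"ci_low_bps": 2.0, "p_adj": 0.03, "net_bps": 10.0, "trades_per_year": 30}` is rejected at "2 economics" without anyone having to produce a heatmap, a PBO or an MC run for it.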