Evaluation Ladder: When Is Something Confirmed, When Refuted?

Botty internal — distilled from 22+ megasweeps, 45 ML experiments, the Indicator Lab, and the shuffle tests (April-June 2026). Every rule has a documented catch or failure as evidence.
Strategy anatomy · Evidence: very strong · cross-cutting · crypto
Relevance for Botty: 10/10
The reading order for every evaluation at Botty: 7 rungs from 'is the effect real?' (significance: CI, t, p_adj) through 'is it big enough?' (bps × frequency ≥ ~5%/year), parameter plateau, time stability (no one-window wonder), selection honesty (PBO < 0.2 + reproduction in 2 sweeps) to Monte Carlo (risk characterization, never validation) and the shuffle test for filters. Rule: stop at the first red rung. Every rung has a documented Botty catch as evidence: divergences died at rung 1, OID at rung 2, the archetypes at rung 4, ema@1d at rung 5. Plus: a uniform verdict vocabulary and the red-flags list.
A fixed reading order of 7 rungs: (1) Significance — is the effect real? (2) Economics — is it big enough? (3) Parameter plateau — or a lucky-hit spike? (4) Time stability — or a one-window wonder? (5) Selection honesty — PBO + reproduction in 2 independent sweeps. (6) Monte Carlo — characterize risk, NEVER validate. (7) Shuffle test for filters. Stop at the first red rung: everything below it is meaningless, no matter how green it looks.
Name | Typical value | Description
Rung 1: Significance | 95% CI above zero · t ≥ 2 · p_adj < 0.05 · ML: ≥80% of windows positive | Is the effect real or noise? ONLY the multiple-testing-corrected p-value (Šidák, p_adj) counts; the raw p is worthless after a parameter grid. Example catch: divergences (2,900+ cells, best p=0.31).
Rung 2: Economics (ceiling) | bps net/trade × trades/year ≥ ~5%/year unlevered · costs ~9 bps round-trip | Significant ≠ big. A real but tiny edge dies to slippage and stop noise. Example catch: OID@4h, with p_adj 0.027 but +10.85 bps × 28 signals/year ≈ 3.0%/year, below the hurdle → as a strategy, done_no_winners.
Rung 3: Parameter plateau | plateau_pct ≥ ~0.7, contiguous | Does a broad area of the parameter grid work, or just an isolated spike next to red cells? Spike = overfit, plateau = structure.
Rung 4: Time stability | ≥2/3 of OOS windows positive · check wf_min · NO single window carries >50% of the return | Read the distribution across the time folds, never the mean. Example catch: archetypes A/B (sfp harvested only BULL_2021H1, bb_extreme only Y2021). min_window_wins=2 in training is mandatory.
Rung 5: Selection honesty | PBO < 0.2 · the family survives ≥2 independently configured sweeps | PBO measures whether the SELECTION of the winner generalizes (0.5 = coin flip, >0.5 = counterproductive); it applies per sweep, not per config. Example catch: ema@1d, with wf_avg 8.69 in the first sweep and 0.53 in the reproduction sweep (PBO 0.77).
Rung 6: Monte Carlo (risk) | prob_profit ≥ 0.9 · RoR ~0 · know MC-p5 · halve max_safe_leverage | Run AFTER the gates, never as validation. MC answers 'how does the risk feel', not 'is the edge real'. EV-CI above zero = a distinguishable edge.
Rung 7: Shuffle test (filters only) | z ≥ 1.5 vs. shuffled data source | A real re-backtest with a shuffled source, NOT post-hoc on the trades (invalid for re-routing). Filter benefit is conditional: bocpd helps bb_extreme (+17%), hurts donchian (−15%); vrp loses its contribution with an adaptive exit.

Pros

  • A fixed order ends the 'but it looks green' debate: it stops at the first red rung
  • Every rung has a documented catch as proof of existence — the ladder is empirical, not academic
  • The verdict vocabulary becomes uniform: confirmed / pursue / inconclusive / refuted / downgraded have defined conditions

Cons

  • The ladder does not protect against regime change: 'confirmed' means 'survived the past in every respect', not 'will surely earn in future'
  • The reproduction criterion (rung 5) costs real compute time — two sweeps per family
  • Applied strictly, little remains (currently: 1 directional family); that is a feature, but it feels sparse

Directly wired: the Indicator Lab delivers rungs 1-4 as a scorecard, the megasweep funnel enforces rungs 4-5 (min_window_wins, PBO), montecarlo.py delivers rung 6, and the shuffle harness delivers rung 7. The ladder is the reading guide across all tools.

The problem

By now we have four evaluation tools, each with its own metrics: the Indicator Lab (edge_bps, t, p_adj, plateau, folds), the megasweep funnel (phase scores, walk-forward, PBO), Monte Carlo (prob_profit, RoR, percentiles), and the shuffle/IC harness in the ML module. Without a reading guide, everything looks equally "green" — and that is exactly how ema@1d and OID briefly slipped through as candidates. This article is the reading guide.

First: WHAT is being claimed? Four claim types, four tools

Claim | Example | Tool | Core metrics
"Signal predicts movement" | OID pattern → next bar | Indicator Lab | edge_bps, CI, p_adj, folds
"Strategy makes money" | donchian+gate+exit @4h | Megasweep funnel | wf windows, PBO, then MC
"Filter improves strategy" | vrp_filter on Donchian | Shuffle test | z ≥ 1.5 vs. shuffled source
"Model predicts size" | vol forecast | Walk-forward IC | IC + window stability

The most common reading error is to take evidence of one type as proof for another. OID is the object lesson: a confirmed signal (type 1) is far from a confirmed strategy (type 2).

The evaluation ladder — in this order, stop on red

Rung 1 — Is the effect real? (significance)

Where: Indicator Lab scorecard; ML reports. Read: Does the 95% confidence interval lie entirely above zero? (The single most important number — if the CI includes zero, everything else is decoration.) Then the t-stat (roughly ≥ 2) and p_adj (< 0.05) — only the Šidák-corrected p-value counts; whoever tried 30 grid cells drew 30 lottery tickets, and the raw p ignores that. For ML models: IC plus window stability (21/21 is an argument, 11/21 is not). And always check n — t=2 on 20 samples is an anecdote.

Died here: divergences (3 walk-forwards, 2,900+ cells, best p=0.31), RSI failure swings, directional prediction in general (best IC +0.02).
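
The rung-1 reading can be sketched in a few lines. This is a minimal illustration, not the Indicator Lab's implementation: the function names are hypothetical, the p-value uses a normal approximation (an assumption; a t-distribution or bootstrap is safer for small n), and the Šidák formula assumes the grid cells are independent tests.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def sidak_adjust(p_raw: float, n_tests: int) -> float:
    """Sidak correction: chance of seeing a p-value this small at least once in n_tests tries."""
    return 1.0 - (1.0 - p_raw) ** n_tests


def significance_check(edges_bps: list[float], n_tests: int) -> dict:
    """Rung-1 readout for per-signal edges (in bps) picked from a grid of n_tests cells."""
    n = len(edges_bps)
    m = mean(edges_bps)
    se = stdev(edges_bps) / sqrt(n)          # standard error of the mean
    t = m / se
    # Two-sided p-value under a normal approximation (assumption, see lead-in).
    p_raw = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    p_adj = sidak_adjust(p_raw, n_tests)
    ci_low, ci_high = m - 1.96 * se, m + 1.96 * se
    return {
        "n": n, "mean_bps": m, "t": t,
        "ci95": (ci_low, ci_high),
        "p_raw": p_raw, "p_adj": p_adj,
        "green": ci_low > 0 and t >= 2 and p_adj < 0.05,
    }
```

The lottery-ticket effect is visible directly: a raw p of 0.01 found among 30 grid cells becomes `sidak_adjust(0.01, 30)`, roughly 0.26, which is nowhere near 0.05.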

Rung 2 — Is it BIG enough? (economics)

Calculation: net bps per trade × trades per year = annual return ceiling, compared against ~5%/year unlevered as a rule of thumb. Cost hurdle: the ~9 bps round-trip cost is already deducted when net_bps is shown; otherwise subtract it yourself.

Died here (and specifically AFTER passing rung 1): OID@4h, with p_adj 0.027 and 6/6 folds positive, all real. But +10.85 bps × ~28 signals/year ≈ 304 bps ≈ +3.0%/year, below the ~5% hurdle. The validation sweep confirmed it brutally: 2/96 variants positive, best +0.10% over two years. "Robust" answers significance, not size.
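
The rung-2 arithmetic is simple enough to write down once so nobody does it in their head. The sketch below uses hypothetical function names and made-up example numbers; the only fixed facts taken from this article are the ~5%/year hurdle, the ~9 bps round-trip cost, and the conversion 100 bps = 1%.

```python
def annual_ceiling_pct(net_bps_per_trade: float, trades_per_year: float) -> float:
    """Unlevered annual return ceiling in percent (100 bps = 1%)."""
    return net_bps_per_trade * trades_per_year / 100.0


def passes_economics(edge_bps: float, trades_per_year: float,
                     hurdle_pct: float = 5.0, cost_bps: float = 0.0) -> bool:
    """Rung 2: set cost_bps=9.0 only if edge_bps is still gross of costs."""
    return annual_ceiling_pct(edge_bps - cost_bps, trades_per_year) >= hurdle_pct


# Illustrative, invented inputs:
print(annual_ceiling_pct(4.0, 60))                # 2.4  -> too small
print(passes_economics(20.0, 40))                 # True (8%/year, edge already net)
print(passes_economics(20.0, 40, cost_bps=9.0))   # False (11 bps x 40 = 4.4%/year)
```

The third call shows why the cost question matters: the same signal passes or fails depending on whether the 9 bps were already deducted.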

Rung 3 — Plateau or lucky hit? (parameter robustness)

Where: Stage-2 heatmap, plateau_pct. Read: A real edge forms a contiguous positive area (≥ ~70% of the cells); overfitting shows up as an isolated spike next to red cells. If you only earn at exactly BB period 17 and std 1.9, you have found nothing.
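
"Contiguous" is the part that a plain percentage misses, so a sketch of both numbers may help. It assumes a 2-D parameter grid of edge values and defines contiguity as 4-neighbour connectivity; both are assumptions for illustration, and the Lab's actual plateau_pct definition may differ.

```python
from collections import deque


def plateau_stats(grid: list[list[float]]) -> dict:
    """Share of positive cells, and share covered by the largest connected positive region."""
    rows, cols = len(grid), len(grid[0])
    positive = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c] > 0}
    seen: set = set()
    largest = 0
    for start in positive:
        if start in seen:
            continue
        # Breadth-first search over 4-neighbours to measure this region.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            r, c = queue.popleft()
            size += 1
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in positive and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        largest = max(largest, size)
    total = rows * cols
    return {"plateau_pct": len(positive) / total,
            "largest_region_pct": largest / total}


plateau = [[1, 2, 2, 1], [1, 3, 2, 1], [-1, 1, 1, -1]]
spike = [[-1, -1, -1], [-1, 9, -1], [-1, -1, -1]]
print(plateau_stats(plateau))  # 10 of 12 cells positive, all in one region
print(plateau_stats(spike))    # 1 of 9 cells positive: an isolated spike
```

If plateau_pct is high but largest_region_pct is low, the positive cells are scattered, which reads as noise rather than structure.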

Rung 4 — Does it hold over time? (walk-forward)

Where: Indicator Lab folds_positive/folds_total; megasweep phase 3 (wf_avg, wf_min, worst window). Read: The distribution, never the mean. Three questions: How many windows are positive? How bad is the worst? And the most important: Does ONE window carry the entire return? If yes → one-window wonder, no matter how good the average looks.

Died here: the archetypes A/B — sfp pulled almost everything from BULL_2021H1, bb_extreme only from Y2021; the tight 0.5% fixed-stop Donchian won only in 2024 and lost in 2025. That is why min_window_wins=2 (profitable in two different regime years) is mandatory in every sweep today.
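
The three questions translate into three numbers. The following is a sketch with a hypothetical function name; the thresholds (at least 2/3 of windows positive, no window above 50% of the return) come from the rung table, while measuring the "share" as the best window divided by the summed return is an assumption.

```python
def window_stability(window_returns: list[float],
                     min_positive_share: float = 2 / 3,
                     max_single_share: float = 0.5) -> dict:
    """Rung-4 readout over out-of-sample window returns (any consistent unit)."""
    n = len(window_returns)
    positive_share = sum(r > 0 for r in window_returns) / n
    total = sum(window_returns)
    # With a non-positive total there is no return to attribute: treat as fully concentrated.
    top_share = max(window_returns) / total if total > 0 else float("inf")
    return {
        "positive_share": positive_share,
        "wf_min": min(window_returns),
        "top_window_share": top_share,
        "green": positive_share >= min_positive_share and top_share <= max_single_share,
    }


steady = [3.0, 2.0, 4.0, -1.0, 2.0, 3.0]
one_window_wonder = [20.0, 0.5, -1.0, 0.5, -0.5, 0.5]
print(window_stability(steady)["green"])             # True
print(window_stability(one_window_wonder)["green"])  # False: one window carries everything
```

Note that the second series has the higher mean and still enough positive windows; only the concentration check catches it.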

Rung 5 — Did we fool ourselves in the selection? (PBO + reproduction)

Where: pbo.json per sweep. Read: PBO is the probability that the selection of the in-sample winner does not generalize. < 0.2 trustworthy, ~0.5 coin flip, > 0.5 actively counterproductive. Two traps: (a) PBO applies to the sweep (the selection), not to the individual config; (b) a low PBO means "honest", not "earns a lot" — a strategy that is honestly flat also has a nice PBO.

The tightening from the ema lesson: A family only counts as a candidate once it survives two independently configured sweeps. ema@1d had wf_avg 8.69 / prob_profit 0.974 in the first sweep — and 0.53 / 0.77 in the reproduction sweep with a larger exit search space (PBO 0.77). A sweep is one sample of the selection, not a proof. Donchian@4h is so far the only family that has passed this (PBO 0.33 → 0.164).
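
For readers who want to see what a PBO number measures, here is a compact sketch of combinatorially symmetric cross-validation (CSCV), the procedure from Bailey et al. that defines PBO. How pbo.json is actually produced is not described in this article, so treat this as an illustration only: the function name is hypothetical, and mean return is used as the performance metric (an assumption; Sharpe is the more common choice).

```python
from itertools import combinations
from math import log


def pbo_cscv(returns: list[list[float]], n_blocks: int = 4) -> float:
    """returns[t][c] = return of config c in period t. n_blocks must be even.

    Trailing rows that do not fill a whole block are ignored.
    """
    n_cfg = len(returns[0])
    size = len(returns) // n_blocks
    blocks = [range(i * size, (i + 1) * size) for i in range(n_blocks)]

    def perf(rows: list[int], c: int) -> float:
        return sum(returns[r][c] for r in rows) / len(rows)

    overfit = 0
    splits = 0
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [r for b in is_ids for r in blocks[b]]
        oos_rows = [r for b in range(n_blocks) if b not in is_ids for r in blocks[b]]
        is_perf = [perf(is_rows, c) for c in range(n_cfg)]
        oos_perf = [perf(oos_rows, c) for c in range(n_cfg)]
        best = max(range(n_cfg), key=is_perf.__getitem__)   # in-sample winner
        rank = sum(1 for v in oos_perf if v <= oos_perf[best])  # 1 = worst out of sample
        omega = rank / (n_cfg + 1)
        if log(omega / (1 - omega)) <= 0:   # winner lands at or below the OOS median
            overfit += 1
        splits += 1
    return overfit / splits
```

The output is one number per sweep, which matches trap (a): it describes how well "pick the in-sample winner" works across the whole set of configs, not the quality of any single config.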

Rung 6 — How does the risk feel? (Monte Carlo)

Where: montecarlo.json / final_rankings mc. Only AFTER rungs 1-5: MC is risk characterization, never validation (it resamples the existing trades; if those are overfit, the MC result is pretty and worthless). Read: prob_profit (≥ 0.9 strong), p5/p50/p95 of the return (p5 = "this is how bad a normal year can look"), DD distribution, risk_of_ruin (~0), and max_safe_leverage (halve it live, because window concentration makes MC optimistic).

Passed: donchian@4h P2_14239 — pp 0.933, p50 +8.7%, p5 −0.8%, RoR 0.
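
Why MC cannot validate anything becomes obvious from how little it does. The sketch below is not montecarlo.py; it is a minimal bootstrap with hypothetical names that resamples trade returns with replacement and sums them without compounding (both simplifications are assumptions).

```python
import random


def monte_carlo(trade_returns_pct: list[float], n_paths: int = 2000, seed: int = 0) -> dict:
    """Bootstrap the existing trades into n_paths alternative histories."""
    rng = random.Random(seed)
    n = len(trade_returns_pct)
    # Each path redraws n trades from the same pool, so the pool's bias carries over.
    totals = sorted(sum(rng.choices(trade_returns_pct, k=n)) for _ in range(n_paths))

    def percentile(q: float) -> float:
        return totals[int(q * (n_paths - 1))]

    return {
        "prob_profit": sum(t > 0 for t in totals) / n_paths,
        "p5": percentile(0.05),
        "p50": percentile(0.50),
        "p95": percentile(0.95),
    }


# An overfit winner with a few lucky trades still looks perfect here:
print(monte_carlo([7.0, 6.5, 8.2])["prob_profit"])  # 1.0
```

Every path is built from the same trades, so a trade list produced by overfitting yields a flattering distribution. That is the reason the gates come first.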

Rung 7 — Special path for filters: the shuffle test

Filter claims ("X improves Y") need the shuffle test: a real re-backtest with a shuffled data source, z ≥ 1.5. Do not shuffle post-hoc on the finished trades — for re-routing strategies that is invalid. And the result is conditional, never absolute: bocpd helps bb_extreme (+17%) and hurts donchian (−15%); vrp helped donchian with a tight fixed stop (+37.7%) and lost its entire OOS contribution with an adaptive ATR exit. A validated filter is "validated for this entry with this exit" — nothing beyond that.
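
The mechanics of the test can be sketched as follows. This is not the shuffle harness: `shuffle_z` and the toy backtest are hypothetical, and the toy filter is built from the returns themselves (pure lookahead) only so that the example has a clearly informative source. The key point is that `backtest` is re-run in full on every shuffled source instead of relabelling finished trades.

```python
import random
from statistics import mean, stdev


def shuffle_z(source: list, backtest, n_shuffles: int = 200, seed: int = 0) -> float:
    """z-score of the real backtest result against re-backtests on a shuffled source."""
    rng = random.Random(seed)
    real = backtest(source)
    shuffled_results = []
    for _ in range(n_shuffles):
        shuffled = source[:]
        rng.shuffle(shuffled)                         # destroys timing, keeps the value distribution
        shuffled_results.append(backtest(shuffled))   # full re-backtest, not post-hoc
    return (real - mean(shuffled_results)) / stdev(shuffled_results)


# Toy setup (illustration only): trade a bar when the filter value is 1.
bar_returns = [1.0, -1.0] * 20
toy_source = [1 if r > 0 else 0 for r in bar_returns]


def toy_backtest(src: list) -> float:
    return sum(r for r, s in zip(bar_returns, src) if s == 1)


z = shuffle_z(toy_source, toy_backtest)
print(z >= 1.5)  # True: the filter's timing carries the result
```

Because the filter is re-applied inside the backtest, the verdict only holds for that entry and exit combination, which is the conditionality described above.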

Red flags — distrust immediately, no matter what the numbers say

  • In-sample score as an argument. The 5m hurst+macd trap: phase-2 score 112, MC prob_profit 0.0015.
  • Single-window training. Faked a PBO of 0.043 on MS_0429; the counter-check tore it apart.
  • min_trades < 30 / a winner with 3 lucky trades (+21.7% on 3 trades, MS_0601).
  • Timeframe ≤ 15m. Reliably overfits (15m: 0/144 profitable).
  • MC presented before the overfit gates.
  • Smoothed models. hmmlearn's predict_proba returns smoothed state posteriors that use future observations; only causal forward filters are allowed.
  • Retrospectively stamped data sources without a residual check (ETF flow: −70% of the signal was pre-event momentum).
  • More than ~3 stacked filters (every knob enlarges the overfit surface; visible in the 0.66-PBO sweeps).
  • "0 results" not counter-checked with a sanity sample.
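
Some of these flags are mechanical enough to check before anyone looks at a result. A minimal sketch follows, assuming a hypothetical config dict with the keys `n_trades`, `timeframe`, `n_filters`, and `n_train_windows`; none of these key names come from the Botty codebase.

```python
def timeframe_minutes(tf: str) -> int:
    """Convert strings like '5m', '4h', '1d' to minutes."""
    units = {"m": 1, "h": 60, "d": 1440}
    return int(tf[:-1]) * units[tf[-1]]


def red_flags(cfg: dict) -> list[str]:
    """Return the mechanical red flags that apply to a result; an empty list means none."""
    flags = []
    if cfg["n_trades"] < 30:
        flags.append("fewer than 30 trades")
    if timeframe_minutes(cfg["timeframe"]) <= 15:
        flags.append("timeframe <= 15m")
    if cfg["n_filters"] > 3:
        flags.append("more than 3 stacked filters")
    if cfg["n_train_windows"] < 2:
        flags.append("single-window training")
    return flags


print(red_flags({"n_trades": 3, "timeframe": "5m", "n_filters": 4, "n_train_windows": 1}))
```

The remaining flags (in-sample scores used as arguments, MC shown before the gates, unchecked "0 results") are reading habits and cannot be automated this way.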

Uniform verdict vocabulary (from now on, everywhere)

Verdict | Condition | Botty examples
CONFIRMED / PROMOTED | all relevant rungs green; strategies additionally: reproduction (rung 5b) | vol forecast (IC +0.84, 21/21), BOCPD, donchian@4h family
PURSUE | rung 1 green, rest open or partly redundant | hmm_regime (IC +0.39, redundant with rv_4h?)
INCONCLUSIVE | neither confirmed nor cleanly refuted | regime_clustering (spread 32 bps, t −0.68)
REFUTED / DEAD | one rung definitively red, with sufficient n | divergences (rung 1), OID-as-strategy (rung 2), regime_switch (rung 4)
DOWNGRADED | was green, flipped under reproduction/re-evaluation | ema@1d (rung 5), archetypes A/B (rung 4), ETF flow (residual)

Where the numbers are

The ladder at a glance

1 real? (CI > 0, p_adj < 0.05) → 2 big enough? (bps × frequency ≥ ~5%/yr) → 3 plateau? (≥ ~70%) → 4 time-stable? (window distribution, no one-window wonder) → 5 selection honest? (PBO < 0.2 and a 2nd sweep) → 6 risk ok? (MC: pp, p5, RoR) → 7 filters only: shuffle z ≥ 1.5. At the first red rung: stop, write the verdict, move on.
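
The stop rule itself fits in a few lines. In this sketch (hypothetical names, placeholder checks) each rung is a callable, so rungs below the first red one are never evaluated at all, which is the point of the ordering.

```python
from typing import Callable, Optional


def walk_ladder(rungs: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Evaluate rungs in order; return the name of the first red rung, or None if all are green."""
    for name, check in rungs:
        if not check():
            return name   # stop here: nothing below this rung gets computed or read
    return None


evaluated = []


def rung(name: str, result: bool) -> tuple[str, Callable[[], bool]]:
    """Wrap a fixed result as a check and record when it is actually evaluated."""
    def check() -> bool:
        evaluated.append(name)
        return result
    return name, check


# Placeholder outcomes for illustration: green at rung 1, red at rung 2.
first_red = walk_ladder([
    rung("1 significance", True),
    rung("2 economics", False),
    rung("3 plateau", True),
])
print(first_red)   # 2 economics
print(evaluated)   # ['1 significance', '2 economics']
```

The third rung never ran, so its green result could not tempt anyone into reopening the case.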