- TIER 1 — Volatility is predictable (the project's hardest finding): vol clustering IC +0.74 across 21/21 walk-forward windows; GBM vol forecast 4h-IC +0.84 (live in ml/forecast/); Master LGBM IC +0.81 with R² +11pp over HAR-RV, lookahead-checked; confirmed cross-asset (ETH +0.74, SOL +0.78 — not a BTC artifact).
- TIER 1 — BOCPD changepoints (IC +0.16, 21/21 windows, +27% forward vol after a fresh structural break; the only regime model that is both promoted AND running live), VRP (IC −0.28, 16/16 windows), and vol targeting (Calmar uplift +1.14 across 3/3 strategies tested — not yet wired into execution/).
- TIER 2 — The edge lives in the regime gate, not the entry trigger: a meta-pattern across all four trustworthy sweeps (PBO <0.20). Raw entries (ema, donchian, macd, funding) are broadly overfit as stand-alones (PBO 0.5–0.9); they survive OOS only when gated.
- TIER 2 — Timeframe law: the edge lives on 30m–1d; anything ≤15m overfits reliably (15m: 0/144 Phase-1 runs profitable; 5m trap: in-sample score 112 at MC prob_profit 0.0015).
- TIER 2 — Exits: for breakouts the adaptive stop beats the tight fixed stop. Stop-robustness sweep (18,410 runs): the 0.5% fixed 'winner' was regime luck (it only won 2024); ATR trailing 1x/3x + partial TP survives at PBO 0.164, prob_profit 0.93, profitable even in the 2022 bear. Signal-based exits (ema/macd cross) consistently do harm.
- TIER 2 — Two filters that passed the shuffle test: vrp_filter on Donchian (z=+2.53, +37.7% PnL) and bocpd_filter on bb_extreme (z>1.5, +7–17%). BUT: filter value is conditional — BOCPD hurts Donchian (−15%), and with an adaptive ATR exit even vrp_filter loses its OOS contribution (the stop-robustness winner does without it). Filter value depends on the exit; it is not absolute.
- TIER 3 — Donchian@4h family (donchian 20 + adx_rising + ATR trailing 1x/3x + partial TP): PBO 0.164, MC prob_profit 0.933, p50 +8.7%, risk-of-ruin ~0, profitable in the 2022 bear. IMPORTANT: only on 4h — on 1d Donchian is dead in a head-to-head comparison (16/56 OOS windows, finalists prob_profit 0.13–0.50).
- FORMERLY TIER 3, DOWNGRADED 2026-06-11 — ema_crossover@1d family: looked strong in the daily-trend sweep (33/56 OOS windows, wf_avg 8.69, MC prob_profit 0.974), but the follow-up sweep MS_20260610_050521 (pinned filters, free exit matrix, 45,494 P2 runs, identical training) does not reproduce it: PBO 0.77, best finalist wf_avg 0.53, pp 0.77, MC p5 negative. Not a deploy candidate. That leaves exactly TWO directional candidates: donchian@4h and outside_inside_day@4h.
- FORMERLY TIER 3, DOWNGRADED AS A STRATEGY 2026-06-11 — outside_inside_day@4h: the signal edge remains the only significant Lab result (p_adj 0.027, 6/6 folds, +10.8 bps net), but the validation sweep shows it is not implementable as a strategy (96 exit/filter variants, median −1.18% across 2024+2025, best +0.10%, 0 Phase-2 qualifiers). The edge is real but tiny: ~+0.3%/year ceiling. Only remaining directional candidate: donchian@4h.
- TIER 4 — The methodology itself: multi-window training debunked archetypes A/B as one-window wonders (cross-check PBO 0.77), walk-forward killed 5/6 ML-overview findings, shuffle tests filtered out trade-count artifacts, min_trades≥30 eliminated 3-lucky-trade winners. Without this stack, money would repeatedly have been riding on ghosts.
- HONEST FOOTNOTES: ETF flow looked promotable (IC +0.37), but the residual test attributes about 70% of the signal to pre-event momentum. FOMC +50bps: CI lower bound only +8bps → watchlist. Directional forecasting is dead in EVERY form tested (best IC +0.02 with 47 features; divergences debunked 3× across 2,900+ cells). The live bb_extreme config belongs to the debunked archetype B (best MC prob_profit now only 0.61); it runs on weaker evidence than Tier 1–3.
The Bar
"Demonstrably reliable" here means: passed walk-forward across multiple windows, a shuffle test, or PBO — not "looked good in the backtest". This bar is deliberately brutal: of 22 megasweeps only 4 have a trustworthy PBO (<0.20), of 45 ML experiments ~8 survived as promoted, and the Indicator Lab found exactly one significant raw signal among 13 indicators × 6 timeframes. What follows below survived these filters.
Tier 1 — Confirmed by multiple walk-forwards: the volatility axis
| Finding | Metric | Status |
|---|---|---|
| Vol clustering | IC +0.74, 21/21 windows | confirmed, cross-asset (ETH +0.74, SOL +0.78) |
| Vol forecast (GBM) | 4h-IC +0.84, beats persistence & HAR-RV | live in ml/forecast/ |
| Master LGBM | IC +0.81, R² +11pp over HAR-RV | promoted, lookahead-checked (feature ablation) |
| BOCPD changepoints | IC +0.16, 21/21 windows, +27% forward vol | promoted + live (bocpd_live.py) |
| VRP | IC −0.28, 16/16 windows | promoted |
| Vol targeting | Calmar +1.14 across 3/3 strategies | promoted, not yet wired into execution/ |
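The IC values in the table can be read as rank correlations between a predictor and the realised outcome. The source does not say which correlation it uses, so the sketch below assumes a Spearman IC and a simple clustering predictor (trailing mean absolute return versus the next window's). The function names and the non-overlapping window layout are illustrative, not the project's code.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Plain Pearson correlation (raises if either input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_ic(pred, outcome):
    """Assumed IC definition: Pearson correlation of the ranks."""
    return pearson(ranks(pred), ranks(outcome))


def vol_clustering_ic(abs_returns, window):
    """IC between trailing mean |return| and the next window's mean |return|."""
    past, future = [], []
    for t in range(window, len(abs_returns) - window + 1, window):
        past.append(sum(abs_returns[t - window:t]) / window)
        future.append(sum(abs_returns[t:t + window]) / window)
    return spearman_ic(past, future)
```

Using non-overlapping windows keeps the predictor and the outcome from sharing bars, which is the simplest guard against lookahead in a check like this.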
In plain terms: BTC volatility comes in blocks — turbulent stays turbulent, calm stays calm. How violent tomorrow will be, we can predict well. Which direction it goes, we cannot. That is why everything that turns vol knowledge into money (position size, stop width, not trading through a structural break) is our most reliable track.
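Turning a vol forecast into position size is usually done by scaling exposure inversely to forecast volatility. The following is a minimal sketch of that idea, not the project's sizing code: `target_vol`, `max_leverage` and both function names are hypothetical, and the annualisation factor assumes 4h bars on a market that trades around the clock (6 × 365 = 2190 bars per year).

```python
import math

BARS_PER_YEAR_4H = 6 * 365  # assumption: 24/7 market, 4h bars


def realized_vol(returns, periods_per_year=BARS_PER_YEAR_4H):
    """Annualised realised volatility from per-bar log returns."""
    mean_sq = sum(r * r for r in returns) / len(returns)
    return math.sqrt(mean_sq * periods_per_year)


def vol_target_weight(forecast_vol, target_vol=0.40, max_leverage=1.0):
    """Exposure as a fraction of equity: smaller when forecast vol is high."""
    if forecast_vol <= 0:
        return 0.0  # no usable forecast: stay flat
    return min(max_leverage, target_vol / forecast_vol)
```

With a 40% target, a forecast of 80% annualised vol halves the position, while a forecast of 20% is capped at the leverage limit instead of doubling it.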
Tier 2 — Structural rules that hold consistently across many sweeps
- The edge lives in the regime gate, not the entry trigger. Every robust sweep winner carries a trend gate (min_adx≥25, ema200, Hurst) or is the regime. Raw entries have been broadly exposed as overfit stand-alones (PBO 0.5–0.9): the trigger is interchangeable, the gate does the work. (Full derivation: "Megasweep PBO synthesis: the edge sits in the regime gate, not the entry trigger"; usage consequence: "Detecting & predicting market regimes: ADX/DMI is only one lens among many".)
- Timeframe law: 30m–1d. Anything ≤15m overfits reliably. The 5m hurst+macd trap remains the cautionary tale: in-sample score 112, Monte Carlo prob_profit 0.0015.
- Exits: adaptive beats tight-fixed (for breakouts). The stop-robustness sweep isolated the exit question cleanly (entry+filter pinned, 48 exit variants, 18,410 Phase-2 runs): fixed stops dominated the raw leaderboard (486 of the top 500!), but only because they rode ONE window; all died on the two-window criterion. What survived: ATR trailing (initial 1×, trail 3×) + partial TP: PBO 0.164, prob_profit 0.93, profitable even in the 2022 bear. Signal exits (ema/macd cross) consistently do harm everywhere.
- Validated filters, but conditional. vrp_filter (shuffle-z +2.53 on Donchian, +37.7%) and bocpd_filter (z>1.5 on bb_extreme, +7–17%) are genuine signals. But: BOCPD hurts breakouts (−15%, it blocks exactly the changepoint bars on which they fire), and with an adaptive ATR exit even vrp loses its OOS contribution. Filter value depends on entry type AND exit; it is never absolute.
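A shuffle z-score for a filter asks whether the trades it keeps do better than a random selection of the same size, which separates real signal from trade-count effects. The source does not describe how its shuffle test is built, so this is a sketch under assumptions: the null is drawn by sampling random subsets of the unfiltered trades, and `filter_shuffle_z`, `keep_mask` and `n_shuffles` are illustrative names.

```python
import random
from statistics import mean, pstdev


def filter_shuffle_z(trade_pnls, keep_mask, n_shuffles=2000, seed=0):
    """z-score of the filtered PnL against random same-size trade subsets.

    trade_pnls: PnL of every unfiltered trade.
    keep_mask:  True where the filter would have allowed the trade.
    """
    rng = random.Random(seed)
    n_kept = sum(keep_mask)
    observed = sum(p for p, keep in zip(trade_pnls, keep_mask) if keep)
    # Null distribution: total PnL of random subsets with the same trade count.
    null = [sum(rng.sample(trade_pnls, n_kept)) for _ in range(n_shuffles)]
    sd = pstdev(null)
    return (observed - mean(null)) / sd if sd > 0 else 0.0
```

Because every null draw has the same number of trades as the filtered set, a filter that only looks good by trading less scores near zero here.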
Tier 3 — The three directional candidates (one remains after the 2026-06-11 downgrades)
| Family | Evidence | Caveat |
|---|---|---|
| Donchian@4h (dc 20 + adx_rising + ATR trail 1×/3× + partial TP) | PBO 0.164, MC prob_profit 0.933, p50 +8.7%, RoR ~0, profitable in the 2022 bear | ONLY 4h — dead on 1d (see below); ann. return small (~2.3% avg unleveraged) |
| ~~ema_crossover@1d~~ (bocpd+volume+fixed 2.5%) | looked strong: 33/56 OOS windows, wf_avg 8.69, MC pp 0.974 | DOWNGRADED 2026-06-11 — follow-up sweep does not reproduce (see below) |
| ~~outside_inside_day@4h~~ (Raschke) | signal edge established: p_adj 0.027, 6/6 folds, +10.8 bps net | DOWNGRADED AS A STRATEGY 2026-06-11 — validation sweep done_no_winners (see below) |
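The exit in the Donchian@4h row (ATR trail 1×/3× + partial TP) can be sketched as a bar-by-bar simulation for a long position. The details here are assumptions, since the source gives only the multipliers: the initial stop sits 1× ATR below entry, the trailing stop follows the highest high at 3× ATR and only ever ratchets up, ATR is frozen at entry, and the partial take-profit (half the position at +2× ATR) is a purely hypothetical choice.

```python
def simulate_long_exit(entry, atr, bars, init_mult=1.0, trail_mult=3.0,
                       tp_mult=2.0, tp_fraction=0.5):
    """Return PnL per unit (in price terms) for one long trade.

    bars: list of (high, low, close) tuples after the entry bar.
    """
    stop = entry - init_mult * atr          # initial protective stop
    tp_price = entry + tp_mult * atr        # hypothetical partial-TP level
    highest = entry
    remaining = 1.0
    tp_done = False
    pnl = 0.0
    last_close = entry
    for high, low, close in bars:
        last_close = close
        # Conservative ordering: the stop is checked before any upside.
        if low <= stop:
            return pnl + remaining * (stop - entry)
        if not tp_done and high >= tp_price:
            pnl += tp_fraction * (tp_price - entry)
            remaining -= tp_fraction
            tp_done = True
        highest = max(highest, high)
        # The trailing stop can only move up, never back down.
        stop = max(stop, highest - trail_mult * atr)
    # Still open when the data ends: mark the remainder to the last close.
    return pnl + remaining * (last_close - entry)
```

Under these assumptions the wide trail only takes over from the tight initial stop once price has moved more than 2× ATR in favour, which is one way a trade gets room to run without widening the initial risk.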
Case study Donchian: why "good" always needs a timeframe qualifier
Donchian is the best example that evidence is conditional — the same entry idea, two verdicts:
- On 4h, gated: one of our best-validated building blocks (numbers above; predecessor sweep PBO 0.33 with vrp as the driver).
- On 1d, in the direct three-entry comparison (daily-trend sweep, identical methodology): dead. 16/56 OOS windows positive (avg −0.17%), best finalists MC prob_profit 0.127–0.499, median MC return ≤0. ema_crossover won 1d by a mile (33/56, pp 0.974).
- Ungated, in the old broad sweeps: PBO 0.77 — overfit as a stand-alone.
Mechanistically plausible: on 1d there are simply too few channel breakouts per year, and the 20-day channel only fires once the daily trend has already run far — the 4h grid sees the same trend earlier and often enough to earn back the losses from false breakouts. Rule of thumb: "Does X work?" for us is always "does X work on which TF, with which gate, with which exit?"
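To make the "which gate" part of that rule of thumb concrete, here is a minimal sketch of a gated Donchian long entry. It assumes `adx_rising` means the ADX value is higher than on the previous bar and that the channel is built from the prior 20 highs, excluding the current bar; neither definition is confirmed by the source, and the ADX series is taken as a precomputed input.

```python
def donchian_long_signals(highs, closes, adx, lookback=20):
    """Mark bars where close breaks the prior channel high while ADX rises."""
    signals = [False] * len(closes)
    for t in range(lookback, len(closes)):
        upper = max(highs[t - lookback:t])   # channel excludes the current bar
        breakout = closes[t] > upper
        gate = adx[t] > adx[t - 1]           # assumed meaning of adx_rising
        signals[t] = breakout and gate
    return signals
```

Running the same function on 4h and 1d bars makes the frequency argument easy to inspect: the entry logic is identical, only the number of qualifying bars per year changes.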
Downgrade of the ema@1d family (2026-06-11) — the methodology catches the next ghost
DOWNGRADE 2026-06-11: The follow-up sweep MS_20260610_050521 (ema_crossover_edge_v1: filters pinned, full 48-exit matrix, 1d+4h, identical training 2024+2025, 45,494 P2 runs) does NOT reproduce the finding — PBO 0.77, best finalist wf_avg only 0.53 (previously 8.69), prob_profit 0.77 (previously 0.974), MC p5 negative; the 4h variants did not even reach the final round. The star finding was presumably a selection artifact of the smaller search space. The ema@1d family is therefore NO LONGER a deploy candidate — the same one-window/selection pattern that already debunked archetypes A/B. This is not a contradiction of the methodology but its confirmation: an enlarged search space + PBO exposes what a smaller search space made look robust. Consequence for the live wallets: only donchian@4h (gated, adaptive exit) is currently deploy-grade; outside_inside_day@4h is the next candidate but first needs its own validation sweep.
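PBO figures like the 0.77 above are commonly estimated with combinatorially symmetric cross-validation (CSCV), which the methodology stack names but does not spell out. The sketch below is a small stdlib version under assumptions: the input layout (one per-period return series per configuration), the Sharpe ratio as the performance measure and the default of four blocks are illustrative choices, not the project's settings.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def sharpe(returns):
    """Per-period Sharpe ratio; 0.0 when the series has no variance."""
    sd = pstdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_config, n_blocks=4):
    """Probability of backtest overfitting via CSCV (n_blocks must be even)."""
    n_configs = len(returns_by_config)
    block_len = len(returns_by_config[0]) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]

    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_ids for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_ids
                   for t in blocks[b]]
        is_perf = [sharpe([r[t] for t in is_idx]) for r in returns_by_config]
        oos_perf = [sharpe([r[t] for t in oos_idx]) for r in returns_by_config]
        best = max(range(n_configs), key=is_perf.__getitem__)
        # Relative OOS rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(p <= oos_perf[best] for p in oos_perf)
        omega = rank / (n_configs + 1)
        logits.append(math.log(omega / (1 - omega)))
    # PBO: share of splits where the IS winner ends at or below the OOS median.
    return sum(l <= 0 for l in logits) / len(logits)
```

The more configurations a sweep contains, the easier it is for the in-sample winner to be a fluke, which is consistent with the enlarged search space producing a high PBO where the smaller one looked robust.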
Downgrade of the OID strategy (2026-06-11) — significant is not the same as large
DOWNGRADED AS A STRATEGY 2026-06-11 (sweep MS_20260611_051553, oid_validation_v1): as a full strategy OID@4h already fails in Phase 1. Across 96 variants (full 48-exit matrix × {none, cooldown}): median −1.18% across 2024+2025, best variant +0.10% (pf 1.03), zero Phase-2 qualifiers, done_no_winners. The signal edge itself remains statistically valid but is economically too small: +10.85 bps net × ~28 signals/year = approx. +0.3%/year gross ceiling. LESSON: the Lab "robust" verdict answers significance, not size; before every deploy, compute bps-per-trade × frequency. That leaves exactly ONE directional candidate: donchian@4h.
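The lesson above (bps per trade × frequency) is simple enough to encode as a pre-deploy check. The sketch is illustrative only: `annual_edge_pct` and its `equity_fraction` parameter are hypothetical, and the source does not state a position size. Note the unit conversion: at full notional, 10.85 bps × 28 signals is about 304 bps (roughly 3.0%) per year, so the ~0.3%/year ceiling quoted above corresponds to about a tenth of equity per trade, which is an assumption of this sketch.

```python
def annual_edge_pct(net_bps_per_trade, trades_per_year, equity_fraction=1.0):
    """Gross annual edge in percent of equity.

    equity_fraction is a hypothetical share of equity committed per trade;
    100 bps equal 1 percent.
    """
    return net_bps_per_trade * trades_per_year * equity_fraction / 100.0


full_notional = annual_edge_pct(10.85, 28)        # about 3.04 (% per year)
tenth_of_equity = annual_edge_pct(10.85, 28, 0.1)  # about 0.30 (% per year)
```

Either way the check answers the question the significance test cannot: whether the edge, once multiplied out, is large enough to matter after costs and sizing.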
Tier 4 — The methodology itself is the most important validated building block
The stack (PBO/CSCV + multi-window walk-forward + min_trades≥30 + bear/crash in OOS + shuffle tests + residual checks) has repeatedly exposed convincing fakes before money was riding on them: archetypes A/B (one-window wonders, cross-check PBO 0.77), regime_switch (in-sample +10.9%, zero OOS survivors), 5/6 ML-overview findings, the ETF-flow signal (−70% momentum artifact in the residual test), the 3-lucky-trade winners of the min_trades-8 sweeps. Each of these catches would otherwise have become a live loss.
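Two of the stack's components, multi-window walk-forward and the min_trades≥30 floor, can be sketched in a few lines. The rolling scheme, the pass criterion (`min_positive_share`) and the choice to apply `min_trades` to the total across windows are assumptions for illustration; the source names the checks but not their exact rules.

```python
def walk_forward_windows(n_bars, train_len, test_len):
    """Yield (train_range, test_range) pairs, rolling forward by test_len."""
    start = 0
    while start + train_len + test_len <= n_bars:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len


def passes_bar(window_results, min_trades=30, min_positive_share=0.5):
    """window_results: list of (oos_return, n_trades), one per OOS window."""
    if not window_results:
        return False
    total_trades = sum(n for _, n in window_results)
    if total_trades < min_trades:
        return False  # too few trades: a lucky handful proves nothing
    positive = sum(1 for r, _ in window_results if r > 0)
    return positive / len(window_results) >= min_positive_share
```

The trade floor is checked first on purpose: a configuration with three lucky trades is rejected before its out-of-sample return is even looked at.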
Honest footnotes
- Directional forecasting is dead — in every form: best directional IC +0.02 (47 features), divergences debunked 3× across 2,900+ cells, RSI failure swings dead, regime direction noise. Reliable negative knowledge: resources spent there are wasted.
- FOMC (+50 bps, t=2.28) is plausible, but CI lower bound +8 bps → watchlist, not "reliable".
- The live bb_extreme config belongs to archetype B, which the clean cross-check classified as a one-window wonder (best MC pp 0.61) — it runs on weaker evidence than anything in Tier 1–3.
The whole picture in one sentence
In Botty, reliable performance comes almost exclusively from the risk side (predict vol → sizing, stops, not trading) plus structural discipline (gates, TF ≥30m, adaptive exits); on the directional side exactly one narrow, validated candidate remains (donchian@4h), alongside a methodology that keeps us from imagining more.