What Demonstrably Works in Botty — Evidence Ranking Across All Sweeps, Tests & ML Experiments

Strategy analysis 2026-06-11 6 sources
The positive counterpart to [[what_doesnt_work]], but drawn from Botty's OWN data: a consolidated ranking of what has demonstrably and reliably worked across 22+ megasweeps, 45 ML experiments, the Indicator Lab, and shuffle tests. The bar: walk-forward across multiple windows, a shuffle test, or PBO; in-sample winners do not count. Core picture: reliable performance comes almost exclusively from the RISK side (vol forecasting IC +0.74…+0.84, BOCPD, VRP, vol targeting) plus structural discipline (regime gates as the edge carriers, timeframes 30m–1d, adaptive exits). On the directional side there is exactly ONE narrow, validated candidate: the Donchian@4h family (gated, ATR trailing + partial TP, PBO 0.164). ema@1d was downgraded on 2026-06-11 after a failed reproduction sweep (PBO 0.77). outside_inside_day@4h, the only significant raw signal in the Indicator Lab, has a genuine signal edge but failed as a strategy in its validation sweep (edge ~+0.3%/year: significant but too small). Finally, the validation methodology itself is the most important 'edge': it has repeatedly exposed convincing fakes before money was riding on them.
  • TIER 1 — Volatility is predictable (the project's hardest finding): vol clustering IC +0.74 across 21/21 walk-forward windows; GBM vol forecast 4h-IC +0.84 (live in ml/forecast/); Master LGBM IC +0.81 with R² +11pp over HAR-RV, lookahead-checked; confirmed cross-asset (ETH +0.74, SOL +0.78 — not a BTC artifact).
  • TIER 1 — BOCPD changepoints (IC +0.16, 21/21 windows, +27% forward vol after a fresh structural break; the only regime model that is both promoted AND running live), VRP (IC −0.28, 16/16 windows), and vol targeting (Calmar uplift +1.14 across 3/3 strategies tested — not yet wired into execution/).
  • TIER 2 — The edge lives in the regime gate, not the entry trigger: a meta-pattern across all four trustworthy sweeps (PBO <0.20). Raw entries (ema, donchian, macd, funding) are broadly overfit as stand-alones (PBO 0.5–0.9); they survive OOS only when gated.
  • TIER 2 — Timeframe law: the edge lives on 30m–1d; anything ≤15m overfits reliably (15m: 0/144 Phase-1 runs profitable; 5m trap: in-sample score 112 at MC prob_profit 0.0015).
  • TIER 2 — Exits: for breakouts the adaptive stop beats the tight fixed stop. Stop-robustness sweep (18,410 runs): the 0.5% fixed 'winner' was regime luck (it only won 2024); ATR trailing 1x/3x + partial TP survives at PBO 0.164, prob_profit 0.93, profitable even in the 2022 bear. Signal-based exits (ema/macd cross) consistently do harm.
  • TIER 2 — Two filters that passed the shuffle test: vrp_filter on Donchian (z=+2.53, +37.7% PnL) and bocpd_filter on bb_extreme (z>1.5, +7–17%). BUT: filter value is conditional — BOCPD hurts Donchian (−15%), and with an adaptive ATR exit even vrp_filter loses its OOS contribution (the stop-robustness winner does without it). Filter value depends on the exit; it is not absolute.
  • TIER 3 — Donchian@4h family (donchian 20 + adx_rising + ATR trailing 1x/3x + partial TP): PBO 0.164, MC prob_profit 0.933, p50 +8.7%, risk-of-ruin ~0, profitable in the 2022 bear. IMPORTANT: only on 4h — on 1d Donchian is dead in a head-to-head comparison (16/56 OOS windows, finalists prob_profit 0.13–0.50).
  • FORMERLY TIER 3, DOWNGRADED 2026-06-11 — ema_crossover@1d family: looked strong in the daily-trend sweep (33/56 OOS windows, wf_avg 8.69, MC prob_profit 0.974), but the follow-up sweep MS_20260610_050521 (pinned filters, free exit matrix, 45,494 P2 runs, identical training) does not reproduce it: PBO 0.77, best finalist wf_avg 0.53, pp 0.77, MC p5 negative. Not a deploy candidate. At that point two directional candidates remained, donchian@4h and outside_inside_day@4h; the latter was downgraded as a strategy the same day (next bullet).
  • FORMERLY TIER 3, DOWNGRADED AS A STRATEGY 2026-06-11 — outside_inside_day@4h: the signal edge remains the only significant Lab result (p_adj 0.027, 6/6 folds, +10.8 bps net), but the validation sweep shows it is not implementable as a strategy (96 exit/filter variants, median −1.18% across 2024+2025, best +0.10%, 0 Phase-2 qualifiers). The edge is real but tiny: ~+0.3%/year ceiling. Only remaining directional candidate: donchian@4h.
  • TIER 4 — The methodology itself: multi-window training debunked archetypes A/B as one-window wonders (cross-check PBO 0.77), walk-forward killed 5/6 ML-overview findings, shuffle tests filtered out trade-count artifacts, min_trades≥30 eliminated 3-lucky-trade winners. Without this stack, money would repeatedly have been riding on ghosts.
  • HONEST FOOTNOTES: ETF flow looked promotable (IC +0.37), but the residual test cut the signal by 70%: most of it is pre-event momentum. FOMC +50bps: CI lower bound only +8bps → watchlist. Directional forecasting is dead in EVERY form tested (best IC +0.02 with 47 features; divergences debunked 3× across 2,900+ cells). The live bb_extreme config belongs to the debunked archetype B (best MC prob_profit now only 0.61), so it runs on weaker evidence than Tier 1–3.
P1 Wire vol-targeting + vol-forecast sizing into execution/
The strongest validated finding (Tier 1) is not yet monetized: vol_target_strategies showed Calmar +1.14 across 3/3 strategies, the GBM forecast (IC +0.84) runs live but is not used for sizing. This is the most direct path from 'proven' to 'earned'.
Implementation: fixed-fractional sizing from stop distance × scaling factor from predict_vol_4h (mind the calibration correction ×1.0442); leverage cap 3–5x. First benchmark against static sizing in the backtest.
Evidence: ml/experiments/vol_target_strategies (SHIP), vol_forecast (IC +0.84), vol_forecast_calibration (bias −4.2%).
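The sizing rule described under P1 can be sketched as follows. This is a minimal illustration, not the execution/ code: the function name, `risk_fraction`, and `target_vol` are hypothetical, and the only values taken from the text are the calibration factor 1.0442 and the 3–5x leverage cap (the lower end is used as the default). The forecast argument stands in for whatever `predict_vol_4h` returns.

```python
from dataclasses import dataclass

# Correction factor quoted in the text for the forecast's -4.2% bias.
CALIBRATION = 1.0442


@dataclass
class SizingResult:
    notional: float  # position size in quote currency
    leverage: float  # notional / equity, after the cap


def vol_target_size(
    equity: float,
    risk_fraction: float,       # hypothetical: share of equity lost if the stop is hit
    stop_distance: float,       # stop distance as a fraction of the entry price
    forecast_vol: float,        # raw vol forecast, same units as target_vol
    target_vol: float,          # hypothetical: vol level at which the scale equals 1.0
    max_leverage: float = 3.0,  # lower end of the 3-5x cap named in the text
) -> SizingResult:
    if min(equity, risk_fraction, stop_distance, forecast_vol, target_vol) <= 0:
        raise ValueError("all inputs must be positive")
    # Fixed-fractional base size: hitting the stop costs risk_fraction of equity.
    base_notional = equity * risk_fraction / stop_distance
    # Vol-target scale: shrink when the calibrated forecast exceeds the target.
    scale = target_vol / (forecast_vol * CALIBRATION)
    # Apply the leverage cap last so no input combination can exceed it.
    notional = min(base_notional * scale, equity * max_leverage)
    return SizingResult(notional=notional, leverage=notional / equity)
```

Benchmarking against static sizing, as the implementation note suggests, would then amount to running the same backtest once with `scale` fixed at 1.0 and once with the forecast-driven scale.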
P2 Treat the Donchian@4h family (ATR trail 1x/3x + partial TP) as the ONLY current deploy candidate; ema@1d dropped
After the downgrade of the ema@1d family (follow-up PBO 0.77), donchian@4h is the only directional building block with PBO <0.2, a positive MC distribution, AND bear robustness, and the only one whose family reproduced across TWO independent sweeps (MS_122822 PBO 0.33, MS_071508 PBO 0.164). Donchian ONLY on 4h (1d demonstrably dead), with an adaptive exit. outside_inside_day@4h, previously the next candidate, failed its own validation sweep (MS_20260611_051553) and is no longer in line.
Implementation: Record the candidate configs from MS_20260609_071508 (P2_14239) and MS_20260609_042646 (ema finalist); make the live decision only after the regime-gate bake-off that settles the gate question for the ema family.
Evidence: MS_20260609_071508: PBO 0.164, pp 0.933, profitable Y2022 (P2_14239). Counter-evidence ema: MS_20260610_050521 PBO 0.77, wf_avg 0.53 vs 8.69.
P3 Cement the methodology rules as hard sweep defaults
The rules that have demonstrably caught fakes must not depend on the discipline of individual sweep configs: min_trades≥30, ≥2 regime windows in training (min_window_wins=2), bear+crash in the held-out OOS, a shuffle test before every filter promote, a residual check for retrospectively stamped data sources.
Implementation: Harden the defaults in megasweep.create() (min_trades floor, warning on single-window training); a checklist already exists in product/memory/modules/backtesting.md.
Evidence: Every rule has at least one documented catch (MS_0429 artifact, MS_0601 lucky trades, ETF-flow residual, C7 shuffle).
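One way to picture "hard defaults" is a guard that every sweep config passes through before it runs. The sketch below is an assumption about shape only: `SweepConfig` and `harden` are hypothetical names, not the actual `megasweep.create()` signature. The floor of 30 trades and `min_window_wins=2` come from the rules above.

```python
import warnings
from dataclasses import dataclass, replace

MIN_TRADES_FLOOR = 30  # min_trades floor from the rules above
MIN_WINDOW_WINS = 2    # min_window_wins=2 from the rules above


@dataclass(frozen=True)
class SweepConfig:
    """Hypothetical stand-in for the arguments of a sweep definition."""
    name: str
    training_windows: tuple
    min_trades: int = MIN_TRADES_FLOOR
    min_window_wins: int = MIN_WINDOW_WINS


def harden(config: SweepConfig) -> SweepConfig:
    """Return a copy of the config with the methodology floors enforced."""
    if len(config.training_windows) < 2:
        # Single-window training is what produced the one-window wonders.
        warnings.warn(f"{config.name}: single-window training", UserWarning, stacklevel=2)
    if config.min_trades < MIN_TRADES_FLOOR:
        warnings.warn(
            f"{config.name}: min_trades raised from {config.min_trades} "
            f"to {MIN_TRADES_FLOOR}",
            UserWarning,
            stacklevel=2,
        )
        config = replace(config, min_trades=MIN_TRADES_FLOOR)
    if config.min_window_wins < MIN_WINDOW_WINS:
        config = replace(config, min_window_wins=MIN_WINDOW_WINS)
    return config
```

Raising the floor silently would hide the problem from whoever wrote the config, which is why the sketch warns as well as corrects.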

The Bar

"Demonstrably reliable" here means: passed walk-forward across multiple windows, a shuffle test, or PBO — not "looked good in the backtest". This bar is deliberately brutal: of 22 megasweeps only 4 have a trustworthy PBO (<0.20), of 45 ML experiments ~8 survived as promoted, and the Indicator Lab found exactly one significant raw signal among 13 indicators × 6 timeframes. What follows below survived these filters.

Tier 1 — Confirmed by multiple walk-forwards: the volatility axis

| Finding | Metric | Status |
|---|---|---|
| Vol clustering | IC +0.74, 21/21 windows | confirmed, cross-asset (ETH +0.74, SOL +0.78) |
| Vol forecast (GBM) | 4h-IC +0.84, beats persistence & HAR-RV | live in ml/forecast/ |
| Master LGBM | IC +0.81, R² +11pp over HAR-RV | promoted, lookahead-checked (feature ablation) |
| BOCPD changepoints | IC +0.16, 21/21 windows, +27% forward vol | promoted + live (bocpd_live.py) |
| VRP | IC −0.28, 16/16 windows | promoted |
| Vol targeting | Calmar +1.14 across 3/3 strategies | promoted, not yet wired into execution/ |

In plain terms: BTC volatility comes in blocks — turbulent stays turbulent, calm stays calm. How violent tomorrow will be, we can predict well. Which direction it goes, we cannot. That is why everything that turns vol knowledge into money (position size, stop width, not trading through a structural break) is our most reliable track.
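To make figures such as "IC +0.74, 21/21 windows" concrete, here is a minimal sketch of a per-window information coefficient. It assumes the IC is a Spearman rank correlation between forecast and realized values and that windows are non-overlapping; neither is stated in this note, and the function names are illustrative. `expected_sign` covers signals such as VRP, whose IC is consistently negative.

```python
from math import sqrt


def _ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_ic(forecast, realized):
    """Spearman rank correlation between two equally long sequences."""
    if len(forecast) != len(realized) or len(forecast) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rf, rr = _ranks(forecast), _ranks(realized)
    n = len(rf)
    mf, mr = sum(rf) / n, sum(rr) / n
    cov = sum((a - mf) * (b - mr) for a, b in zip(rf, rr))
    var_f = sum((a - mf) ** 2 for a in rf)
    var_r = sum((b - mr) ** 2 for b in rr)
    if var_f == 0 or var_r == 0:
        return 0.0  # a constant series carries no rank information
    return cov / sqrt(var_f * var_r)


def window_ics(forecast, realized, window, expected_sign=1):
    """IC per non-overlapping window, plus the count matching expected_sign."""
    ics = [
        rank_ic(forecast[s:s + window], realized[s:s + window])
        for s in range(0, len(forecast) - window + 1, window)
    ]
    hits = sum(1 for ic in ics if ic * expected_sign > 0)
    return ics, hits
```

A result of the form "21/21 windows" then corresponds to `hits == len(ics)`: the sign of the relationship held in every window, not just on average.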

Tier 2 — Structural rules that hold consistently across many sweeps

  1. The edge lives in the regime gate, not the entry trigger. Every robust sweep winner carries a trend gate (min_adx≥25, ema200, Hurst) or is itself the regime signal. Raw entries have been broadly exposed as overfit stand-alones (PBO 0.5–0.9): the trigger is interchangeable, the gate does the work. (Full derivation: "Megasweep PBO synthesis: the edge sits in the regime gate, not the entry trigger"; usage consequence: "Detecting & predicting market regimes: ADX/DMI is only one lens among many".)
  2. Timeframe law: 30m–1d. Anything ≤15m overfits reliably. The 5m hurst+macd trap remains the cautionary tale: in-sample score 112, Monte Carlo prob_profit 0.0015.
  3. Exits: adaptive beats tight-fixed (for breakouts). The stop-robustness sweep isolated the exit question cleanly (entry+filter pinned, 48 exit variants, 18,410 Phase-2 runs): fixed stops dominated the raw leaderboard (486 of the top 500!), but only because they rode ONE window — all died on the two-window criterion. What survived: ATR trailing (initial 1×, trail 3×) + partial TP: PBO 0.164, prob_profit 0.93, profitable even in the 2022 bear. Signal exits (ema/macd cross) consistently do harm everywhere.
  4. Validated filters — but conditional. vrp_filter (shuffle-z +2.53 on Donchian, +37.7%) and bocpd_filter (z>1.5 on bb_extreme, +7–17%) are genuine signals. But: BOCPD hurts breakouts (−15%, it blocks exactly the changepoint bars on which they fire), and with an adaptive ATR exit even vrp loses its OOS contribution. Filter value depends on entry type AND exit — it is never absolute.
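The shuffle-z values quoted for the filters can be illustrated with a simple permutation test. This is a sketch under a stated assumption: the null hypothesis here keeps the same number of trades as the real filter but picks them at random, which is one common way to separate a genuine signal from a trade-count artifact. Botty's actual shuffle procedure is not described in this note, and `shuffle_z` is a hypothetical name.

```python
import random
from statistics import mean, pstdev


def shuffle_z(trade_pnls, keep_mask, n_shuffles=2000, seed=0):
    """z-score of the filtered PnL against random filters of equal size.

    trade_pnls: list of per-trade PnL for the unfiltered strategy.
    keep_mask:  list of booleans, True where the real filter keeps the trade.
    """
    if len(trade_pnls) != len(keep_mask):
        raise ValueError("trade_pnls and keep_mask must have the same length")
    rng = random.Random(seed)
    kept = sum(1 for k in keep_mask if k)
    real = sum(p for p, k in zip(trade_pnls, keep_mask) if k)
    # Null distribution: same number of trades kept, chosen without regard
    # to the filter signal.
    null = [sum(rng.sample(trade_pnls, kept)) for _ in range(n_shuffles)]
    spread = pstdev(null)
    if spread == 0:
        return 0.0  # every random subset scores the same, so no evidence
    return (real - mean(null)) / spread
```

Because the random filters remove exactly as many trades as the real one, a filter that only helps by trading less scores near zero, while a filter that removes the right trades scores high.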

Tier 3 — Directional candidates: one validated, two downgraded

| Family | Evidence | Caveat |
|---|---|---|
| Donchian@4h (dc 20 + adx_rising + ATR trail 1×/3× + partial TP) | PBO 0.164, MC prob_profit 0.933, p50 +8.7%, RoR ~0, profitable in the 2022 bear | ONLY 4h: dead on 1d (see below); ann. return small (~2.3% avg unleveraged) |
| ~~ema_crossover@1d~~ (bocpd+volume+fixed 2.5%) | looked strong: 33/56 OOS windows, wf_avg 8.69, MC pp 0.974 | DOWNGRADED 2026-06-11: follow-up sweep does not reproduce (see below) |
| ~~outside_inside_day@4h~~ (Raschke) | signal edge established: p_adj 0.027, 6/6 folds, +10.8 bps net | DOWNGRADED AS A STRATEGY 2026-06-11: validation sweep done_no_winners (see below) |
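The exit named in the Donchian@4h row (ATR trail 1×/3× + partial TP) can be sketched for a long position as below. Only the 1× initial and 3× trailing multiples come from this note. The partial take-profit level (`tp_mult`), the fraction closed (`tp_fraction`), the use of the current ATR for the trail, and the stop-before-target ordering within a bar are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExitState:
    stop: float         # current stop price
    highest: float      # highest high seen since entry
    size: float         # remaining position as a fraction of the original
    took_partial: bool = False


def open_long(entry: float, atr: float, initial_mult: float = 1.0) -> ExitState:
    """Initial stop sits initial_mult ATRs below the entry price."""
    return ExitState(stop=entry - initial_mult * atr, highest=entry, size=1.0)


def update_long(state, entry, high, low, atr,
                trail_mult=3.0, tp_mult=2.0, tp_fraction=0.5):
    """Process one bar in place; return a list of (reason, price, fraction) fills."""
    fills = []
    # Assumption: if a bar touches both stop and target, the stop fills first.
    if low <= state.stop:
        fills.append(("stop", state.stop, state.size))
        state.size = 0.0
        return fills
    target = entry + tp_mult * atr
    if not state.took_partial and high >= target:
        part = state.size * tp_fraction
        fills.append(("partial_tp", target, part))
        state.size -= part
        state.took_partial = True
    # Ratchet: the stop only ever moves up, trailing the highest high.
    state.highest = max(state.highest, high)
    state.stop = max(state.stop, state.highest - trail_mult * atr)
    return fills
```

With a 1× initial stop and a 3× trail, the trailing level only overtakes the initial stop once price has moved more than two ATRs in favor, so early in the trade the tight initial stop governs and the wider trail takes over as the move develops.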

Case study Donchian: why "good" always needs a timeframe qualifier

Donchian is the best example that evidence is conditional — the same entry idea, two verdicts:

  • On 4h, gated: one of our best-validated building blocks (numbers above; predecessor sweep PBO 0.33 with vrp as the driver).
  • On 1d, in the direct three-entry comparison (daily-trend sweep, identical methodology): dead. 16/56 OOS windows positive (avg −0.17%), best finalists MC prob_profit 0.127–0.499, median MC return ≤0. ema_crossover won that 1d comparison by a mile (33/56, pp 0.974), a result that has since failed to reproduce (see the ema@1d downgrade below).
  • Ungated, in the old broad sweeps: PBO 0.77 — overfit as a stand-alone.

Mechanistically plausible: on 1d there are simply too few channel breakouts per year, and the 20-day channel only fires once the daily trend has already run far — the 4h grid sees the same trend earlier and often enough to earn back the losses from false breakouts. Rule of thumb: "Does X work?" for us is always "does X work on which TF, with which gate, with which exit?"

Downgrade of the ema@1d family (2026-06-11) — the methodology catches the next ghost

DOWNGRADE 2026-06-11: The follow-up sweep MS_20260610_050521 (ema_crossover_edge_v1: filters pinned, full 48-exit matrix, 1d+4h, identical training 2024+2025, 45,494 P2 runs) does NOT reproduce the finding: PBO 0.77, best finalist wf_avg only 0.53 (previously 8.69), prob_profit 0.77 (previously 0.974), MC p5 negative; the 4h variants did not even reach the final round. The star finding was presumably a selection artifact of the smaller search space. The ema@1d family is therefore NO LONGER a deploy candidate; it shows the same one-window/selection pattern that already debunked archetypes A/B. This does not contradict the methodology but confirms it: an enlarged search space plus PBO exposes what a smaller search space made look robust. Consequence for the live wallets: only donchian@4h (gated, adaptive exit) is currently deploy-grade. outside_inside_day@4h was the next candidate pending its own validation sweep, which it has since failed (see the OID downgrade below).

Downgrade of the OID strategy (2026-06-11) — significant is not the same as large

DOWNGRADED AS A STRATEGY 2026-06-11 (sweep MS_20260611_051553, oid_validation_v1): as a full strategy OID@4h already fails in Phase 1: 96 variants (full 48-exit matrix × {none, cooldown}), median −1.18% across 2024+2025, best variant +0.10% (pf 1.03), zero Phase-2 qualifiers, done_no_winners. The signal edge itself remains statistically valid but is economically too small: +10.85 bps net × ~28 signals/year = approx. +0.3%/year gross ceiling. LESSON: the Lab "robust" verdict answers significance, not size; before every deploy, compute bps-per-trade × frequency. That leaves exactly ONE directional candidate: donchian@4h.

Tier 4 — The methodology itself is the most important validated building block

The stack (PBO/CSCV + multi-window walk-forward + min_trades≥30 + bear/crash in OOS + shuffle tests + residual checks) has repeatedly exposed convincing fakes before money was riding on them: archetypes A/B (one-window wonders, cross-check PBO 0.77), regime_switch (in-sample +10.9%, zero OOS survivors), 5/6 ML-overview findings, the ETF-flow signal (−70% momentum artifact in the residual test), the 3-lucky-trade winners of the min_trades-8 sweeps. Each of these catches would otherwise have become a live loss.
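For readers unfamiliar with the PBO figure that anchors most verdicts on this page, the following is a minimal sketch of the CSCV procedure (combinatorially symmetric cross-validation). It is a simplification, not Botty's implementation: it scores strategies by mean return rather than a risk-adjusted metric, drops trailing periods that do not fill a block, and the function name is illustrative.

```python
from itertools import combinations
from statistics import mean


def pbo_cscv(returns, n_blocks=8):
    """Probability of backtest overfitting for a set of strategy variants.

    returns[s] is the per-period return series of variant s; all series
    must have the same length.
    """
    n_strats = len(returns)
    if n_strats < 2 or n_blocks < 2 or n_blocks % 2:
        raise ValueError("need >= 2 variants and an even number of blocks")
    n_periods = len(returns[0])
    if any(len(r) != n_periods for r in returns) or n_periods < n_blocks:
        raise ValueError("series must be equally long and cover every block")
    size = n_periods // n_blocks  # trailing remainder periods are dropped
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]

    below_median = 0
    total = 0
    # Every way of choosing half the blocks as in-sample; the rest is OOS.
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_idx = [t for b in is_blocks for t in blocks[b]]
        oos_idx = [t for b in range(n_blocks) if b not in is_blocks
                   for t in blocks[b]]
        is_perf = [mean(r[t] for t in is_idx) for r in returns]
        oos_perf = [mean(r[t] for t in oos_idx) for r in returns]
        best = max(range(n_strats), key=is_perf.__getitem__)
        # Relative OOS rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])
        omega = rank / (n_strats + 1)
        if omega <= 0.5:
            below_median += 1
        total += 1
    return below_median / total
```

A PBO of 0.77 therefore reads: in 77% of the in-sample/out-of-sample splits, the variant that looked best in-sample landed at or below the median out-of-sample. That is why a large search space combined with PBO exposes winners that a single leaderboard makes look robust.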

Honest footnotes

  • Directional forecasting is dead — in every form: best directional IC +0.02 (47 features), divergences debunked 3× across 2,900+ cells, RSI failure swings dead, regime direction noise. Reliable negative knowledge: resources spent there are wasted.
  • FOMC (+50 bps, t=2.28) is plausible, but CI lower bound +8 bps → watchlist, not "reliable".
  • The live bb_extreme config belongs to archetype B, which the clean cross-check classified as a one-window wonder (best MC pp 0.61) — it runs on weaker evidence than anything in Tier 1–3.

The whole picture in one sentence

In Botty, reliable performance comes almost exclusively from the risk side (predict vol → sizing, stops, not trading) plus structural discipline (gates, TF ≥30m, adaptive exits); on the directional side there exists exactly one narrow, validated candidate (donchian@4h), and a methodology that keeps us from imagining more.