ML — Pattern Discovery

Inverted workflow: find conditional edges in BTC data first, build strategies second.
55 experiments

Master-LightGBM — kitchen-sink 4h vol forecast

Promoted · 2026-05-20 · tags: synthesis, lightgbm, vol-forecast, all-features

2026-05-20 · status: promoted · 39.1s

Hypothesis: A LightGBM regressor on a unified causal feature panel (lagged returns, multi-window RV, funding, BOCPD p_short, HMM p_state1, VRP, stablecoin Δ7d, ETF flow, DXY z-score, time) beats HAR-RV on OOS log-vol R² at 4h by ≥ 3 pp.

Verdict: PROMOTE — LightGBM lifts 4h vol-forecast R² by +10.96 pp vs HAR-RV baseline (0.5541 → 0.6636). Top features by importance: log_rv_7d_ann, iv_ann, hour_cos. Replace HAR-RV with LGBM in ml/forecast/ after a clean production run.

Key metrics

| metric | value |
| --- | --- |
| pooled_R2_persistence | +0.4061 |
| pooled_R2_HAR_RV | +0.5541 |
| pooled_R2_LGBM_all | +0.6636 |
| pooled_R2_LGBM_new | +0.5009 |
| pooled_IC_HAR_RV | +0.7258 |
| pooled_IC_LGBM_all | +0.8142 |
| lift_pp_lgbm_vs_har | +10.9555 |
| n_windows | 13 |
| n_features_full | 27 |
| n_features_new | 7 |
| top_3_features | ['log_rv_7d_ann', 'iv_ann', 'hour_cos'] |

Approach

Build a unified 1h causal panel from these feature groups:

  • Lagged returns (1/4/24/168h)

  • Multi-window realized vol (1h/4h/1d/7d, annualised)

  • Funding (rate, z-score, cum-1d)

  • BOCPD p_short (from 15m, causal forward filter)

  • HMM p_state1 (from 1h, re-fit per walk-forward split)

  • VRP (DVOL annualised IV − trailing 4h RV)

  • Stablecoin Δ7d (1d shift)

  • ETF flow (1d shift)

  • DXY 4h z-score
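A minimal sketch of how the price-derived part of such a panel could be built, assuming an hourly close series with a `DatetimeIndex`. The function name `build_panel`, the RV formula (root mean squared log return, annualised with 24 × 365 bars), and the choice of windows are assumptions for illustration, not the experiment's actual code. It skips the 1h RV window, which would need sub-hourly bars, and all non-price features (funding, BOCPD, HMM, VRP, stablecoin, ETF flow, DXY).

```python
import numpy as np
import pandas as pd

BARS_PER_YEAR = 24 * 365  # assumption: hourly bars, market open around the clock


def build_panel(close: pd.Series) -> pd.DataFrame:
    """Build a small causal feature panel from hourly closes.

    Every feature at bar t uses data up to and including bar t only.
    """
    log_close = np.log(close)
    log_ret = log_close.diff()
    panel = pd.DataFrame(index=close.index)

    # Lagged log returns over 1h, 4h, 24h and 168h look-backs.
    for h, name in [(1, "ret_1h"), (4, "ret_4h"), (24, "ret_24h"), (168, "ret_7d")]:
        panel[name] = log_close.diff(h)

    # Trailing realized vol over 4h, 1d and 7d, annualised, in logs.
    for h, name in [(4, "log_rv_4h_ann"), (24, "log_rv_1d_ann"), (168, "log_rv_7d_ann")]:
        rv = np.sqrt((log_ret ** 2).rolling(h).mean() * BARS_PER_YEAR)
        panel[name] = np.log(rv)

    # Cyclical hour-of-day encoding.
    hour = close.index.hour.to_numpy()
    panel["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    panel["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    return panel
```

Because only backward-looking `diff` and `rolling` operations are used, changing a future close leaves all earlier rows unchanged, which is the property "causal" refers to here.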

Walk-forward evaluation: 12-month train / 3-month test windows, embargo = 1440 min (1 day), starting 2022-01. Target: log of forward 4h annualised RV. Models: Persistence, HAR-RV (baseline), LightGBM-all (kitchen sink, native NaN handling), LightGBM-new-only (new features only).
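The split scheme can be sketched as a generator over a time index. This is a sketch under stated assumptions: the name `walk_forward_splits` is hypothetical, the training window is taken to be rolling (not expanding), and the embargo is applied by dropping the last 1440 minutes of each training window. The source does not say which side the embargo trims.

```python
import pandas as pd


def walk_forward_splits(index, start, train_months=12, test_months=3,
                        embargo=pd.Timedelta(minutes=1440)):
    """Yield (train_index, test_index) pairs of rolling walk-forward windows.

    Assumption: the embargo removes the tail of the training window, so
    forward-looking targets from late training rows cannot reach into the
    test window.
    """
    train_start = pd.Timestamp(start)
    while True:
        test_start = train_start + pd.DateOffset(months=train_months)
        test_end = test_start + pd.DateOffset(months=test_months)
        if test_end > index[-1]:
            break  # not enough data left for a full test window
        train = index[(index >= train_start) & (index < test_start - embargo)]
        test = index[(index >= test_start) & (index < test_end)]
        yield train, test
        train_start = train_start + pd.DateOffset(months=test_months)


# Hourly index; the 2022-01-02 start date is an assumption inferred from the
# first reported test window beginning on 2023-01-02.
idx = pd.date_range("2022-01-02", "2026-04-02", freq="h")
splits = list(walk_forward_splits(idx, "2022-01-02"))
```

With these assumed dates the generator produces 13 windows whose test sizes (2160, 2184, 2208, ...) sum to 28,464 rows, which matches the reported `n_windows` and `n_oos`.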

Pooled OOS metrics

| model | R2_log | IC_spearman | n_oos |
| --- | --- | --- | --- |
| persistence | 0.4061 | 0.6751 | 28,464 |
| har_rv | 0.5541 | 0.7258 | 28,464 |
| lgbm_all | 0.6636 | 0.8142 | 28,464 |
| lgbm_new | 0.5009 | 0.6860 | 28,464 |
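For reference, the two metric columns can be computed as below. This is a sketch that assumes "pooled" means the out-of-sample predictions from all windows are concatenated before scoring, and that R² is measured against the out-of-sample mean of the log-vol target. The function names are hypothetical.

```python
import numpy as np
import pandas as pd


def r2_log(y_true, y_pred):
    """R² on the log-vol target: 1 - SSE / SST, with SST around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - sse / sst


def ic_spearman(y_true, y_pred):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    ranks_true = pd.Series(np.asarray(y_true, dtype=float)).rank()
    ranks_pred = pd.Series(np.asarray(y_pred, dtype=float)).rank()
    return ranks_true.corr(ranks_pred)
```

Under this definition R² goes negative whenever a model does worse than predicting the window mean, which is how a value such as the persistence score of -0.0239 in one window can arise.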

Lift over HAR-RV baseline

  • LightGBM-all R² lift: +10.96 pp (0.5541 → 0.6636)

  • LightGBM-new-only R²: 0.5009, about 5.3 pp below HAR-RV (0.5541), so the new features alone do not beat the baseline
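The lift arithmetic and the hypothesis threshold (≥ 3 pp) can be written out as a small check. The function names and the "REJECT" label are hypothetical; only "PROMOTE" appears in the report.

```python
def lift_pp(r2_model: float, r2_baseline: float) -> float:
    """Difference in R² expressed in percentage points."""
    return 100.0 * (r2_model - r2_baseline)


def verdict(r2_model: float, r2_baseline: float, threshold_pp: float = 3.0) -> str:
    """Return 'PROMOTE' when the lift meets the threshold, else a placeholder label."""
    return "PROMOTE" if lift_pp(r2_model, r2_baseline) >= threshold_pp else "REJECT"


lift_all = lift_pp(0.6636, 0.5541)   # LightGBM-all vs HAR-RV, rounded inputs
lift_new = lift_pp(0.5009, 0.5541)   # LightGBM-new-only vs HAR-RV, rounded inputs
```

With the rounded R² values the LightGBM-all lift comes out at 10.95 pp. The reported 10.9555 pp was presumably computed from unrounded scores.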

Per-window R² (13 windows)

| window | n | pers_r2 | har_r2 | lgbm_all_r2 | lgbm_new_r2 |
| --- | --- | --- | --- | --- | --- |
| 2023-01-02 → 2023-04-02 | 2160 | 0.3337 | 0.5331 | 0.6242 | 0.3803 |
| 2023-04-02 → 2023-07-02 | 2184 | -0.0239 | 0.2852 | 0.3932 | 0.1875 |
| 2023-07-02 → 2023-10-02 | 2208 | 0.3143 | 0.4665 | 0.5786 | 0.4058 |
| 2023-10-02 → 2024-01-02 | 2208 | 0.1651 | 0.4091 | 0.5383 | 0.2894 |
| 2024-01-02 → 2024-04-02 | 2184 | 0.3579 | 0.5275 | 0.6297 | 0.4504 |
| 2024-04-02 → 2024-07-02 | 2184 | 0.4061 | 0.5500 | 0.6379 | 0.5249 |
| 2024-07-02 → 2024-10-02 | 2208 | 0.2171 | 0.4311 | 0.5577 | 0.3789 |
| 2024-10-02 → 2025-01-02 | 2208 | 0.3484 | 0.4802 | 0.6892 | 0.4954 |
| 2025-01-02 → 2025-04-02 | 2160 | 0.4690 | 0.5895 | 0.7042 | 0.5417 |
| 2025-04-02 → 2025-07-02 | 2184 | 0.3325 | 0.5167 | 0.6393 | 0.4447 |
| 2025-07-02 → 2025-10-02 | 2208 | 0.2737 | 0.4577 | 0.6344 | 0.4500 |
| 2025-10-02 → 2026-01-02 | 2208 | 0.3028 | 0.4820 | 0.6550 | 0.4616 |
| 2026-01-02 → 2026-04-02 | 2160 | 0.3788 | 0.5446 | 0.7147 | 0.5373 |
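To see how consistent the lift is across windows, the per-window values above can be compared directly. The numbers are copied from the table; the variable names are illustrative.

```python
import pandas as pd

# Per-window R² copied from the table above, indexed by test-window start date.
windows = pd.DataFrame(
    {
        "har_r2": [0.5331, 0.2852, 0.4665, 0.4091, 0.5275, 0.5500, 0.4311,
                   0.4802, 0.5895, 0.5167, 0.4577, 0.4820, 0.5446],
        "lgbm_all_r2": [0.6242, 0.3932, 0.5786, 0.5383, 0.6297, 0.6379, 0.5577,
                        0.6892, 0.7042, 0.6393, 0.6344, 0.6550, 0.7147],
        "lgbm_new_r2": [0.3803, 0.1875, 0.4058, 0.2894, 0.4504, 0.5249, 0.3789,
                        0.4954, 0.5417, 0.4447, 0.4500, 0.4616, 0.5373],
    },
    index=["2023-01-02", "2023-04-02", "2023-07-02", "2023-10-02", "2024-01-02",
           "2024-04-02", "2024-07-02", "2024-10-02", "2025-01-02", "2025-04-02",
           "2025-07-02", "2025-10-02", "2026-01-02"],
)

# Lift over HAR-RV in percentage points, per window.
lift_all = 100 * (windows["lgbm_all_r2"] - windows["har_r2"])
lift_new = 100 * (windows["lgbm_new_r2"] - windows["har_r2"])

n_all_over_3pp = int((lift_all >= 3).sum())  # windows where LGBM-all clears 3 pp
n_new_wins = int((lift_new > 0).sum())       # windows where new-only beats HAR-RV
```

LightGBM-all clears the 3 pp threshold in all 13 windows, with the smallest lift (8.79 pp) in the window starting 2024-04-02 and the largest (20.90 pp) in the window starting 2024-10-02. LightGBM-new-only beats HAR-RV in only one window, the one starting 2024-10-02.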

Feature importance (LightGBM-all, mean gain across folds)

| feature | mean_gain | max_gain |
| --- | --- | --- |
| log_rv_7d_ann | 418.8 | 543 |
| iv_ann | 362.4 | 423 |
| hour_cos | 355.5 | 400 |
| log_rv_1d_ann | 317.2 | 416 |
| dow | 317.2 | 351 |
| ret_7d | 314.6 | 378 |
| stablecoin_d7 | 314.5 | 367 |
| ret_24h | 268.8 | 305 |
| hour_sin | 249.7 | 286 |
| funding_cum_1d | 198.0 | 306 |
| cvd_divergence_4h | 193.2 | 258 |
| funding_z_30d | 169.6 | 280 |
| log_rv_1h_ann | 163.3 | 201 |
| n_whale_trades_4h | 155.2 | 190 |
| ret_4h | 153.1 | 204 |
| etf_flow | 152.5 | 339 |
| vrp | 145.3 | 252 |
| hmm_p_state1 | 141.4 | 236 |
| range_4h | 138.5 | 190 |
| n_large_trades_4h | 136.2 | 174 |
| bocpd_p_short | 118.0 | 143 |
| log_rv_4h_ann | 109.7 | 138 |
| ret_1h | 99.5 | 116 |
| vol_z_1d | 97.1 | 133 |
| log_vol | 89.7 | 139 |
| dxy_z_4h | 88.9 | 172 |
| funding_rate | 73.2 | 139 |
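The `mean_gain` and `max_gain` columns can be read as an aggregation of per-fold importances. Below is a minimal sketch, assuming one importance vector per walk-forward fold (for example from LightGBM's `Booster.feature_importance(importance_type="gain")`). The function name, the feature names, and the toy numbers are made up for illustration and are not the experiment's values.

```python
import pandas as pd


def summarize_importance(per_fold: pd.DataFrame) -> pd.DataFrame:
    """Aggregate importances across folds.

    per_fold: rows are folds, columns are features, values are importances.
    Returns one row per feature, sorted by mean importance, highest first.
    """
    summary = pd.DataFrame(
        {
            "mean_gain": per_fold.mean(axis=0),
            "max_gain": per_fold.max(axis=0),
        }
    )
    return summary.sort_values("mean_gain", ascending=False)


# Toy input: three folds, three hypothetical features.
per_fold = pd.DataFrame(
    {
        "feat_c": [1.0, 2.0, 3.0],
        "feat_a": [10.0, 20.0, 30.0],
        "feat_b": [5.0, 40.0, 3.0],
    }
)
summary = summarize_importance(per_fold)
```

A large gap between `max_gain` and `mean_gain`, as with `feat_b` in the toy data, marks a feature that mattered in one fold but not consistently. In the table above `etf_flow` (152.5 mean, 339 max) shows that pattern.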

[Figure: per-window R²]

[Figure: feature importance]