Botty · Out-of-Sample Validation

Out-of-Sample (OOS) Validation

Evaluating a strategy on held-out data that was not visible during building and optimization. In-sample performance is systematically too optimistic because of overfitting; only OOS performance indicates whether an edge generalizes. For time series, the split must preserve time ordering and include an embargo against label leakage, and the test set must not be consulted repeatedly ("peeking").

What this is about

In-sample (IS) is the data you use to build and optimize your strategy. Out-of-sample (OOS) is held-back data that you touch only afterwards and only for evaluation. The only honest performance estimator is the one on OOS data.

Why in-sample lies

If you optimize parameters on one period, you will always find some that look good there — even if the strategy has no real edge. That is overfitting: the parameters explain the noise of the past, not the structure of the market. IS performance is therefore systematically too optimistic. OOS measures whether any of it survives in unseen data.
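The effect can be reproduced on data that contains no edge at all. The following is a minimal sketch, assuming a toy momentum rule and Gaussian noise as "returns"; the names `momentum_returns` and `fit_and_evaluate` are hypothetical and not part of Botty. The lookback is chosen on the in-sample part only and then scored once on the held-out part.

```python
import random
from statistics import mean

def momentum_returns(returns, lookback):
    """Per-bar returns of a toy rule: long/short by the sign of the trailing sum."""
    out = []
    for t in range(lookback, len(returns)):
        signal = sum(returns[t - lookback:t])      # uses past bars only
        position = 1.0 if signal > 0 else -1.0
        out.append(position * returns[t])
    return out

def fit_and_evaluate(returns, split, lookbacks):
    """Pick the best lookback in-sample, then score it once out-of-sample."""
    in_sample, out_of_sample = returns[:split], returns[split:]
    is_scores = {lb: mean(momentum_returns(in_sample, lb)) for lb in lookbacks}
    best = max(is_scores, key=is_scores.get)
    oos_score = mean(momentum_returns(out_of_sample, best))
    return best, is_scores[best], oos_score

# Pure noise: by construction there is no edge to find (illustrative data).
rng = random.Random(0)
noise = [rng.gauss(0.0, 0.01) for _ in range(2000)]

best_lb, is_mean, oos_mean = fit_and_evaluate(noise, split=1400, lookbacks=range(1, 41))
print(f"best lookback={best_lb}  IS mean={is_mean:+.5f}  OOS mean={oos_mean:+.5f}")
```

The in-sample score is the maximum over 40 candidates, so it is biased upward by selection alone; the OOS score of the same lookback carries no such bias.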

Forms of OOS validation

Method	Idea
Holdout	One-off train/test split (e.g. 70/30). Simple, but only one test window.
Walk-Forward	Rolling OOS windows — tests across many market phases.
k-fold CV	Data split into k blocks, each one serving as test once. Use with caution for time series, because most folds train on data from after the test block.
Purged / embargoed CV	Time-series CV that drops training samples within a safety margin around each train/test boundary.
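To make the last two rows concrete, here is a rough sketch of a k-fold generator with contiguous, unshuffled test blocks and a margin removed on both sides of each block. The function name `purged_kfold` and the `embargo` parameter (counted in samples) are assumptions for illustration, not Botty's generators. Note that the middle folds still train on data from after the test block, which is exactly why k-fold needs caution for time series.

```python
def purged_kfold(n_samples, k=5, embargo=0):
    """Yield (train_idx, test_idx) for contiguous, unshuffled test blocks.

    `embargo` samples are dropped from training on both sides of the test
    block so that overlapping labels cannot straddle the boundary.
    """
    if k < 2 or k > n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples)
                     if i < start - embargo or i >= stop + embargo]
        yield train_idx, test_idx
        start = stop

for train_idx, test_idx in purged_kfold(n_samples=20, k=4, embargo=2):
    print(f"test {test_idx[0]:>2}-{test_idx[-1]:>2}  train size {len(train_idx)}")
```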

Time-series specifics (critical for trading)

No random shuffling. Respect the ordering, otherwise the model sees the future (leakage).
Embargo / purging. If labels look into the future (e.g. forward return over 24 h), there must be a gap >= label horizon between train end and test start, otherwise train and test information overlap.
Lookahead and survivorship bias must be ruled out beforehand — OOS does not repair them.
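The embargo rule can be checked mechanically. A minimal sketch, assuming hourly bars and a 24 h forward-return label; `purge_training_rows` is a hypothetical helper, not a Botty function. A training row is kept only if its label window closes at or before the test start.

```python
from datetime import datetime, timedelta

def purge_training_rows(timestamps, test_start, label_horizon):
    """Keep only training timestamps whose label window ends at or before test_start."""
    return [t for t in timestamps if t + label_horizon <= test_start]

# Illustrative data: 72 hourly bars starting 2024-01-01 00:00.
hourly = [datetime(2024, 1, 1) + timedelta(hours=h) for h in range(72)]
test_start = datetime(2024, 1, 3)      # test window begins here
horizon = timedelta(hours=24)          # label = forward return over 24 h

train = purge_training_rows(hourly, test_start, horizon)
print("last usable training bar:", train[-1])
print("gap to test start:", test_start - train[-1])
```

Every bar between the last usable training bar and the test start is discarded: its label would be computed from prices that already belong to the test window.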

The deadly sin: peeking

As soon as you adjust the strategy until the OOS result looks good, OOS effectively becomes in-sample. Every look at the test set "consumes" it. Measuring 100 variants on the OOS set and picking the best is data snooping: the winner's apparent significance can be produced by chance alone (see White's Reality Check and the Monte Carlo permutation test in the Monte Carlo simulation research).
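The selection effect is easy to simulate. The sketch below assumes 100 "variants" whose returns are independent Gaussian noise with zero mean, so none has an edge; `t_stat` and `best_of_n` are hypothetical names used only here. It reports the largest t-statistic among them, which is the number you would be looking at after picking the best variant.

```python
import random
from statistics import mean, stdev

def t_stat(returns):
    """t-statistic of the mean return against zero."""
    return mean(returns) / (stdev(returns) / len(returns) ** 0.5)

def best_of_n(n_variants, n_bars, rng):
    """Simulate zero-edge variants and return (best t-stat, all t-stats)."""
    variants = [[rng.gauss(0.0, 0.01) for _ in range(n_bars)]
                for _ in range(n_variants)]
    scores = [t_stat(v) for v in variants]
    return max(scores), scores

rng = random.Random(42)
best, scores = best_of_n(n_variants=100, n_bars=250, rng=rng)
print(f"best t-stat among {len(scores)} zero-edge variants: {best:.2f}")
```

With 100 independent tries, the maximum of the t-statistics is typically around 2 or higher even though every variant is noise, which is why a single-test significance threshold does not apply to the selected winner.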

How Botty does it

ml/splits.py provides walk_forward_splits(train_months=12, test_months=3, embargo_minutes=1440) and purged-CV generators. The core invariant: train ends at least the embargo (1440 minutes = 24 h in the call shown) before test start, so forward-looking labels cannot leak.
ml/CLAUDE.md rule: "Walk-forward from day one" and "No leakage" — every claimed edge must hold across multiple test windows.
Mega-sweep Phase 3 (backtesting/megasweep.py) evaluates only the test windows, never the train windows.
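For orientation, the window arithmetic behind such a walk-forward split might look roughly like this. This is not the code in ml/splits.py: `walk_forward_windows` and `add_months` are hypothetical names, only the three parameter names and defaults are taken from the call quoted above, and the sketch assumes a rolling (fixed-length) training window that steps forward by one test window at a time and starts on the first of a month.

```python
from datetime import datetime, timedelta

def add_months(ts, months):
    """Shift a first-of-month timestamp by whole months."""
    index = ts.year * 12 + (ts.month - 1) + months
    return ts.replace(year=index // 12, month=index % 12 + 1)

def walk_forward_windows(start, end, train_months=12, test_months=3,
                         embargo_minutes=1440):
    """Yield (train_start, train_end, test_start, test_end), end-exclusive."""
    if start.day != 1:
        raise ValueError("this sketch assumes start is the first of a month")
    embargo = timedelta(minutes=embargo_minutes)
    train_start = start
    while True:
        test_start = add_months(train_start, train_months)
        test_end = add_months(test_start, test_months)
        if test_end > end:
            break
        train_end = test_start - embargo      # invariant: gap >= embargo
        yield train_start, train_end, test_start, test_end
        train_start = add_months(train_start, test_months)

windows = list(walk_forward_windows(datetime(2022, 1, 1), datetime(2024, 1, 1)))
for tr_s, tr_e, te_s, te_e in windows:
    print(f"train {tr_s:%Y-%m-%d}..{tr_e:%Y-%m-%d}  test {te_s:%Y-%m-%d}..{te_e:%Y-%m-%d}")
```

Only the test windows would be scored; the training windows serve for fitting and are never reported as performance.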

Trade-offs

✅ The only honest generalization estimator.
✅ With walk-forward, automatically covers multiple regimes.

❌ Reduces the data available for training (the test portion "costs").
❌ Worthless if you look at the test set repeatedly (peeking).
❌ Difficult with very rare trades: too few signals per test window.