Entry
- Supervised ML: features (technical indicators, order book, funding, sentiment) -> label (sign of future return)
- Train/val/test on a walk-forward basis
- Generate the signal from the model probability or output
- RL: state = feature vector, action = {long, short, flat}, reward = PnL
Exit
- Supervised: inverse signal or probability threshold
- RL: learns the exit endogenously as part of the policy
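The supervised entry and exit rules above can be sketched as a single mapping from model probability to target position. This is a minimal sketch, not a recommended configuration: the function name `next_position` and the thresholds `enter=0.60` and `exit_=0.52` are assumptions for illustration. The exit threshold sits below the entry threshold, so a position is held until conviction fades rather than flipping on every small probability change.

```python
def next_position(prob_up: float, current: int,
                  enter: float = 0.60, exit_: float = 0.52) -> int:
    """Return the target position: +1 long, -1 short, 0 flat.

    prob_up is the model's estimated probability of a positive future return.
    The thresholds are illustrative assumptions, not tuned values.
    """
    if not 0.0 <= prob_up <= 1.0:
        raise ValueError("prob_up must be in [0, 1]")
    prob_down = 1.0 - prob_up

    # Entry or reversal ("inverse signal"): requires strong conviction.
    if prob_up >= enter:
        return 1
    if prob_down >= enter:
        return -1

    # Exit by probability threshold: keep an open position only while
    # conviction in its direction stays above the lower exit threshold.
    if current == 1 and prob_up >= exit_:
        return 1
    if current == -1 and prob_down >= exit_:
        return -1
    return 0
```

For example, `next_position(0.55, current=1)` keeps an existing long, while `next_position(0.55, current=0)` stays flat because 0.55 is not enough to open a new position.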
| Name | Typical value | Description |
|---|---|---|
| feature_lookback | 100-1000 bars | Lookback window for feature computation |
| retraining_frequency | weekly/monthly | Retrain regularly to counter model drift |
| walkforward_window | 20-30% out-of-sample | Share of each walk-forward window held out for robust testing |
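A walk-forward split with these parameters might look like the sketch below. The function `walk_forward_splits` and its arguments are hypothetical names. The `embargo` gap between train and test is an added assumption: it is meant to keep labels that look `h` bars ahead from overlapping the test block.

```python
from typing import Iterator, Tuple


def walk_forward_splits(n_bars: int, train_size: int, test_size: int,
                        embargo: int = 0) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) index ranges that roll forward through time.

    Each test block starts strictly after its training block, so the
    split never violates time ordering.
    """
    if train_size <= 0 or test_size <= 0 or embargo < 0:
        raise ValueError("sizes must be positive and embargo non-negative")
    start = 0
    while start + train_size + embargo + test_size <= n_bars:
        train_end = start + train_size
        test_start = train_end + embargo
        yield range(start, train_end), range(test_start, test_start + test_size)
        start += test_size  # roll forward by one test block


# 750 train bars + 250 test bars = 25% out-of-sample per window
for train, test in walk_forward_splits(2000, 750, 250, embargo=24):
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

The model would be refit on every `train` range and scored only on the matching `test` range. The reported performance is then the concatenation of the test blocks.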
Pros
- Can capture complex, nonlinear patterns
- Adaptive when retrained cleanly
- Scales with feature availability (alt-data such as on-chain, sentiment)
- Active research field with plenty of tool support (Freqtrade, QuantConnect, RLlib)
Cons
- Extreme overfitting risk - the large majority of impressive-looking backtests do not hold up in live trading
- Non-stationary markets break models quietly and quickly
- RL agents are especially fragile - a small change in the training setup (e.g. random seed or hyperparameters) can produce a completely different policy
- Execution realism in backtests is usually poor
- Interpretability is essentially zero - hard to debug why it won/lost
Variants
Supervised learning
Features -> label -> model:
- Features: indicator values, order-book imbalance, funding rate, on-chain metrics, sentiment scores
- Label: future return over horizon h - e.g. sign(close[t+24] / close[t] - 1) for a 24h forecast on hourly bars
- Model: logistic regression, random forest, gradient boosting (XGBoost), LSTM
Signal: model output -> long / short / flat
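The label definition above can be sketched in a few lines. The function name `direction_labels` is an assumption. The last `horizon` bars get `None` because their future close is not yet known; such rows would have to be dropped before training rather than filled.

```python
from typing import List, Optional


def direction_labels(close: List[float], horizon: int) -> List[Optional[int]]:
    """Label bar t with sign(close[t + horizon] / close[t] - 1).

    Returns +1, -1 or 0 per bar, and None where the future is unknown.
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    labels: List[Optional[int]] = []
    for t in range(len(close)):
        if t + horizon >= len(close):
            labels.append(None)  # no future price available yet
            continue
        ret = close[t + horizon] / close[t] - 1.0
        labels.append((ret > 0) - (ret < 0))  # sign of the return as an int
    return labels


print(direction_labels([100.0, 101.0, 99.0, 99.0, 102.0], horizon=2))
# [-1, -1, 1, None, None]
```

Features for bar `t` must only use data available at or before `t`. The label is the only place where future prices are allowed to appear.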
Reinforcement learning
State -> action -> reward:
- State: feature vector (as above)
- Action: discrete {long, short, flat} or continuous (position size)
- Reward: realized PnL per step
- Model: PPO, DQN, A2C
The agent learns a policy: given a state, which action maximizes long-term cumulative reward.
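To make the state/action/reward loop concrete, here is a toy environment in plain Python. It is a sketch under stated assumptions, not a Gym or RLlib environment: the class `TradingEnv`, the two-element observation, and the `fee_rate` cost model are all hypothetical. The reward is the one-bar return of the held position minus a cost proportional to the change in position.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Action id -> target position (assumed encoding): short, flat, long
ACTIONS = {0: -1, 1: 0, 2: 1}


@dataclass
class TradingEnv:
    close: List[float]
    fee_rate: float = 0.001  # assumed cost per unit of position change
    t: int = 0
    position: int = 0

    def step(self, action: int) -> Tuple[Tuple[float, int], float, bool]:
        """Apply the action at bar t and hold the position until bar t+1."""
        if self.t >= len(self.close) - 1:
            raise IndexError("episode finished")
        target = ACTIONS[action]
        cost = self.fee_rate * abs(target - self.position)
        self.position = target
        ret = self.close[self.t + 1] / self.close[self.t] - 1.0
        reward = self.position * ret - cost  # per-step PnL net of costs
        self.t += 1
        done = self.t >= len(self.close) - 1
        observation = (ret, self.position)  # toy state: last return, position
        return observation, reward, done


env = TradingEnv(close=[100.0, 110.0, 99.0])
print(env.step(2))  # go long before a +10% bar
print(env.step(0))  # flip short before a -10% bar
```

A PPO, DQN, or A2C agent would repeatedly call `step` and adjust its policy to maximize the sum of these rewards. Charging the cost inside the reward is one way to discourage the agent from flipping position on every bar.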
The backtest-to-live gap
The most dangerous problem in ML trading: papers and backtests show 50-200% returns, but the same strategy loses money live. The usual causes:
- Lookahead bias: features unintentionally incorporate future information (e.g. a feature built from a bar's close price is used for a trade assumed to fill at that same bar's open)
- Data leakage: the train/test split violates time ordering
- Overfitting: the model learns noise patterns of history, not structural regularities
- Execution naivety: the backtest assumes a fill exactly at the close price. Reality: slippage, partial fills, fees
- Non-stationarity: crypto market structure in 2020 is not that of 2026; training data is stale
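Two of these points, lookahead bias and execution naivety, can be reduced with simple backtest conventions, sketched below. The helper names `fills_next_open` and `trade_return` and the default `fee_rate` and `slippage` values are assumptions for illustration. Partial fills are not modeled.

```python
from typing import List, Tuple


def fills_next_open(signals: List[int], opens: List[float]) -> List[Tuple[int, float]]:
    """Pair each non-zero signal computed on bar t with the open of bar t+1.

    The signal from the final bar is dropped: its fill price is not known yet.
    """
    return [(s, opens[t + 1]) for t, s in enumerate(signals[:-1]) if s != 0]


def trade_return(entry_px: float, exit_px: float, side: int,
                 fee_rate: float = 0.0005, slippage: float = 0.0005) -> float:
    """Net return of one round trip; side is +1 for long, -1 for short."""
    if side not in (1, -1):
        raise ValueError("side must be +1 or -1")
    # Slippage moves both fills against the trader.
    entry_fill = entry_px * (1 + side * slippage)
    exit_fill = exit_px * (1 - side * slippage)
    gross = side * (exit_fill / entry_fill - 1.0)
    return gross - 2 * fee_rate  # fee charged on entry and on exit


naive = trade_return(100.0, 101.0, 1, fee_rate=0.0, slippage=0.0)
realistic = trade_return(100.0, 101.0, 1)
print(f"naive {naive:.4%} vs. with costs {realistic:.4%}")
```

With these assumed costs, a 1% gross move keeps roughly 0.8% after fees and slippage, and a trade that exits at its entry price is a loss. Strategies with many small trades are the most sensitive to this difference.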
What actually works
Hybrid approaches:
- ML as a feature for human strategies (e.g. a regime classifier that says 'trending vs. ranging')
- ML as a filter on existing signals (which setups are high-probability?)
- ML for optimizing the parameters of a mechanical strategy
There are no broadly documented examples of pure end-to-end ML/RL bots that are consistently profitable. Firms such as Jane Street, Citadel, and Two Sigma use ML, but as a building block within tightly controlled pipelines with risk management around them, not as an autonomous 'black-box trader'.
Warning from the literature
The ScienceDirect / arXiv papers on deep-RL trading almost all share a similar pattern:
- Impressive in-sample results
- Weak out-of-sample performance (answered with a proposed 'model improvement')
- No live-trading follow-up studies
- When live tests exist, they cover short periods in a single regime
Research on ML trading bots has a reproducibility problem.
Relevance for Botty
Botty's ml/ module is not implemented. A sensible path:
- Stage 1 (pragmatic): a feature-based filter on existing strategies. Example: a random forest trained on historical EMA-crossover signals labeled win/loss, with features (ADX, RSI, vol regime, funding) -> trade only when the predicted win probability is > 0.55.
- Stage 2 (ambitious): a regime classifier that distinguishes between trend/range/transition and switches active strategies accordingly.
- Stage 3 (research): deep RL on a multi-asset portfolio. Only with significant dev effort and realistic expectation management.