Machine Learning & Rolling Training Guide (ML Guide)¶
AKQuant includes a high-performance machine learning training framework designed specifically for quantitative trading. It addresses the "future function" (look-ahead) leakage problem common in traditional frameworks and provides out-of-the-box support for Walk-forward Validation.
Core Design Philosophy¶
1. Signal vs. Action Separation¶
A common beginner mistake is to let the model output "buy/sell" instructions directly. In AKQuant, we decouple this process (a minimal sketch follows the list):
- Model Layer: Responsible only for predicting future probabilities or values (Signal) based on historical data. It does not know how much money the account has or what the current market risk is.
- Strategy Layer: Receives the Signal from the model and makes buy/sell decisions (Action) combined with risk control rules, capital management, and market status.
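A minimal sketch of the split, using the strategy methods that appear in the complete example later in this guide (`predict`, `get_position`, `buy`, `sell`):

```python
# Model layer: emits a probability ("Signal"); it knows nothing about the account.
signal = self.model.predict(X_latest)[0]

# Strategy layer: turns the Signal into an "Action" under risk rules.
pos = self.get_position(bar.symbol)
if signal > 0.55 and pos == 0:
    self.buy(bar.symbol, 100)   # sizing and risk checks live here, not in the model
elif signal < 0.45 and pos > 0:
    self.sell(bar.symbol, pos)
```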
2. Adapter Pattern¶
To unify the disparate programming paradigms of Scikit-learn (traditional machine learning) and PyTorch (deep learning), we introduced an adapter layer:
- SklearnAdapter: Adapts XGBoost, LightGBM, RandomForest, etc.
- PyTorchAdapter: Adapts deep networks like LSTM, Transformer, automatically handling DataLoader and training loops.
Users only need to interface with the unified QuantModel.
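For example, swapping model families is a one-line change behind the adapter (a sketch; `RandomForestClassifier` stands in for any sklearn-compatible estimator):

```python
from sklearn.ensemble import RandomForestClassifier
from akquant.ml import SklearnAdapter

# The adapter hides the backend behind the unified QuantModel interface,
# so replacing RandomForest with XGBoost (or a deep network via
# PyTorchAdapter) leaves the strategy code untouched.
model = SklearnAdapter(RandomForestClassifier(n_estimators=200))
```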
3. Walk-forward Validation¶
On time-series data, random K-Fold cross-validation is incorrect because it uses future data to predict the past. The correct approach is Walk-forward:
- Window 1: Train on 2020 data, predict 2021 Q1.
- Window 2: Train on 2020 Q2 - 2021 Q1 data, predict 2021 Q2.
- ... and so on, rolling forward like a wheel (see the index sketch below).
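In bar indices, the window arithmetic looks like this (a standalone illustration, independent of the framework; the numbers mirror the `train_window=50`, `rolling_step=10` configuration used in the complete example):

```python
# Roll a fixed-length training window forward through the bar history.
train_window, rolling_step, n_bars = 50, 10, 100

start = 0
while start + train_window + rolling_step <= n_bars:
    train_end = start + train_window
    print(f"train bars [{start}, {train_end}) -> "
          f"predict bars [{train_end}, {train_end + rolling_step})")
    start += rolling_step
```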
4. Preventing Look-ahead Bias¶
In quantitative ML, the most dangerous error is using future data. AKQuant recommends following these principles:
- Features (X): Can only use data from time \(t\) and before.
- Labels (y): Describe the state at time \(t+1\) (e.g., future returns); that is, training at time \(t\) fits features \(X_t\) against labels \(y_{t+1}\).
- Implementation: Constructing \(y\) usually requires `shift(-1)`, which leaves the last row of data without a label (there is no future bar yet), so it must be dropped before training. See the sketch below.
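A minimal pandas illustration of this label construction (synthetic prices):

```python
import pandas as pd

df = pd.DataFrame({"close": [100.0, 101.0, 99.0, 102.0]})

X = pd.DataFrame({"ret1": df["close"].pct_change()})  # features: data up to t only
future_ret = df["close"].pct_change().shift(-1)       # label source: the t+1 return

# pct_change() leaves the first row without a feature;
# shift(-1) leaves the last row without a label. Drop both.
valid = X["ret1"].notna() & future_ret.notna()
X_train = X[valid]
y_train = (future_ret[valid] > 0).astype(int)         # 1 = next bar closes up
```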
5. Preventing Data Leakage: Using Pipeline¶
Feature preprocessing (e.g., standardization, normalization) can also introduce Look-ahead Bias. For example, using StandardScaler on the entire dataset implies that the training set contains mean and variance information from the future test set.
Solution: Encapsulate preprocessing steps in sklearn.pipeline.Pipeline.
- Encapsulation: Pipeline treats the Scaler and Model as a whole.
- Isolation: During Walk-forward training, Pipeline calls `fit` (computing the mean/variance) only on the current training window, then applies the result to the validation set.
- Consistency: In the inference phase, Pipeline automatically applies the fitted statistics; the user never maintains them manually (see the contrast sketch below).
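A self-contained sketch of the difference on synthetic data; the complete example below applies the same idea inside a strategy:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)
X_train, y_train, X_test = X[:80], y[:80], X[80:]  # time-ordered split, no shuffle

# Leaky: StandardScaler().fit(X) would bake the test window's mean/variance
# into training. Safe: fitting the Pipeline computes the statistics from
# X_train alone; predict_proba reuses those frozen statistics on later bars.
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)
proba = pipe.predict_proba(X_test)[:, 1]
```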
Complete Runnable Example¶
The following code demonstrates how to build a robust strategy combining Pipeline and Walk-forward Validation.
import numpy as np
import pandas as pd
from typing import Any
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from akquant import Strategy, run_backtest
from akquant.ml import SklearnAdapter
class WalkForwardStrategy(Strategy):
"""
Demo Strategy: Predicting returns using Logistic Regression (with Pipeline preprocessing)
"""
def __init__(self):
# 1. Initialize Model (Encapsulate preprocessing and model using Pipeline)
# StandardScaler: Ensures standardization using training set statistics to prevent leakage
pipeline = Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression())
])
self.model = SklearnAdapter(pipeline)
# 2. Configure Walk-forward Validation
# The framework automatically handles data slicing and model retraining
self.model.set_validation(
method='walk_forward',
train_window=50, # Train on past 50 bars
rolling_step=10, # Retrain every 10 bars
frequency='1m', # Data frequency
            incremental=False,  # Whether to use incremental learning (requires an estimator with partial_fit)
verbose=True # Print training logs
)
# Ensure history depth covers training window + feature calculation window
# Alternatively use self.warmup_period = 60
self.set_history_depth(60)
    def prepare_features(self, df: pd.DataFrame, mode: str = "training") -> Any:
"""
[Must Implement] Feature Engineering Logic
Used for both training (generating X, y) and inference (generating X)
"""
X = pd.DataFrame()
# Feature 1: 1-period return
X['ret1'] = df['close'].pct_change()
# Feature 2: 2-period return
X['ret2'] = df['close'].pct_change(2)
if mode == 'inference':
# Inference Mode: Return only the last row of features, no y needed
# Note: df passed during inference is the recent history_depth data
# The last row is the latest bar, we need its features
return X.iloc[-1:]
# Training Mode: Construct label y (predict next period's return)
# shift(-1) moves future return to current row as label
future_ret = df['close'].pct_change().shift(-1)
# Combine into one DataFrame to align drops
data = pd.concat([X, future_ret.rename("future_ret")], axis=1)
# Drop rows with NaN features (e.g. from history padding or initial pct_change)
data = data.dropna(subset=["ret1", "ret2"])
# For training, we must have a valid future return
data = data.dropna(subset=["future_ret"])
# Calculate y on valid data
y = (data["future_ret"] > 0).astype(int)
X_clean = data[["ret1", "ret2"]]
return X_clean, y
def on_bar(self, bar):
# 3. Real-time Prediction & Trading
# Get recent history for feature extraction
# Note: Need enough history to calculate features (e.g. pct_change(2) needs at least 3 bars)
hist_df = self.get_history_df(10)
# If data is insufficient, return
if len(hist_df) < 5:
return
# Reuse feature calculation logic!
# Directly call prepare_features to get current features
X_curr = self.prepare_features(hist_df, mode='inference')
try:
# Get prediction signal (probability)
# SklearnAdapter returns probability of Class 1 for binary classification
signal = self.model.predict(X_curr)[0]
# Print signal for observation
# print(f"Time: {bar.timestamp}, Signal: {signal:.4f}")
# Combine with risk rules for ordering
# Use self.get_position(symbol) to check position
pos = self.get_position(bar.symbol)
if signal > 0.55 and pos == 0:
self.buy(bar.symbol, 100)
elif signal < 0.45 and pos > 0:
self.sell(bar.symbol, pos)
except Exception:
# Model might not be initialized or training failed
pass
if __name__ == "__main__":
# 1. Generate Synthetic Data
print("Generating test data...")
dates = pd.date_range(start="2023-01-01", periods=500, freq="1min")
    # Random walk price (seeded so the example run is reproducible)
    np.random.seed(42)
    price = 100 + np.cumsum(np.random.randn(500))
df = pd.DataFrame({
"timestamp": dates,
"open": price,
"high": price + 1,
"low": price - 1,
"close": price,
"volume": 1000,
"symbol": "TEST"
})
# 2. Run Backtest
print("Starting ML Backtest...")
result = run_backtest(
data=df,
strategy=WalkForwardStrategy,
symbols="TEST",
lot_size=1,
fill_policy={"price_basis": "close", "bar_offset": 0, "temporal": "same_cycle"}, # Match at close of current bar
history_depth=60,
warmup_period=50,
)
print("Backtest Finished.")
# 3. Print Results
print(result)
Example Output¶
After running the code above, you will see output similar to this (including detailed performance metrics):
Generating test data...
Starting ML Backtest...
2026-02-09 15:58:29 | INFO | Running backtest via run_backtest()...
[########################################] 500/500 (0s)
Backtest Finished.
BacktestResult:
Value
name
start_time 2023-01-01 00:00:00+08:00
end_time 2023-01-01 08:19:00+08:00
duration 0 days, 8:19:00
total_bars 500
trade_count 12.0
initial_market_value 100000.0
end_market_value 100120.50
total_pnl 120.50
total_return_pct 0.120500
annualized_return 0.127450
max_drawdown 50.00
max_drawdown_pct 0.049900
win_rate 58.333333
loss_rate 41.666667
Advanced Guide¶
1. Feature Engineering Tips¶
Excellent features are key to ML success. Besides simple returns, consider the following (a pandas sketch follows the list):
- Technical Indicators: RSI, MACD, Bollinger Bands (`talib` or `pandas_ta` are recommended).
- Volatility Features: Historical volatility, ATR.
- Market Microstructure: Buying/selling pressure, volume-price relationships.
- Time Features: Hour, day of week (note these are categorical and may need one-hot encoding).
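As a starting point, RSI and ATR can be computed in plain pandas (rolling-mean variants; `talib` uses Wilder smoothing, so values will differ slightly):

```python
import pandas as pd

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI with simple rolling means (Cutler's variant)."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    return 100 - 100 / (1 + gain / loss)

def atr(df: pd.DataFrame, window: int = 14) -> pd.Series:
    """Average True Range from high/low/close columns."""
    prev_close = df["close"].shift(1)
    true_range = pd.concat(
        [
            df["high"] - df["low"],
            (df["high"] - prev_close).abs(),
            (df["low"] - prev_close).abs(),
        ],
        axis=1,
    ).max(axis=1)
    return true_range.rolling(window).mean()
```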
2. Model Persistence (Save/Load)¶
Trained models can be saved for live trading or subsequent analysis.
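The adapters may expose their own save/load helpers (check the API reference); as a baseline for the sklearn path, the fitted Pipeline can be persisted with joblib. A sketch (the file name is illustrative):

```python
import joblib

# After training: persist the fitted pipeline, including the scaler's
# frozen statistics, so live inference matches the backtest exactly.
joblib.dump(pipeline, "walk_forward_model.joblib")

# In live trading or later analysis: restore and predict as before.
pipeline = joblib.load("walk_forward_model.joblib")
```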
3. Deep Learning Support (PyTorch)¶
Use PyTorchAdapter to easily integrate deep learning models. You need to define a standard nn.Module.
from akquant.ml import PyTorchAdapter
import torch.nn as nn
import torch.optim as optim
# Define Network
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Sequential(
nn.Linear(10, 32),
nn.ReLU(),
nn.Linear(32, 1),
nn.Sigmoid()
)
def forward(self, x):
return self.fc(x)
# Use inside your Strategy's __init__
self.model = PyTorchAdapter(
network=SimpleNet(),
criterion=nn.BCELoss(),
optimizer_cls=optim.Adam,
lr=0.001,
epochs=20,
batch_size=64,
device='cuda' # Support GPU acceleration
)
API Reference¶
model.set_validation¶
Configure model validation and training methods (a usage sketch follows the parameter list).
def set_validation(
self,
method: str = 'walk_forward',
train_window: str | int = '1y',
test_window: str | int = '3m',
rolling_step: str | int = '3m',
frequency: str = '1d',
incremental: bool = False,
verbose: bool = False
)
- `method`: Currently only `'walk_forward'` is supported.
- `train_window`: Length of the training window. Supports `'1y'` (1 year), `'6m'` (6 months), `'50d'` (50 days), or an integer (number of bars).
- `test_window`: Length of the testing window (not strictly used in the current rolling mode; mainly for evaluation configuration).
- `rolling_step`: Rolling step size, i.e., how often the model is retrained.
- `frequency`: Data frequency, used to convert time strings to bar counts (e.g., `'1y'` = 252 bars under `'1d'`).
- `incremental`: Whether to continue training from the last model (incremental learning) or retrain from scratch. Default is `False`.
- `verbose`: Whether to print training logs. Default is `False`.
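For daily data, the same call can be written with time strings instead of bar counts (values are illustrative):

```python
self.model.set_validation(
    method='walk_forward',
    train_window='1y',   # ~252 daily bars
    test_window='3m',
    rolling_step='3m',   # retrain every quarter
    frequency='1d',      # tells the framework how to convert '1y'/'3m' into bars
)
```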
strategy.prepare_features¶
Callback function that must be implemented by the user for feature engineering.
- Input:
    - `df`: Historical data DataFrame.
    - `mode`: `"training"` (training mode) or `"inference"` (inference mode).
- Output:
    - `mode="training"`: returns `(X, y)`.
    - `mode="inference"`: returns `X` only (usually the last row of features).
- Note: This is a pure function and should not rely on external state.