
Regression Signal Model

Train a regression model to predict next-bar returns from technical features — with walk-forward cross-validation and feature importance.

Signals — Regression Models


1. Dependency Installation

[1]
!pip install pandas numpy plotly scikit-learn

2. Library Imports

[2]
import warnings; warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

3. Strategy Overview

Regression models predict the magnitude of future price returns (continuous output) rather than a binary direction. The predicted return can then be thresholded to generate trading signals.

Models evaluated:

  • Ridge Regression: L2-regularised linear model; stable under multicollinearity.
  • Lasso Regression: L1-regularised; performs automatic feature selection via zero coefficients.
  • Random Forest Regressor: Non-linear ensemble; captures complex feature interactions.
  • Gradient Boosting Regressor: Boosted tree ensemble; often the most accurate of the four on tabular data, though it can overfit noisy targets such as short-horizon returns.
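
To make the difference between the L2 and L1 penalties concrete, here is a minimal sketch on synthetic data rather than the notebook's price features. The data, the seed, and the `alpha` values are assumptions chosen for illustration: only the first of five features drives the target, and the Lasso penalty is set deliberately high so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical toy data: 5 features, only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)  # large alpha, chosen only to make the effect obvious

# Ridge shrinks every coefficient but keeps them non-zero;
# Lasso sets the coefficients of the uninformative features to exactly zero.
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Features zeroed by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

With real return data the penalty strength would normally be tuned rather than fixed by hand.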

Evaluation metrics:

  • MSE / RMSE: Penalises large errors heavily.
  • MAE: Average magnitude of errors; more robust to outliers than RMSE.
  • R²: Proportion of variance explained; 1.0 = perfect fit.
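
As a rough illustration of how these metrics relate, the sketch below computes each one by hand and checks it against scikit-learn. The return values are hypothetical. Note that R² has no lower bound: a negative value means the model does worse on the test set than simply predicting the mean of the test targets.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted returns, for illustration only.
y_true = np.array([0.010, -0.020, 0.005, 0.000])
y_pred = np.array([0.008, -0.010, 0.000, 0.002])

err  = y_true - y_pred
mse  = np.mean(err ** 2)              # mean squared error
rmse = np.sqrt(mse)                   # same units as the target
mae  = np.mean(np.abs(err))           # mean absolute error
r2   = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2,  r2_score(y_true, y_pred))

print(f"MSE={mse:.6f}  RMSE={rmse:.6f}  MAE={mae:.6f}  R2={r2:.4f}")
```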

Signal derivation: Predicted return > signal_threshold → Buy (+1); < −signal_threshold → Sell (−1); otherwise no signal (0).
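
The thresholding rule can be sketched in a few lines. The predicted returns below are hypothetical; the threshold of 0.002 matches the default of `signal_threshold` in `regression_model`. Both comparisons are strict, so a prediction exactly at the threshold produces no signal.

```python
import numpy as np

signal_threshold = 0.002  # same default as in the notebook's regression_model
# Hypothetical predicted returns, for illustration only.
preds = np.array([0.0050, 0.0010, -0.0030, -0.0020, 0.0020])

# +1 above the threshold, -1 below the negative threshold, 0 in between.
signals = np.where(preds > signal_threshold, 1,
          np.where(preds < -signal_threshold, -1, 0))
print(signals.tolist())  # [1, 0, -1, 0, 0]
```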

4. Data Generation

[3]
def generate_data(periods: int) -> pd.DataFrame:
    """
    Generate synthetic OHLCV price data using a geometric random walk.

    Parameters
    ----------
    periods : int
        Number of 1-minute bars to generate.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns: open, high, low, close, volume, datetime.
    """
    start_date     = pd.to_datetime("2024-01-01 00:00:00+00:00")
    datetime_index = pd.date_range(start_date, periods=periods, freq="1min", tz="UTC")
    price_data = []
    last_close = 42000
    for i in range(periods):
        open_price  = last_close + np.random.normal(0, last_close * 0.0005)
        close_price = open_price + np.random.normal(0, last_close * 0.005)
        body_high   = max(open_price, close_price)
        body_low    = min(open_price, close_price)
        # Wicks extend beyond the candle body, so high >= body_high and low <= body_low.
        high_price  = body_high + abs(np.random.normal(0, last_close * 0.002))
        low_price   = body_low  - abs(np.random.normal(0, last_close * 0.002))
        price_data.append({
            "open":  max(1, int(open_price)),
            "high":  max(1, int(high_price)),
            "low":   max(1, int(low_price)),
            "close": max(1, int(close_price)),
        })
        last_close = close_price
    df = pd.DataFrame(price_data, index=datetime_index)
    df.index.name = "datetime"
    df["volume"]   = np.random.uniform(100.0, 500.0, periods)
    df["datetime"] = df.index.to_series()
    return df.reset_index(drop=True)

df = generate_data(500)
display(df.head())
    open   high    low  close      volume                   datetime
0  41998  42037  41919  42019  417.369207  2024-01-01 00:00:00+00:00
1  42007  42374  41880  42200  336.167065  2024-01-01 00:01:00+00:00
2  42238  42606  42199  42460  257.399884  2024-01-01 00:02:00+00:00
3  42490  42659  42156  42211  108.553065  2024-01-01 00:03:00+00:00
4  42215  42379  42121  42273  469.966533  2024-01-01 00:04:00+00:00

5. Feature Engineering (Shared Pipeline)

[4]
def build_regression_features(df: pd.DataFrame, horizon: int = 5) -> tuple:
    """
    Compute technical features and continuous forward-return labels.

    Parameters
    ----------
    df : pd.DataFrame  OHLCV DataFrame.
    horizon : int      Prediction horizon in bars.

    Returns
    -------
    tuple
        (X_train, X_test, y_train, y_test, feature_names)
    """
    df = df.copy().sort_values("datetime", ignore_index=True)
    # Momentum: 1-bar and 5-bar percentage returns
    df["return_1"]   = df["close"].pct_change(1)
    df["return_5"]   = df["close"].pct_change(5)
    # RSI(14) using simple rolling means of gains and losses (not Wilder's smoothing)
    delta = df["close"].diff()
    gain  = delta.clip(lower=0).rolling(14).mean()
    loss  = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"]    = 100 - 100 / (1 + gain / loss.replace(0, np.nan))
    # Trend: ratio of the 10-bar to the 20-bar simple moving average
    df["sma_ratio"]  = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
    # Volatility: 14-bar average true range as a fraction of the close
    tr = pd.concat([
        df["high"] - df["low"],
        (df["high"] - df["close"].shift(1)).abs(),
        (df["low"]  - df["close"].shift(1)).abs(),
    ], axis=1).max(axis=1)
    df["atr_pct"]   = tr.rolling(14).mean() / df["close"]
    # Volume relative to its 20-bar average
    df["vol_ratio"]  = df["volume"] / df["volume"].rolling(20).mean()

    # Continuous label: future percentage return
    df["label"] = df["close"].shift(-horizon) / df["close"] - 1

    features = ["return_1", "return_5", "rsi_14", "sma_ratio", "atr_pct", "vol_ratio"]
    df.dropna(subset=features + ["label"], inplace=True)

    X, y = df[features].values, df["label"].values
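    # Chronological 80/20 split (no shuffling), so the test set is strictly later in time.
    # Note: the last `horizon` training labels are computed from closes in the test window.
    split = int(len(X) * 0.8)
    X_tr, X_te = X[:split], X[split:]
    y_tr, y_te = y[:split], y[split:]

    # Fit the scaler on the training rows only, then apply it to the test rows.
    sc = StandardScaler()
    X_tr = sc.fit_transform(X_tr)
    X_te = sc.transform(X_te)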
    return X_tr, X_te, y_tr, y_te, features

X_train, X_test, y_train, y_test, feat_names = build_regression_features(df)
print(f"Train: {len(X_train)} | Test: {len(X_test)}")
Train: 380 | Test: 96
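
The forward-return label is the least obvious step in the cell above, so here is a small hand-checkable sketch of the same expression. The closing prices and the horizon of 2 are hypothetical: each close is 10% above the previous one, so every 2-bar forward return is about 21%.

```python
import pandas as pd

horizon = 2
# Hypothetical closes, each 10% above the previous bar.
close = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41])

# Same expression as in build_regression_features:
# the return from bar t to bar t + horizon.
label = close.shift(-horizon) / close - 1
print(label.round(4).tolist())

# The last `horizon` rows have no future close, so their label is NaN.
print("Rows without a label:", int(label.isna().sum()))
```

Those trailing NaN rows are what `dropna` removes at the end of the series, alongside the leading rows lost to the rolling windows.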

6. Model Training and Evaluation

[5]
def regression_model(X_train, X_test, y_train, y_test,
                     signal_threshold: float = 0.002) -> dict:
    """
    Train and evaluate multiple regression models for return prediction.

    Core logic
    ----------
    1. Fit Ridge, Lasso, RandomForest, and GradientBoosting regressors.
    2. Predict continuous returns on the test set.
    3. Compute MSE, RMSE, MAE, and R² metrics.
    4. Derive directional signals by thresholding predicted returns.

    Parameters
    ----------
    X_train, X_test    : np.ndarray  Scaled feature matrices.
    y_train, y_test    : np.ndarray  Continuous return labels.
    signal_threshold   : float       Minimum predicted return to trigger a signal.

    Returns
    -------
    dict
        Model name → {'model', 'predictions', 'signals', 'metrics'}
    """
    models = {
        "Ridge":              Ridge(alpha=1.0),
        "Lasso":              Lasso(alpha=0.001, max_iter=5000),
        "Random Forest":      RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting":  GradientBoostingRegressor(n_estimators=100, random_state=42),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        signals = np.where(preds > signal_threshold, 1,
                  np.where(preds < -signal_threshold, -1, 0))
        metrics = {
            "MSE":  mean_squared_error(y_test, preds),
            "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
            "MAE":  mean_absolute_error(y_test, preds),
            "R2":   r2_score(y_test, preds),
        }
        results[name] = {"model": model, "predictions": preds,
                         "signals": signals, "metrics": metrics}
        print(f"\n{name}: MSE={metrics['MSE']:.6f}  MAE={metrics['MAE']:.6f}  R²={metrics['R2']:.4f}")
    return results

results = regression_model(X_train, X_test, y_train, y_test)

Ridge: MSE=0.000130  MAE=0.009493  R²=0.0151

Lasso: MSE=0.000130  MAE=0.009452  R²=0.0190

Random Forest: MSE=0.000165  MAE=0.010556  R²=-0.2449

Gradient Boosting: MSE=0.000161  MAE=0.010413  R²=-0.2178
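
The cell above scores each model on a single chronological 80/20 split. The notebook description also mentions walk-forward cross-validation and feature importance; one way to sketch both is with scikit-learn's `TimeSeriesSplit`, which trains on earlier rows and tests on the rows that follow. The data below is a random stand-in (an assumption, not the notebook's features), and `n_splits=5`, `n_estimators=50`, and the feature names `f0` to `f2` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data (assumption): 300 time-ordered rows, 3 features, weak signal in f0.
rng = np.random.default_rng(42)
feature_names = ["f0", "f1", "f2"]  # hypothetical names
X = rng.normal(size=(300, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=300)

tscv = TimeSeriesSplit(n_splits=5)
fold_r2, fold_importances = [], []
for train_idx, test_idx in tscv.split(X):
    # Each fold trains only on rows that precede its test window.
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    fold_r2.append(r2_score(y[test_idx], model.predict(X[test_idx])))
    fold_importances.append(model.feature_importances_)

print("R2 per fold:", np.round(fold_r2, 3))

# Average the impurity-based importances across folds.
mean_imp = np.mean(fold_importances, axis=0)
for name, imp in zip(feature_names, mean_imp):
    print(f"{name}: {imp:.3f}")
```

Applied to the notebook's data, the scaler would need to be refit inside each fold on the training rows only. `TimeSeriesSplit` also accepts a `gap` argument, which could be set to the label horizon so that training labels do not overlap the test window.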

7. Visualization — Predicted vs Actual Returns

[7]
fig = make_subplots(rows=2, cols=2,
    subplot_titles=list(results.keys()))

for i, (name, res) in enumerate(results.items()):
    row, col = i // 2 + 1, i % 2 + 1
    # Show the "Actual" legend entry once; it is the same series in every panel.
    fig.add_trace(go.Scatter(y=y_test[:100], mode="lines", name="Actual",
        line=dict(color="blue"), showlegend=(i == 0)), row=row, col=col)
    fig.add_trace(go.Scatter(y=res["predictions"][:100], mode="lines", name=name,
        line=dict(color="red"), showlegend=True), row=row, col=col)

fig.update_layout(title_text="Predicted vs Actual Returns (up to the first 100 test bars)",
                  height=600)
fig.show()