Regression Signal Model
Train a regression model to predict forward returns (five bars ahead by default) from technical features, with walk-forward cross-validation and feature importance.
Tags: regression, prediction, features
Signals — Regression Models
1. Dependency Installation
[1]
!pip install pandas numpy plotly scikit-learn
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)
Requirement already satisfied: plotly in /usr/local/lib/python3.12/dist-packages (5.24.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.12/dist-packages (1.6.1)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: tenacity>=6.2.0 in /usr/local/lib/python3.12/dist-packages (from plotly) (9.1.4)
Requirement already satisfied: packaging in /usr/local/lib/python3.12/dist-packages (from plotly) (26.1)
Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (1.16.3)
Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (1.5.3)
Requirement already satisfied: threadpoolctl>=3.1.0 in /usr/local/lib/python3.12/dist-packages (from scikit-learn) (3.6.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
2. Library Imports
[2]
import warnings; warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
3. Strategy Overview
Regression models predict the magnitude of future price returns (continuous output) rather than a binary direction. The predicted return can then be thresholded to generate trading signals.
Models evaluated:
- Ridge Regression: L2-regularised linear model; stable under multicollinearity.
- Lasso Regression: L1-regularised; performs automatic feature selection via zero coefficients.
- Random Forest Regressor: Non-linear ensemble; captures complex feature interactions.
- Gradient Boosting Regressor: Boosted tree ensemble; often the most accurate of the four on tabular data, but prone to overfitting noisy return series.
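The difference between L2 and L1 regularisation in the list above can be seen on a toy problem. The following is a minimal sketch on synthetic data (not the notebook's features), assuming five standardised inputs of which only the first drives the target; the `alpha` values are illustrative choices, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): only column 0 is informative.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(500)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set coefficients exactly to zero

ridge_nonzero = int(np.count_nonzero(ridge.coef_))
lasso_nonzero = int(np.count_nonzero(lasso.coef_))
print("Ridge non-zero coefficients:", ridge_nonzero)
print("Lasso non-zero coefficients:", lasso_nonzero)
```

On data like this, Ridge should keep a small weight on every column, while Lasso should retain only the informative one. On real return data the selection is usually far less clean, because no single feature is this strongly related to the target.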
Evaluation metrics:
- MSE / RMSE: Penalises large errors heavily.
- MAE: Average magnitude of errors; more robust to outliers than RMSE.
- R²: Proportion of variance explained; 1.0 = perfect fit, 0 = no better than predicting the mean of the test labels, negative = worse than that baseline.
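To make the metric definitions concrete, here is a small worked sketch that computes each one by hand with NumPy. The four return values are made-up numbers chosen for illustration, not results from this notebook.

```python
import numpy as np

# Hypothetical actual and predicted returns (illustrative values only).
y_true = np.array([0.010, -0.020, 0.005, 0.000])
y_pred = np.array([0.008, -0.010, 0.000, 0.002])

err = y_true - y_pred
mse = np.mean(err ** 2)          # mean squared error
rmse = np.sqrt(mse)              # same units as the returns
mae = np.mean(np.abs(err))       # mean absolute error

ss_res = np.sum(err ** 2)                        # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # variance around the mean
r2 = 1.0 - ss_res / ss_tot

# Baseline: always predicting the mean of y_true gives R² = 0 by construction.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1.0 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"MSE={mse:.8f} RMSE={rmse:.6f} MAE={mae:.6f} R2={r2:.4f}")
```

The single large miss on the second observation accounts for most of the MSE, which is why RMSE comes out above MAE here.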
Signal derivation: Predicted return > signal_threshold → Buy (+1); < −signal_threshold → Sell (−1); otherwise no signal (0).
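The thresholding rule can be sketched as a standalone function. `to_signals` mirrors the rule stated above; `hit_rate` is a hypothetical helper added here for illustration (it is not part of the notebook) that measures how often a non-zero signal matches the sign of the realised return.

```python
import numpy as np

def to_signals(preds, signal_threshold=0.002):
    """Map predicted returns to +1 (buy), -1 (sell) or 0 (no signal)."""
    preds = np.asarray(preds, dtype=float)
    return np.where(preds > signal_threshold, 1,
                    np.where(preds < -signal_threshold, -1, 0))

def hit_rate(signals, realised):
    """Share of active (non-zero) signals whose sign matches the realised return."""
    signals = np.asarray(signals)
    realised = np.asarray(realised, dtype=float)
    active = signals != 0
    if not active.any():
        return float("nan")  # no trades, so the rate is undefined
    return float(np.mean(np.sign(realised[active]) == signals[active]))

# Illustrative values only.
preds = np.array([0.0035, -0.0041, 0.0010, -0.0005, 0.0020])
realised = np.array([0.0050, 0.0020, -0.0030, 0.0010, 0.0040])

signals = to_signals(preds)
print(signals.tolist(), hit_rate(signals, realised))
```

Note that the comparison is strict, so a prediction exactly equal to the threshold produces no signal.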
4. Data Generation
[3]
def generate_data(periods: int) -> pd.DataFrame:
    """
    Generate synthetic OHLCV price data using a geometric random walk.

    Parameters
    ----------
    periods : int
        Number of 1-minute bars to generate.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns: open, high, low, close, volume, datetime.
    """
    start_date = pd.to_datetime("2024-01-01 00:00:00+00:00")
    datetime_index = pd.date_range(start_date, periods=periods, freq="1min", tz="UTC")
    price_data = []
    last_close = 42000
    for i in range(periods):
        open_price = last_close + np.random.normal(0, last_close * 0.0005)
        close_price = open_price + np.random.normal(0, last_close * 0.005)
        body_high = max(open_price, close_price)
        body_low = min(open_price, close_price)
        high_price = max(body_high + abs(np.random.normal(0, last_close * 0.002)), open_price, close_price)
        low_price = min(body_low - abs(np.random.normal(0, last_close * 0.002)), open_price, close_price)
        if high_price < low_price:
            high_price, low_price = low_price, high_price
        price_data.append({
            "open": max(1, int(open_price)),
            "high": max(1, int(high_price)),
            "low": max(1, int(low_price)),
            "close": max(1, int(close_price)),
        })
        last_close = close_price
    df = pd.DataFrame(price_data, index=datetime_index)
    df.index.name = "datetime"
    df["volume"] = np.random.uniform(100.0, 500.0, periods)
    df["datetime"] = df.index.to_series()
    return df.reset_index(drop=True)

df = generate_data(500)
display(df.head())
| | open | high | low | close | volume | datetime |
|---|---|---|---|---|---|---|
| 0 | 41998 | 42037 | 41919 | 42019 | 417.369207 | 2024-01-01 00:00:00+00:00 |
| 1 | 42007 | 42374 | 41880 | 42200 | 336.167065 | 2024-01-01 00:01:00+00:00 |
| 2 | 42238 | 42606 | 42199 | 42460 | 257.399884 | 2024-01-01 00:02:00+00:00 |
| 3 | 42490 | 42659 | 42156 | 42211 | 108.553065 | 2024-01-01 00:03:00+00:00 |
| 4 | 42215 | 42379 | 42121 | 42273 | 469.966533 | 2024-01-01 00:04:00+00:00 |
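Because the generator builds each bar's high and low around the candle body, every bar should satisfy high ≥ max(open, close) and low ≤ min(open, close). A minimal sketch of a consistency check follows; `check_ohlc` is a hypothetical helper, not part of the notebook. The first two sample rows are copied from the table above and the third is deliberately broken.

```python
import pandas as pd

def check_ohlc(df: pd.DataFrame) -> pd.Series:
    """Return True for each bar whose high/low bracket the open/close body."""
    body_high = df[["open", "close"]].max(axis=1)
    body_low = df[["open", "close"]].min(axis=1)
    return (df["high"] >= body_high) & (df["low"] <= body_low) & (df["low"] > 0)

sample = pd.DataFrame({
    "open":  [41998, 42007, 42238],
    "high":  [42037, 42374, 42200],   # third bar is invalid: high below the close
    "low":   [41919, 41880, 42199],
    "close": [42019, 42200, 42460],
})

ok = check_ohlc(sample)
print(ok.tolist())
```

In the notebook the same check could be applied to the generated frame, for example `check_ohlc(df).all()`.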
5. Feature Engineering (Shared Pipeline)
[4]
def build_regression_features(df: pd.DataFrame, horizon: int = 5) -> tuple:
    """
    Compute technical features and continuous forward-return labels.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame.
    horizon : int
        Prediction horizon in bars.

    Returns
    -------
    tuple
        (X_train, X_test, y_train, y_test, feature_names)
    """
    df = df.copy().sort_values("datetime", ignore_index=True)
    # Momentum: 1-bar and 5-bar percentage returns
    df["return_1"] = df["close"].pct_change(1)
    df["return_5"] = df["close"].pct_change(5)
    # RSI(14) using simple rolling means of gains and losses
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi_14"] = 100 - 100 / (1 + gain / loss.replace(0, np.nan))
    # Trend: ratio of the 10-bar to the 20-bar simple moving average
    df["sma_ratio"] = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
    # Volatility: 14-bar average true range as a fraction of the close
    tr = pd.concat([
        df["high"] - df["low"],
        (df["high"] - df["close"].shift(1)).abs(),
        (df["low"] - df["close"].shift(1)).abs(),
    ], axis=1).max(axis=1)
    df["atr_pct"] = tr.rolling(14).mean() / df["close"]
    # Volume relative to its 20-bar average
    df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()
    # Continuous label: future percentage return
    df["label"] = df["close"].shift(-horizon) / df["close"] - 1
    features = ["return_1", "return_5", "rsi_14", "sma_ratio", "atr_pct", "vol_ratio"]
    df.dropna(subset=features + ["label"], inplace=True)
    X, y = df[features].values, df["label"].values
    # Chronological 80/20 split (no shuffling). Note that the last `horizon`
    # training labels are computed from closes that fall in the test period.
    split = int(len(X) * 0.8)
    X_tr, X_te = X[:split], X[split:]
    y_tr, y_te = y[:split], y[split:]
    # Fit the scaler on the training rows only, then apply it to the test rows
    sc = StandardScaler()
    X_tr = sc.fit_transform(X_tr)
    X_te = sc.transform(X_te)
    return X_tr, X_te, y_tr, y_te, features

X_train, X_test, y_train, y_test, feat_names = build_regression_features(df)
print(f"Train: {len(X_train)} | Test: {len(X_test)}")
Train: 380 | Test: 96
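A single 80/20 split gives only one out-of-sample estimate. One way to get several is walk-forward validation, sketched below with scikit-learn's `TimeSeriesSplit`. This is a sketch under stated assumptions rather than the notebook's own method: `walk_forward_mae` is a hypothetical helper, the demo arrays are random stand-ins for unscaled features and labels, and `gap=horizon` is used to leave out the training rows whose forward-return labels would otherwise reach into the test fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def walk_forward_mae(X, y, n_splits=4, horizon=5):
    """Expanding-window evaluation; returns one MAE per fold."""
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=horizon)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        # The scaler sits inside the pipeline, so it is refit on each training window.
        model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
        model.fit(X[train_idx], y[train_idx])
        scores.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return scores

# Random stand-in data with the same shape as the notebook's 476 usable rows.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((476, 6))
y_demo = 0.01 * rng.standard_normal(476)

fold_mae = walk_forward_mae(X_demo, y_demo)
print([round(m, 5) for m in fold_mae])
```

To use this with the notebook's data, the function would need the unscaled feature matrix and labels in chronological order, since `build_regression_features` returns arrays that are already split and scaled.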
6. Model Training and Evaluation
[5]
def regression_model(X_train, X_test, y_train, y_test,
                     signal_threshold: float = 0.002) -> dict:
    """
    Train and evaluate multiple regression models for return prediction.

    Core logic
    ----------
    1. Fit Ridge, Lasso, RandomForest, and GradientBoosting regressors.
    2. Predict continuous returns on the test set.
    3. Compute MSE, MAE, R² metrics.
    4. Derive directional signals by thresholding predicted returns.

    Parameters
    ----------
    X_train, X_test : np.ndarray
        Scaled feature matrices.
    y_train, y_test : np.ndarray
        Continuous return labels.
    signal_threshold : float
        Minimum predicted return to trigger a signal.

    Returns
    -------
    dict
        Model name → {'model', 'predictions', 'signals', 'metrics'}
    """
    models = {
        "Ridge": Ridge(alpha=1.0),
        "Lasso": Lasso(alpha=0.001, max_iter=5000),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
        "Gradient Boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        signals = np.where(preds > signal_threshold, 1,
                           np.where(preds < -signal_threshold, -1, 0))
        metrics = {
            "MSE": mean_squared_error(y_test, preds),
            "RMSE": np.sqrt(mean_squared_error(y_test, preds)),
            "MAE": mean_absolute_error(y_test, preds),
            "R2": r2_score(y_test, preds),
        }
        results[name] = {"model": model, "predictions": preds,
                         "signals": signals, "metrics": metrics}
        print(f"\n{name}: MSE={metrics['MSE']:.6f} MAE={metrics['MAE']:.6f} R²={metrics['R2']:.4f}")
    return results

results = regression_model(X_train, X_test, y_train, y_test)
Ridge: MSE=0.000130 MAE=0.009493 R²=0.0151
Lasso: MSE=0.000130 MAE=0.009452 R²=0.0190
Random Forest: MSE=0.000165 MAE=0.010556 R²=-0.2449
Gradient Boosting: MSE=0.000161 MAE=0.010413 R²=-0.2178
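Once the models are fitted, their learned weights can be inspected to see which features they rely on. The sketch below assumes a hypothetical helper, `rank_features`, that reads `feature_importances_` from tree ensembles and falls back to absolute `coef_` values for linear models. The demo data is synthetic, with only the `rsi_14` column made informative, so the ranking is known in advance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

def rank_features(model, feature_names):
    """Return (name, score) pairs sorted from most to least important."""
    if hasattr(model, "feature_importances_"):
        scores = np.asarray(model.feature_importances_)   # impurity-based, sums to 1
    elif hasattr(model, "coef_"):
        scores = np.abs(np.ravel(model.coef_))            # comparable only on scaled inputs
    else:
        raise ValueError("model exposes neither feature_importances_ nor coef_")
    order = np.argsort(scores)[::-1]
    return [(feature_names[i], float(scores[i])) for i in order]

names = ["return_1", "return_5", "rsi_14", "sma_ratio", "atr_pct", "vol_ratio"]

# Synthetic stand-in data: only column 2 ("rsi_14") drives the target.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((400, 6))
y_demo = 0.01 * X_demo[:, 2] + 0.001 * rng.standard_normal(400)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

forest_rank = rank_features(forest, names)
ridge_rank = rank_features(ridge, names)
print(forest_rank[0], ridge_rank[0])
```

In the notebook this could be called as `rank_features(results["Random Forest"]["model"], feat_names)`. Importances taken from a model with negative test R², as the tree models show in the output above, describe what the model fitted in-sample and should not be read as evidence of predictive value.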
7. Visualization — Predicted vs Actual Returns
[7]
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=list(results.keys()))
for i, (name, res) in enumerate(results.items()):
    row, col = i // 2 + 1, i % 2 + 1
    # Show the "Actual" legend entry once rather than once per subplot
    fig.add_trace(go.Scatter(y=y_test[:100], mode="lines", name="Actual",
                             line=dict(color="blue"), showlegend=(i == 0)), row=row, col=col)
    fig.add_trace(go.Scatter(y=res["predictions"][:100], mode="lines", name=name,
                             line=dict(color="red"), showlegend=True), row=row, col=col)
fig.update_layout(title_text="Predicted vs Actual Returns (up to the first 100 test bars)",
                  height=600)
fig.show()