Prepare ML Dataset
Full ML data prep pipeline — feature engineering from OHLCV, labeling, train/test split with time-series awareness, and scaling.
feature engineering · preprocessing · dataset
Signals — Prepare ML Dataset
1. Dependency Installation
[1]
!pip install pandas numpy plotly scikit-learn ta
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)
Requirement already satisfied: plotly in /usr/local/lib/python3.12/dist-packages (5.24.1)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.12/dist-packages (1.6.1)
Collecting ta
Downloading ta-0.11.0.tar.gz (25 kB)
Successfully built ta
Installing collected packages: ta
Successfully installed ta-0.11.0
2. Library Imports
[2]
import warnings; warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
3. Strategy Overview
ML dataset preparation transforms raw OHLCV price data into a structured feature matrix suitable for supervised machine learning.
Pipeline stages:
- Feature engineering — compute technical indicators (returns, moving averages, RSI, ATR, Bollinger Bands, volume ratios) as input features.
- Label generation — assign a binary or ternary target based on forward returns over a prediction horizon.
- Cleaning — drop rows with NaN values introduced by rolling calculations.
- Splitting — partition into training and test sets using time-ordered (non-shuffled) split to prevent look-ahead leakage.
- Scaling — standardise features to zero mean and unit variance.
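As a variation on the splitting stage, the sketch below uses scikit-learn's TimeSeriesSplit to produce several time-ordered folds. The array X, the fold count, and the choice gap=horizon are illustrative assumptions, not part of this notebook's pipeline, which uses a single chronological split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 100 time-ordered samples with 3 features each.
n_samples, horizon = 100, 5
X = np.arange(n_samples * 3, dtype=float).reshape(n_samples, 3)

# gap=horizon leaves `horizon` bars unused between train and test, so labels
# built from close[t + horizon] in the training fold do not reach into the
# test period. (Assumption: the label horizon is 5 bars.)
tscv = TimeSeriesSplit(n_splits=4, gap=horizon)

folds = []
for train_idx, test_idx in tscv.split(X):
    folds.append((train_idx, test_idx))
    print(f"train {train_idx[0]}..{train_idx[-1]}  "
          f"test {test_idx[0]}..{test_idx[-1]}")
```

Every fold trains on earlier bars and tests on later ones, which gives a rough picture of how stable a model is across periods without shuffling the data.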
Limitation: Every feature at bar t must be computed only from data available at or before bar t (for example trailing .rolling() windows or .shift(1)). Any calculation that reads later bars introduces data leakage and invalidates model evaluation.
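One way to test a feature for look-ahead is to recompute it on a truncated series and confirm that the value at the cut point does not change. The sketch below is a minimal illustration under assumed names (causal_feature, leaky_feature, and is_causal are hypothetical helpers, and the price series is synthetic).

```python
import numpy as np
import pandas as pd

# Synthetic close prices (random walk), for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())


def causal_feature(s: pd.Series) -> pd.Series:
    # Trailing 10-bar mean: the value at bar t uses bars t-9..t only.
    return s.rolling(10).mean()


def leaky_feature(s: pd.Series) -> pd.Series:
    # Centred 10-bar mean: the value at bar t also uses bars after t.
    return s.rolling(10, center=True).mean()


def is_causal(feature_fn, s: pd.Series, cut: int = 150) -> bool:
    """True if the feature at bar cut-1 is unchanged when later bars are removed."""
    full = feature_fn(s).iloc[cut - 1]
    truncated = feature_fn(s.iloc[:cut]).iloc[-1]
    # NaN never compares close to a number, so a feature that becomes
    # undefined after truncation is reported as non-causal.
    return bool(np.isclose(full, truncated))


print("trailing window causal:", is_causal(causal_feature, close))
print("centred window causal :", is_causal(leaky_feature, close))
```

The same check can be applied to any feature function before it is added to the feature matrix.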
4. Data Generation (Placeholder)
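This section would normally load or generate real OHLCV data. No data source is specified here, so the cell below builds a random sample DataFrame as a stand-in. Each column is drawn independently, so the rows are not valid price bars (high can fall below low); the values only serve to exercise the pipeline.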
[3]
# Creating a sample DataFrame for demonstration purposes
data = {
'datetime': pd.to_datetime(pd.date_range(start='2023-01-01', periods=100)),
'open': np.random.rand(100) * 100 + 100,
'high': np.random.rand(100) * 100 + 101,
'low': np.random.rand(100) * 100 + 99,
'close': np.random.rand(100) * 100 + 100,
'volume': np.random.rand(100) * 1000 + 1000
}
df = pd.DataFrame(data)
print("Sample DataFrame 'df' created.")
display(df.head())
Sample DataFrame 'df' created.
| | datetime | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2023-01-01 | 160.326010 | 173.185240 | 105.571910 | 140.558648 | 1102.553941 |
| 1 | 2023-01-02 | 110.414987 | 169.407946 | 191.405764 | 122.385890 | 1908.042468 |
| 2 | 2023-01-03 | 150.977449 | 191.535590 | 172.245328 | 117.909869 | 1843.056606 |
| 3 | 2023-01-04 | 193.350720 | 146.142453 | 193.404375 | 156.360302 | 1647.888860 |
| 4 | 2023-01-05 | 106.808465 | 156.484021 | 189.638224 | 199.371700 | 1633.494423 |
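The sample above draws each column independently, so high can be lower than low (as in row 1 of the table). If internally consistent bars are wanted, a seeded random walk is one option. The sketch below is an assumption-laden alternative, not the notebook's generator: make_sample_ohlcv, the volatility values, and the seed are all hypothetical.

```python
import numpy as np
import pandas as pd


def make_sample_ohlcv(n: int = 100, seed: int = 42) -> pd.DataFrame:
    """Generate synthetic OHLCV bars where low <= open, close <= high."""
    rng = np.random.default_rng(seed)  # seeded for reproducible output

    # Close follows a geometric random walk starting near 100.
    close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
    # Each bar opens at the previous close; the first bar opens at 100.
    open_ = np.concatenate(([100.0], close[:-1]))

    # Push high above and low below the open/close range by a random spread.
    spread = np.abs(rng.normal(0, 0.005, n)) * close
    high = np.maximum(open_, close) + spread
    low = np.minimum(open_, close) - spread

    return pd.DataFrame({
        "datetime": pd.date_range(start="2023-01-01", periods=n),
        "open": open_,
        "high": high,
        "low": low,
        "close": close,
        "volume": rng.uniform(1000, 2000, n),
    })


sample = make_sample_ohlcv()
print(sample.head())
```

With bars like these, range-based features such as ATR behave more like they would on real data, since high minus low is never negative.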
5. Feature Engineering and Dataset Preparation
[7]
def prepare_ml_dataset(
    df: pd.DataFrame,
    horizon: int = 5,
    test_size: float = 0.2,
) -> tuple:
    """
    Engineer features from OHLCV data and prepare a supervised ML dataset.

    Core logic
    ----------
    1. Compute a set of lagged technical features from price and volume.
    2. Generate a forward-return binary label: 1 if close[t+horizon] > close[t], else 0.
    3. Drop NaN rows, split chronologically, and standardise via StandardScaler.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame with columns: open, high, low, close, volume, datetime.
    horizon : int
        Number of bars ahead for the prediction target.
    test_size : float
        Fraction of data reserved for the test set (chronological tail).

    Returns
    -------
    tuple
        (X_train, X_test, y_train, y_test, feature_names, scaler, df),
        where df is the cleaned feature DataFrame.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)

    # ── Feature engineering ──────────────────────────────────────────────────
    df["return_1"] = df["close"].pct_change(1)    # 1-bar return
    df["return_5"] = df["close"].pct_change(5)    # 5-bar return
    df["return_10"] = df["close"].pct_change(10)  # 10-bar return
    df["sma_10"] = df["close"].rolling(10).mean()
    df["sma_20"] = df["close"].rolling(20).mean()
    df["sma_ratio"] = df["sma_10"] / df["sma_20"]  # fast/slow MA ratio

    # RSI (14-period, simple moving average of gains and losses)
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    rs = gain / loss.replace(0, np.nan)  # avoid division by zero
    df["rsi_14"] = 100 - (100 / (1 + rs))

    # ATR (14-period)
    tr = pd.concat([
        df["high"] - df["low"],
        (df["high"] - df["close"].shift(1)).abs(),
        (df["low"] - df["close"].shift(1)).abs(),
    ], axis=1).max(axis=1)
    df["atr_14"] = tr.rolling(14).mean()
    df["atr_pct"] = df["atr_14"] / df["close"]  # normalised ATR

    # Bollinger Band width
    rolling_std = df["close"].rolling(20).std()
    bb_upper = df["sma_20"] + 2 * rolling_std
    bb_lower = df["sma_20"] - 2 * rolling_std
    df["bb_width"] = (bb_upper - bb_lower) / df["sma_20"]

    # Volume ratio
    df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()

    # ── Label: forward return sign ───────────────────────────────────────────
    df["future_return"] = df["close"].shift(-horizon) / df["close"] - 1
    # NaN > 0 is False, so the last `horizon` rows (no future close) get
    # label 0 and are kept by the dropna below.
    df["label"] = (df["future_return"] > 0).astype(int)  # 1 = up, 0 = down

    # ── Drop NaN rows ────────────────────────────────────────────────────────
    feature_cols = ["return_1", "return_5", "return_10", "sma_ratio",
                    "rsi_14", "atr_pct", "bb_width", "vol_ratio"]
    df.dropna(subset=feature_cols + ["label"], inplace=True)
    X = df[feature_cols].values
    y = df["label"].values

    # ── Chronological train/test split ───────────────────────────────────────
    split = int(len(X) * (1 - test_size))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # ── Feature standardisation ──────────────────────────────────────────────
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # fit on training data only
    X_test = scaler.transform(X_test)        # apply same transform to test data

    return X_train, X_test, y_train, y_test, feature_cols, scaler, df
X_train, X_test, y_train, y_test, feature_names, scaler, df_feat = prepare_ml_dataset(df, horizon=5)
print(f"Training samples : {len(X_train)}")
print(f"Test samples : {len(X_test)}")
print(f"Features : {feature_names}")
print(f"Label balance (train): {pd.Series(y_train).value_counts().to_dict()}")
Training samples : 64
Test samples : 17
Features : ['return_1', 'return_5', 'return_10', 'sma_ratio', 'rsi_14', 'atr_pct', 'bb_width', 'vol_ratio']
Label balance (train): {0: 34, 1: 30}
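Explanation:
- shift(-horizon): Pulls the close from horizon bars ahead to build the label. This uses future data by design, so the label exists only for training and evaluation; in live trading it is unavailable at prediction time. Because NaN > 0 evaluates to False, the last horizon rows (which have no future close) receive label 0 and are not removed by dropna. This is why 81 samples remain (64 + 17) from 100 bars rather than 76. Adding future_return to the dropna subset would exclude them.
- scaler.fit_transform(X_train) then scaler.transform(X_test): The scaler is fitted exclusively on training data so test-set statistics do not leak into the normalisation parameters.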
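The handling of the final horizon rows can be checked on a small example. The sketch below is a minimal illustration on a hypothetical 12-bar series: it compares a label built by comparison alone with one that stays NaN where no future close exists, so that dropna can remove those rows.

```python
import numpy as np
import pandas as pd

horizon = 5
# Hypothetical, steadily rising close prices so every valid label is 1.
df = pd.DataFrame({"close": np.linspace(100.0, 110.0, 12)})

future_return = df["close"].shift(-horizon) / df["close"] - 1

# Comparison alone: NaN > 0 is False, so the unlabeled tail silently becomes 0.
naive_label = (future_return > 0).astype(int)

# Keep NaN where the future close does not exist yet, then drop those rows.
df["label"] = (future_return > 0).astype(float).where(future_return.notna())
labeled = df.dropna(subset=["label"])

print("tail of naive labels:", naive_label.tail(horizon).tolist())
print("rows before / after :", len(df), "/", len(labeled))
```

On a rising series every valid label is 1, so any zeros in the naive tail come from the missing future close and not from price movement.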
6. Visualization — Feature Distributions
[5]
import pandas as pd
feat_df = pd.DataFrame(X_train, columns=feature_names)
fig = make_subplots(rows=2, cols=4,
subplot_titles=feature_names)
for i, col in enumerate(feature_names):
    row = i // 4 + 1  # subplot row (1-based)
    c = i % 4 + 1     # subplot column (1-based)
    fig.add_trace(go.Histogram(x=feat_df[col], name=col, showlegend=False,
                               marker_color="steelblue"), row=row, col=c)
fig.update_layout(title_text="Scaled Feature Distributions (Training Set)",
height=500)
fig.show()
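A numeric summary can complement the histograms. The sketch below uses hypothetical raw feature arrays (not the notebook's df) to check that a StandardScaler fitted on the training portion yields zero mean and unit variance there, while the test portion is only approximately centred.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features on very different scales (illustration only).
rng = np.random.default_rng(0)
raw = rng.normal(loc=[0.0, 50.0, 1.0], scale=[0.01, 15.0, 0.3], size=(200, 3))
train_raw, test_raw = raw[:160], raw[160:]  # chronological 80/20 split

scaler = StandardScaler()
train = scaler.fit_transform(train_raw)  # statistics come from training rows only
test = scaler.transform(test_raw)        # the same statistics are reused here

print("train mean:", train.mean(axis=0).round(3))
print("train std :", train.std(axis=0).round(3))
print("test mean :", test.mean(axis=0).round(3))
print("test std  :", test.std(axis=0).round(3))
```

Test-set means and standard deviations that sit far from 0 and 1 can hint at a distribution shift between the two periods.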