
Prepare ML Dataset

Full ML data preparation pipeline: feature engineering from OHLCV, labeling, a time-series-aware train/test split, and scaling.

Tags: feature engineering, preprocessing, dataset



1. Dependency Installation

[1]
!pip install pandas numpy plotly scikit-learn ta

2. Library Imports

[2]
import warnings; warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler

3. Strategy Overview

ML dataset preparation transforms raw OHLCV price data into a structured feature matrix suitable for supervised machine learning.

Pipeline stages:

  1. Feature engineering: compute technical indicators (returns, moving averages, RSI, ATR, Bollinger Bands, volume ratios) as input features.
  2. Label generation: assign a binary or ternary target based on forward returns over a prediction horizon.
  3. Cleaning: drop rows with NaN values introduced by rolling calculations.
  4. Splitting: partition into training and test sets using a time-ordered (non-shuffled) split to prevent look-ahead leakage.
  5. Scaling: standardise features to zero mean and unit variance, using statistics from the training set only.
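Stage 2 mentions a ternary target, while the function later in this notebook builds only a binary one. The following is a minimal sketch of how a ternary label could look. The function name `ternary_label` and the 1% dead-zone `band` are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd


def ternary_label(close: pd.Series, horizon: int = 5, band: float = 0.01) -> pd.Series:
    """Label each bar 1 (up), -1 (down) or 0 (flat) from its forward return.

    `band` is a hypothetical dead-zone width chosen for illustration.
    Bars whose forward return is unknown (the last `horizon` bars) stay NaN.
    """
    future_return = close.shift(-horizon) / close - 1
    label = pd.Series(0.0, index=close.index)
    label[future_return > band] = 1.0
    label[future_return < -band] = -1.0
    label[future_return.isna()] = np.nan   # no future close available
    return label


close = pd.Series([100.0, 101.0, 103.0, 102.0, 99.0, 100.0])
labels = ternary_label(close, horizon=2, band=0.01)
print(labels.tolist())   # [1.0, 0.0, -1.0, -1.0, nan, nan]
```

Keeping the unknown rows as NaN, rather than letting them default to a class, makes it straightforward to remove them in the cleaning stage.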

Limitation: All features must be computed using only data available at or before the current bar (for example with .shift(1) or trailing .rolling() windows); any forward-looking calculation introduces data leakage that invalidates model evaluation.
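One way to check the limitation above is a truncation test: a feature that uses only past data should give the same values for early bars whether or not later bars exist. The sketch below assumes a hypothetical helper `uses_only_past` and a synthetic price series; it is an illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd


def uses_only_past(feature_fn, series: pd.Series, cut: int) -> bool:
    """Return True if feature values before `cut` are unchanged when later rows are removed."""
    full = feature_fn(series).iloc[:cut]
    truncated = feature_fn(series.iloc[:cut])
    return bool(np.allclose(full, truncated, equal_nan=True))


rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())   # synthetic prices


def causal(s: pd.Series) -> pd.Series:
    return s.rolling(10).mean()                # trailing window: current and earlier bars


def leaky(s: pd.Series) -> pd.Series:
    return s.rolling(10, center=True).mean()   # centred window: includes later bars


print(uses_only_past(causal, close, cut=150))  # True
print(uses_only_past(leaky, close, cut=150))   # False
```

The centred window fails because its values near the cut depend on bars that the truncated series does not contain.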

4. Data Generation (Placeholder)

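This section would typically load or generate the OHLCV data. The cell below builds a random placeholder DataFrame instead. Each column is drawn independently, so the usual OHLC constraints (for example, high ≥ low) do not hold; the data serves only to exercise the pipeline mechanics.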

[3]

# Creating a sample DataFrame for demonstration purposes
data = {
    'datetime': pd.to_datetime(pd.date_range(start='2023-01-01', periods=100)),
    'open': np.random.rand(100) * 100 + 100,
    'high': np.random.rand(100) * 100 + 101,
    'low': np.random.rand(100) * 100 + 99,
    'close': np.random.rand(100) * 100 + 100,
    'volume': np.random.rand(100) * 1000 + 1000
}
df = pd.DataFrame(data)

print("Sample DataFrame 'df' created.")
display(df.head())
Sample DataFrame 'df' created.
datetime open high low close volume
0 2023-01-01 160.326010 173.185240 105.571910 140.558648 1102.553941
1 2023-01-02 110.414987 169.407946 191.405764 122.385890 1908.042468
2 2023-01-03 150.977449 191.535590 172.245328 117.909869 1843.056606
3 2023-01-04 193.350720 146.142453 193.404375 156.360302 1647.888860
4 2023-01-05 106.808465 156.484021 189.638224 199.371700 1633.494423
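In the sample output above, several rows have a low above the high, because the columns are independent random draws. Where internally consistent bars matter, a random-walk generator is one option. The sketch below is an assumption-laden illustration: the function name `make_synthetic_ohlcv`, the seed, and the volatility parameters are all hypothetical.

```python
import numpy as np
import pandas as pd


def make_synthetic_ohlcv(n: int = 100, seed: int = 42) -> pd.DataFrame:
    """Random-walk OHLCV frame with high >= max(open, close) and low <= min(open, close)."""
    rng = np.random.default_rng(seed)
    close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))   # geometric random walk
    open_ = np.concatenate(([100.0], close[:-1]))             # each bar opens at the previous close
    spread = np.abs(rng.normal(0, 0.005, n)) * close          # non-negative intrabar range
    high = np.maximum(open_, close) + spread
    low = np.minimum(open_, close) - spread
    return pd.DataFrame({
        "datetime": pd.date_range("2023-01-01", periods=n, freq="D"),
        "open": open_,
        "high": high,
        "low": low,
        "close": close,
        "volume": rng.uniform(1000, 2000, n),
    })


sample = make_synthetic_ohlcv()
print(sample.head())
```

Fixing the seed also makes downstream numbers, such as the label balance, reproducible from run to run.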

5. Feature Engineering and Dataset Preparation

[7]
def prepare_ml_dataset(
    df: pd.DataFrame,
    horizon: int = 5,
    test_size: float = 0.2,
) -> tuple:
    """
    Engineer features from OHLCV data and prepare a supervised ML dataset.

    Core logic
    ----------
    1. Compute a set of lagged technical features from price and volume.
    2. Generate a forward-return binary label: 1 if close[t+horizon] > close[t], else 0.
    3. Drop NaN rows, split chronologically, and standardise via StandardScaler.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame with columns: open, high, low, close, volume, datetime.
    horizon : int
        Number of bars ahead for the prediction target.
    test_size : float
        Fraction of data reserved for the test set (chronological tail).

    Returns
    -------
    tuple
        (X_train, X_test, y_train, y_test, feature_names, scaler, df), where df is
        the cleaned DataFrame including the feature and label columns.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)

    # ── Feature engineering ──────────────────────────────────────────────────
    df["return_1"]  = df["close"].pct_change(1)           # 1-bar return
    df["return_5"]  = df["close"].pct_change(5)           # 5-bar return
    df["return_10"] = df["close"].pct_change(10)          # 10-bar return

    df["sma_10"] = df["close"].rolling(10).mean()
    df["sma_20"] = df["close"].rolling(20).mean()
    df["sma_ratio"] = df["sma_10"] / df["sma_20"]        # fast/slow MA ratio

    # RSI (14-period)
    delta = df["close"].diff()
    gain  = delta.clip(lower=0).rolling(14).mean()
    loss  = (-delta.clip(upper=0)).rolling(14).mean()
    rs    = gain / loss.replace(0, np.nan)
    df["rsi_14"] = 100 - (100 / (1 + rs))

    # ATR (14-period)
    tr = pd.concat([
        df["high"] - df["low"],
        (df["high"] - df["close"].shift(1)).abs(),
        (df["low"]  - df["close"].shift(1)).abs(),
    ], axis=1).max(axis=1)
    df["atr_14"] = tr.rolling(14).mean()
    df["atr_pct"] = df["atr_14"] / df["close"]           # normalised ATR

    # Bollinger Band width
    rolling_std   = df["close"].rolling(20).std()
    bb_upper      = df["sma_20"] + 2 * rolling_std
    bb_lower      = df["sma_20"] - 2 * rolling_std
    df["bb_width"] = (bb_upper - bb_lower) / df["sma_20"]

    # Volume ratio
    df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()

    # ── Label: forward return ────────────────────────────────────────────────
    df["future_return"] = df["close"].shift(-horizon) / df["close"] - 1

    # ── Drop NaN rows ────────────────────────────────────────────────────────
    # Drop on future_return before labelling: (NaN > 0) evaluates to False, so
    # labelling first would mark the last `horizon` rows as 0 instead of removing them.
    feature_cols = ["return_1", "return_5", "return_10", "sma_ratio",
                    "rsi_14", "atr_pct", "bb_width", "vol_ratio"]
    df.dropna(subset=feature_cols + ["future_return"], inplace=True)
    df["label"] = (df["future_return"] > 0).astype(int)  # 1 = up, 0 = down or flat

    X = df[feature_cols].values
    y = df["label"].values

    # ── Chronological train/test split ───────────────────────────────────────
    split = int(len(X) * (1 - test_size))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # ── Feature standardisation ──────────────────────────────────────────────
    scaler  = StandardScaler()
    X_train = scaler.fit_transform(X_train)   # fit on training data only
    X_test  = scaler.transform(X_test)        # apply same transform to test data

    return X_train, X_test, y_train, y_test, feature_cols, scaler, df

X_train, X_test, y_train, y_test, feature_names, scaler, df_feat = prepare_ml_dataset(df, horizon=5)

print(f"Training samples : {len(X_train)}")
print(f"Test samples     : {len(X_test)}")
print(f"Features         : {feature_names}")
# Label counts vary from run to run because the sample data is unseeded.
print(f"Label balance (train): {pd.Series(y_train).value_counts().to_dict()}")
Training samples : 60
Test samples     : 16
Features         : ['return_1', 'return_5', 'return_10', 'sma_ratio', 'rsi_14', 'atr_pct', 'bb_width', 'vol_ratio']

Explanation:

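  • shift(-horizon): The negative shift aligns the close from horizon bars ahead with the current row, so the label deliberately uses future data. That is valid for a training target only: the column must never be used as a feature, and in live trading it is unavailable until horizon bars have passed.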
  • scaler.fit_transform(X_train) then scaler.transform(X_test): The scaler is fitted exclusively on training data so test-set statistics do not leak into the normalisation parameters.
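A related subtlety: because each label looks horizon bars ahead, the last horizon rows of a chronological training set have labels computed from closes that fall inside the test period. A common remedy is to drop (purge) those rows. The sketch below assumes time-ordered rows with one row per bar; the function name `purged_split` is hypothetical and not part of the notebook.

```python
import numpy as np


def purged_split(X: np.ndarray, y: np.ndarray, test_size: float = 0.2, horizon: int = 5):
    """Chronological split that drops the last `horizon` training rows.

    Assumes rows are in time order, one row per bar, with no gaps.
    """
    split = int(len(X) * (1 - test_size))
    train_end = max(split - horizon, 0)   # purge rows whose label window reaches the test set
    return X[:train_end], X[split:], y[:train_end], y[split:]


X = np.arange(100, dtype=float).reshape(-1, 1)   # stand-in features: the row index
y = np.zeros(100, dtype=int)
X_tr, X_te, y_tr, y_te = purged_split(X, y, test_size=0.2, horizon=5)
print(len(X_tr), len(X_te))      # 75 20
print(X_tr[-1, 0], X_te[0, 0])   # 74.0 80.0
```

In this example the last kept training row (74) has a label that depends on bar 79, which is still before the first test bar (80).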

6. Visualization — Feature Distributions

[5]
import pandas as pd

feat_df = pd.DataFrame(X_train, columns=feature_names)

fig = make_subplots(rows=2, cols=4,
    subplot_titles=feature_names)

for i, col in enumerate(feature_names):
    row = i // 4 + 1
    c    = i %  4 + 1
    fig.add_trace(go.Histogram(x=feat_df[col], name=col, showlegend=False,
                               marker_color="steelblue"), row=row, col=c)

fig.update_layout(title_text="Scaled Feature Distributions (Training Set)",
                  height=500)
fig.show()
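Alongside the histograms, a numeric check can confirm that the scaler behaved as intended: training columns should have mean 0 and standard deviation 1, while test columns will generally be close to, but not exactly at, those values. The sketch below uses a stand-in random matrix rather than the notebook's data, and the names `scaler_demo`, `train_scaled`, and `test_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = rng.normal(loc=5.0, scale=3.0, size=(200, 3))   # stand-in feature matrix
train_raw, test_raw = raw[:160], raw[160:]            # chronological-style split

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_raw)   # statistics come from training rows only
test_scaled = scaler_demo.transform(test_raw)

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std :", train_scaled.std(axis=0).round(6))
print("test mean :", test_scaled.mean(axis=0).round(3))
print("test std  :", test_scaled.std(axis=0).round(3))
```

A test-set mean or spread far from the training values is not a scaling bug; it can indicate that the feature distribution has drifted between the two periods.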