Meta-Labeling

Apply Marcos Lopez de Prado's meta-labeling framework to add a secondary ML model that filters primary strategy signals.

Tags: meta-labeling, ML, signal filtering

Meta Labeling: A Two-Stage Machine Learning Approach

Meta Labeling is a two-stage machine learning approach designed to enhance the profitability of trading signals. A primary model generates directional signals (+1 for buy / -1 for sell), and a secondary (meta) model predicts whether each primary signal will be profitable. Low-confidence signals are suppressed, preventing trades based on unreliable predictions.

Architecture

Primary Model: Generates the initial directional signal (e.g., based on an EMA crossover strategy).
Meta Model: A binary classifier trained on technical features to predict the probability of the primary signal being correct (profitable).
Final Signal: The primary direction is multiplied by the meta model's binary decision (1 if the predicted probability of profit meets a specified threshold, 0 otherwise), so low-confidence signals become 0 (no trade).
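
The gating step can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the helper name gate_signals and the sample values are assumptions made for the example.

```python
import numpy as np

def gate_signals(primary, meta_proba, threshold=0.55):
    """Keep the primary direction only where the meta probability meets the threshold."""
    primary = np.asarray(primary)
    meta_proba = np.asarray(meta_proba, dtype=float)
    # NaN probabilities compare as False, so bars without a score are treated as "abstain"
    bet = (meta_proba >= threshold).astype(int)
    return primary * bet

# Hypothetical inputs: four primary signals and their meta probabilities
primary = np.array([1, -1, 1, -1])
meta_proba = np.array([0.70, 0.40, np.nan, 0.55])
print(gate_signals(primary, meta_proba))
```

Only the first and last signals survive here: the second falls below the threshold and the third has no probability at all.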

Advantages

Preservation of Primary Logic: Maintains the core directional logic of the primary model.
Simplified Classification: The meta model only needs to decide whether to 'bet' on a signal or 'abstain,' simplifying the binary classification problem.
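Class Imbalance Handling: The bet/abstain target makes class imbalance explicit (in financial data, profitable signals are often scarcer than unprofitable ones), so it can be addressed directly through the probability threshold rather than inside the primary model.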

Limitation

Primary Model Reliability: Requires a sufficiently reliable primary model; if the primary signals are random, the meta model cannot learn useful patterns.
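
One way to check this limitation before training a meta model is to measure how often the primary signal agrees with the sign of the forward return. The sketch below is an illustration under stated assumptions: the helper primary_hit_rate, the simulated price path, and the coin-flip primary signal are all hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def primary_hit_rate(close: pd.Series, primary: pd.Series, horizon: int = 5) -> float:
    """Share of bars where the primary direction matches the forward return sign."""
    fwd = close.shift(-horizon) / close - 1
    valid = fwd.notna()  # the last `horizon` bars have no forward return
    hits = np.sign(fwd[valid]) == primary[valid]
    return float(hits.mean())

rng = np.random.default_rng(0)
# Simulated positive price path (geometric random walk)
close = pd.Series(100 * np.exp(rng.normal(0, 0.01, 5000).cumsum()))
# A primary "model" that flips a coin on every bar
random_primary = pd.Series(rng.choice([-1, 1], size=len(close)))
print(round(primary_hit_rate(close, random_primary), 3))
```

A hit rate close to 0.5, as expected for coin-flip signals, suggests there is little structure for a meta model to learn from.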

1. Environment Setup

[12]

import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

2. Data Preparation

This section generates a synthetic dataset resembling OHLCV (Open, High, Low, Close, Volume) financial data. This sample data is used for demonstration purposes. Users should replace this with their actual financial time-series data, ensuring it includes datetime, open, high, low, close, and volume columns.

[17]

# Generate synthetic hourly OHLCV data
np.random.seed(42)
num_points = 2000
dates = pd.date_range(start='2023-01-01', periods=num_points, freq='h')  # 'H' is deprecated in recent pandas

# Start with a base price and apply random hourly changes
base_price = 100
price_changes = np.random.normal(0, 1, num_points).cumsum() # Random walk centered around 0
# Scale and shift to get prices in a reasonable range, ensuring fluctuation
open_prices = base_price + price_changes * 5 # Scale factor for magnitude of changes
open_prices = np.maximum(open_prices, 50) # Ensure prices don't go too low

close_prices = open_prices + np.random.uniform(-2, 2, num_points)
high_prices = np.maximum(open_prices, close_prices) + np.random.uniform(0, 1, num_points)
low_prices = np.minimum(open_prices, close_prices) - np.random.uniform(0, 1, num_points)
volume = np.random.randint(1000, 5000, num_points)

df = pd.DataFrame({
    'datetime': dates,
    'open': open_prices,
    'high': high_prices,
    'low': low_prices,
    'close': close_prices,
    'volume': volume
})

print("Sample DataFrame head:")
display(df.head())
print("Sample DataFrame info:")
df.info()  # info() prints its summary and returns None, so wrapping it in display() is unnecessary

Sample DataFrame head:

	datetime	open	high	low	close	volume
0	2023-01-01 00:00:00	102.483571	103.152447	101.967560	102.111997	4467
1	2023-01-01 01:00:00	101.792249	102.590905	99.614222	100.056289	2244
2	2023-01-01 02:00:00	105.030692	105.963445	104.091869	104.425974	2451
3	2023-01-01 03:00:00	112.645841	112.665979	110.568003	111.089834	4823
4	2023-01-01 04:00:00	111.475074	112.861799	110.889420	112.708015	2166

Sample DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  2000 non-null   datetime64[ns]
 1   open      2000 non-null   float64       
 2   high      2000 non-null   float64       
 3   low       2000 non-null   float64       
 4   close     2000 non-null   float64       
 5   volume    2000 non-null   int64         
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 93.9 KB

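
When the synthetic data is replaced with real data, a small validation step can catch missing columns or inconsistent bars early. The following is a minimal sketch; validate_ohlcv is a hypothetical helper introduced for illustration, and the specific checks are assumptions about what a reasonable OHLCV frame looks like.

```python
import pandas as pd

REQUIRED_COLUMNS = ["datetime", "open", "high", "low", "close", "volume"]

def validate_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    """Return a time-sorted copy of df after basic OHLCV sanity checks."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    out = df.copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    out = out.sort_values("datetime", ignore_index=True)

    # Each bar's high must be >= open and close; its low must be <= open and close
    body_max = out[["open", "close"]].max(axis=1)
    body_min = out[["open", "close"]].min(axis=1)
    bad = (out["high"] < body_max) | (out["low"] > body_min)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} bars have inconsistent high/low values")
    return out

# Hypothetical two-bar frame, deliberately out of order
raw = pd.DataFrame({
    "datetime": ["2023-01-01 01:00", "2023-01-01 00:00"],
    "open": [10.0, 9.0], "high": [11.0, 10.0],
    "low": [9.5, 8.5], "close": [10.5, 9.5], "volume": [100, 200],
})
print(validate_ohlcv(raw))
```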

3. Meta Labeling Function Definition

The meta_labeling function processes OHLCV data to generate and filter trading signals using a two-stage approach. It first identifies primary signals (e.g., EMA crossovers) and then trains a meta-model to predict the profitability of these primary signals. The final signal is produced by gating the primary signal with the meta-model's confidence.

Core Logic

Primary Signal Generation: Exponential Moving Averages (EMA) are calculated for fast and slow periods. The primary signal is +1 on every bar where the fast EMA is above the slow EMA and -1 otherwise, so it marks the regime that follows a crossover rather than only the crossover bar itself.
Meta Label Construction: A binary 'meta label' is created to indicate whether the primary signal was profitable over a specified future horizon. A primary buy signal counts as profitable if the close is higher horizon bars later, and a primary sell signal counts as profitable if the close is lower.
Meta Feature Engineering: Technical indicators such as returns, Relative Strength Index (RSI), Simple Moving Average (SMA) ratio, and volume ratio are calculated to serve as features for the meta-model.
Meta Model Training: A GradientBoostingClassifier is trained on these features, with the meta label as the target, to predict the probability of a primary signal being profitable.
Signal Alignment: The meta model is fit on the first 80% of the cleaned rows and scores only the remaining 20%. Those out-of-sample probabilities are written back to the matching rows of the original DataFrame; all other rows keep a meta probability of NaN.
Final Signal Generation: The primary signal is filtered by the meta-model's predicted probability. Only signals with a meta probability at or above meta_threshold are kept; every other row, including rows without a probability, receives a final signal of 0.
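
The difference between a regime signal (one value per bar) and discrete crossover events can be made concrete with a short sketch. The helper ema_regime_and_crosses and the triangular test series are assumptions introduced for illustration; they are not part of the meta_labeling function.

```python
import numpy as np
import pandas as pd

def ema_regime_and_crosses(close: pd.Series, fast: int = 12, slow: int = 26) -> pd.DataFrame:
    """Return the per-bar EMA regime (+1/-1) and the bars where it flips."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    # Regime: +1 while the fast EMA is above the slow EMA, -1 otherwise
    regime = pd.Series(np.where(ema_fast > ema_slow, 1, -1), index=close.index)
    # Crossover event: the regime differs from the previous bar
    changed = regime != regime.shift(1)
    changed.iloc[0] = False  # the first bar has no previous state
    cross = regime.where(changed, 0)
    return pd.DataFrame({"regime": regime, "cross": cross})

# Hypothetical price path: a steady rise followed by a steady fall
close = pd.Series(np.concatenate([np.arange(1.0, 51.0), np.arange(50.0, 0.0, -1.0)]))
out = ema_regime_and_crosses(close)
print(out["regime"].value_counts().to_dict())
print(out.loc[out["cross"] != 0, "cross"])
```

The regime column is nonzero on every bar, while the cross column is nonzero only where the direction changes. Which of the two is fed to the meta model determines how many samples it sees and how strongly their labels overlap.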

[14]

def meta_labeling(
    df: pd.DataFrame,
    ema_fast: int = 12,
    ema_slow: int = 26,
    horizon: int = 5,
    meta_threshold: float = 0.55,
) -> pd.DataFrame:
    """
    Applies meta labeling by training a secondary model to filter a primary EMA signal.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame with 'datetime', 'open', 'high', 'low', 'close', and 'volume' columns.
    ema_fast : int
        Period for the fast Exponential Moving Average (EMA) in the primary signal.
    ema_slow : int
        Period for the slow Exponential Moving Average (EMA) in the primary signal.
    horizon : int
        Number of bars to hold a trade for profitability calculation of the meta label.
    meta_threshold : float
        Minimum meta probability required for a primary signal to be accepted.

    Returns
    -------
    pd.DataFrame
        Original DataFrame with added columns: 'ema_fast', 'ema_slow', 'primary_signal',
        'meta_label', 'return_1', 'return_5', 'rsi', 'sma_ratio', 'vol_ratio',
        'meta_proba', and 'signal'.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)

    # Primary signal (EMA cross)
    df["ema_fast"] = df["close"].ewm(span=ema_fast, adjust=False).mean()
    df["ema_slow"] = df["close"].ewm(span=ema_slow, adjust=False).mean()
    df["primary_signal"] = np.where(df["ema_fast"] > df["ema_slow"], 1, -1)

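    # Meta label: was the primary signal profitable?
    future_return = df["close"].shift(-horizon) / df["close"] - 1
    # Profitable if primary direction matches forward return sign
    profitable = (
        ((df["primary_signal"] == 1) & (future_return > 0))
        | ((df["primary_signal"] == -1) & (future_return < 0))
    )
    # The last `horizon` rows have no forward return; leave their label as NaN so
    # dropna() removes them instead of counting them as unprofitable (0).
    df["meta_label"] = profitable.astype(float).where(future_return.notna())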

    # Meta features
    df["return_1"]  = df["close"].pct_change(1)
    df["return_5"]  = df["close"].pct_change(5)

    # RSI from simple 14-bar rolling means of gains and losses (not Wilder's smoothing)
    delta = df["close"].diff()
    gain  = delta.clip(lower=0)
    loss  = -delta.clip(upper=0)  # losses as non-negative magnitudes

    avg_gain = gain.rolling(14).mean()
    avg_loss = loss.rolling(14).mean()

    # Guard the division; zero-loss rows are overwritten just below
    rs = avg_gain / avg_loss.replace(0, 1e-9)
    df["rsi"] = 100 - (100 / (1 + rs))

    # Special cases: no losses -> 100, no gains -> 0, completely flat window -> neutral 50
    df.loc[avg_loss == 0, "rsi"] = 100
    df.loc[avg_gain == 0, "rsi"] = 0
    df.loc[(avg_gain == 0) & (avg_loss == 0), "rsi"] = 50
    df["rsi"] = df["rsi"].fillna(50)  # warm-up rows (first 14 bars) get a neutral 50

    df["sma_ratio"] = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
    df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()

    feats = ["return_1", "return_5", "rsi", "sma_ratio", "vol_ratio"]

    print(f"DEBUG_PRE_DROPNA: df shape: {df.shape}")
    print(f"DEBUG_PRE_DROPNA: NaNs in feats and meta_label columns before dropna:")
    for col in feats + ["meta_label"]:
        print(f"  {col}: {df[col].isna().sum()} NaNs")

    df_clean = df.dropna(subset=feats + ["meta_label"]).copy()

    X = df_clean[feats].values
    y = df_clean["meta_label"].astype(int).values  # integer class labels (0 = abstain, 1 = bet)
    split = int(len(X) * 0.8)

    print(f"DEBUG: Length of df_clean after dropna: {len(df_clean)}")
    print(f"DEBUG: Length of X (features array): {len(X)}")
    print(f"DEBUG: Calculated split point (80% for train): {split}")
    print(f"DEBUG: Number of training samples (X[:split]): {len(X[:split])}")
    print(f"DEBUG: Number of testing samples (X[split:]): {len(X[split:])}")

    # Robustness check for insufficient data for both training and testing sets
    num_samples_X = len(X)
    num_train_samples = split
    num_test_samples = num_samples_X - split

    if num_samples_X == 0:
        print(f"\nWarning: After dropping NaNs, the dataset for features is empty. Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df

    if num_train_samples == 0:
        print(f"\nWarning: The training set is empty after splitting ({num_train_samples} samples). Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df

    if num_test_samples == 0:
        print(f"\nWarning: The test set is empty after splitting ({num_test_samples} samples). Cannot evaluate meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df

    # Check for sufficient classes in the training set
    unique_classes_train = np.unique(y[:split])
    if len(unique_classes_train) < 2:
        print(f"\nWarning: Training set for meta-model contains only 1 class (all {unique_classes_train[0]}s). Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df

    print(f"DEBUG: Before StandardScaler fit_transform. X[:split].shape: {X[:split].shape}")
    sc = StandardScaler()
    X_tr = sc.fit_transform(X[:split])
    X_te = sc.transform(X[split:])

    meta_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
    meta_clf.fit(X_tr, y[:split])

    meta_proba = meta_clf.predict_proba(X_te)[:, 1]

    # Align meta probabilities back to DataFrame
    test_idx = df_clean.index[split:]
    df["meta_proba"] = np.nan
    df.loc[test_idx, "meta_proba"] = meta_proba

    # Final signal: primary direction, gated by meta confidence
    df["signal"] = 0
    df.loc[(df["meta_proba"] >= meta_threshold) & (df["primary_signal"] == 1),  "signal"] =  1
    df.loc[(df["meta_proba"] >= meta_threshold) & (df["primary_signal"] == -1), "signal"] = -1

    print("\nClassification Report for Meta Model (Test Set):")
    print(classification_report(y[split:], (meta_proba >= meta_threshold).astype(int)))
    return df
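
Because each meta label looks horizon bars ahead, the last few training rows in a plain chronological split are labeled with prices that fall inside the test window. One common remedy is to drop ("purge") those rows from the training set. The sketch below is a simplified version of that idea under stated assumptions: purged_split is a hypothetical helper, and it assumes the rows are already sorted by time with one label per row.

```python
import numpy as np

def purged_split(n_samples: int, train_frac: float = 0.8, horizon: int = 5):
    """Chronological split that drops training rows whose labels reach into the test set."""
    split = int(n_samples * train_frac)
    # A label at row i uses the close at row i + horizon, so the last usable
    # training row is split - horizon - 1.
    train_end = max(split - horizon, 0)
    train_idx = np.arange(0, train_end)
    test_idx = np.arange(split, n_samples)
    return train_idx, test_idx

train_idx, test_idx = purged_split(100, train_frac=0.8, horizon=5)
print(train_idx[-1], test_idx[0])
```

With 100 rows and a 5-bar horizon, the training set ends at row 74 and the test set starts at row 80, so no training label depends on a test-period price.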

4. Applying Meta Labeling

Execute the meta_labeling function on the prepared DataFrame. This step generates the primary signals, meta labels, meta features, and the final filtered signals based on the meta-model's predictions. The distribution of the final signals is then displayed.

[15]

df_signals = meta_labeling(df, meta_threshold=0.55)
print("\n--- Final Signal Distribution ---")
print(df_signals["signal"].value_counts())

DEBUG_PRE_DROPNA: df shape: (2000, 15)
DEBUG_PRE_DROPNA: NaNs in feats and meta_label columns before dropna:
  return_1: 1 NaNs
  return_5: 5 NaNs
  rsi: 0 NaNs
  sma_ratio: 19 NaNs
  vol_ratio: 19 NaNs
  meta_label: 0 NaNs
DEBUG: Length of df_clean after dropna: 1981
DEBUG: Length of X (features array): 1981
DEBUG: Calculated split point (80% for train): 1584
DEBUG: Number of training samples (X[:split]): 1584
DEBUG: Number of testing samples (X[split:]): 397

Warning: Training set for meta-model contains only 1 class (all 1s). Cannot train meta-model.

--- Final Signal Distribution ---
signal
0    2000
Name: count, dtype: int64

5. Interpretation of Meta Labels and Thresholds

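Meta Label Construction: The profitability of a primary signal is determined by realized forward returns, so the label is known only in hindsight. It serves as the training target and must never be used as a feature. Training on it teaches the meta model to separate primary signals that historically led to profits from those that led to losses.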
meta_threshold: A parameter that defines the minimum confidence level (probability) from the meta model required for a primary signal to be acted upon. A meta_threshold of 0.55 indicates that only signals with at least a 55% predicted probability of being correct are considered valid. Higher values for this threshold typically reduce the number of generated signals but aim to improve the precision of the accepted signals.
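
The trade-off between signal count and precision can be inspected by sweeping the threshold over held-out predictions. This is a minimal sketch with made-up labels and probabilities; threshold_sweep is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def threshold_sweep(y_true, proba, thresholds):
    """For each threshold, report how many signals are accepted and their precision."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    rows = []
    for t in thresholds:
        accepted = proba >= t
        n_accepted = int(accepted.sum())
        # Precision = share of accepted signals that were actually profitable
        precision = float(y_true[accepted].mean()) if n_accepted else float("nan")
        rows.append({"threshold": t, "n_accepted": n_accepted, "precision": precision})
    return rows

# Hypothetical held-out labels and meta probabilities
y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([0.90, 0.80, 0.60, 0.56, 0.30])
for row in threshold_sweep(y_true, proba, [0.50, 0.70, 0.95]):
    print(row)
```

Raising the threshold always shrinks the accepted set, but as this toy data shows, precision is not guaranteed to rise with it; the sweep should be run on data the meta model has not seen.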

6. Signal Visualization

This section visualizes the generated buy and sell signals in conjunction with the price data. The plotly library is used to create an interactive candlestick chart, overlaid with the meta-labeled buy (triangle-up) and sell (triangle-down) signals. A separate subplot displays the meta-model's probability (meta_proba), with a horizontal line indicating the meta_threshold.

[16]

buy_signals  = df_signals[df_signals["signal"] ==  1]
sell_signals = df_signals[df_signals["signal"] == -1]

fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
    subplot_titles=["Price + Meta-Labeled Signals", "Meta Probability"],
    row_heights=[0.65, 0.35])

fig.add_trace(go.Candlestick(x=df_signals["datetime"],
    open=df_signals["open"], high=df_signals["high"],
    low=df_signals["low"], close=df_signals["close"], name="Price"), row=1, col=1)

fig.add_trace(go.Scatter(x=buy_signals["datetime"],  y=buy_signals["low"]  * 0.999,
    mode="markers", marker=dict(symbol="triangle-up",   size=10, color="green"), name="Buy"),  row=1, col=1)

fig.add_trace(go.Scatter(x=sell_signals["datetime"], y=sell_signals["high"] * 1.001,
    mode="markers", marker=dict(symbol="triangle-down", size=10, color="red"),   name="Sell"), row=1, col=1)

fig.add_trace(go.Scatter(x=df_signals["datetime"], y=df_signals["meta_proba"],
    mode="lines", name="Meta P(correct)", line=dict(color="purple")), row=2, col=1)

# Threshold line; keep y equal to the meta_threshold passed to meta_labeling (0.55 here)
fig.add_hline(y=0.55, line_dash="dot", line_color="orange", row=2, col=1)

fig.update_layout(title_text="Meta Labeling Results",
    xaxis_rangeslider_visible=False, height=700, xaxis2_title="Datetime",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))

fig.show()