Meta-Labeling
Apply Marcos Lopez de Prado's meta-labeling framework to add a secondary ML model that filters primary strategy signals.
Meta Labeling: A Two-Stage Machine Learning Approach
Meta Labeling is a two-stage machine learning approach designed to enhance the profitability of trading signals. A primary model generates directional signals (+1 for buy / -1 for sell), and a secondary (meta) model predicts whether each primary signal will be profitable. Low-confidence signals are suppressed, preventing trades based on unreliable predictions.
Architecture
- Primary Model: Generates the initial directional signal (e.g., based on an EMA crossover strategy).
- Meta Model: A binary classifier trained on technical features to predict the probability of the primary signal being correct (profitable).
- Final Signal: The original directional signal is multiplied by the meta model's predicted binary outcome (1 if profitable, 0 otherwise). Signals with meta-model confidence below a specified threshold are filtered out.
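The gating step described above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the helper name `gate_signals` and the toy inputs are assumptions made for the example.

```python
import numpy as np

def gate_signals(primary, meta_proba, threshold=0.55):
    """Keep the primary direction only where the meta probability clears the threshold."""
    primary = np.asarray(primary)
    meta_proba = np.asarray(meta_proba, dtype=float)
    # Binary bet/abstain decision; NaN probabilities compare as False and are dropped
    take_bet = (meta_proba >= threshold).astype(int)
    # Multiplying direction (+1/-1) by the 0/1 decision yields the final signal
    return primary * take_bet

primary = [1, -1, 1, -1]               # hypothetical primary signals
meta_proba = [0.70, 0.40, 0.52, 0.61]  # hypothetical meta-model probabilities
final = gate_signals(primary, meta_proba)
print(final)  # [ 1  0  0 -1]
```

Only the first and last signals survive, because only their probabilities reach the 0.55 threshold.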
Advantages
- Preservation of Primary Logic: Maintains the core directional logic of the primary model.
- Simplified Classification: The meta model only needs to decide whether to 'bet' on a signal or 'abstain,' simplifying the binary classification problem.
- Asymmetric Imbalance Handling: Can help manage class imbalance, which is common in financial data where profitable signals are often scarcer than unprofitable ones.
Limitation
- Primary Model Reliability: Requires a sufficiently reliable primary model; if the primary signals are random, the meta model cannot learn useful patterns.
1. Environment Setup
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
2. Data Preparation
This section generates a synthetic OHLCV (Open, High, Low, Close, Volume) dataset for demonstration purposes. Replace it with your own financial time series, making sure it includes datetime, open, high, low, close, and volume columns.
# Generate synthetic OHLCV data with more fluctuation
np.random.seed(42)
num_points = 2000 # Number of hourly bars to simulate
dates = pd.date_range(start='2023-01-01', periods=num_points, freq='h')
# Start with a base price and apply random per-bar changes
base_price = 100
price_changes = np.random.normal(0, 1, num_points).cumsum() # Random walk centered around 0
# Scale and shift to get prices in a reasonable range, ensuring fluctuation
open_prices = base_price + price_changes * 5 # Scale factor for magnitude of changes
open_prices = np.maximum(open_prices, 50) # Ensure prices don't go too low
close_prices = open_prices + np.random.uniform(-2, 2, num_points)
high_prices = np.maximum(open_prices, close_prices) + np.random.uniform(0, 1, num_points)
low_prices = np.minimum(open_prices, close_prices) - np.random.uniform(0, 1, num_points)
volume = np.random.randint(1000, 5000, num_points)
df = pd.DataFrame({
    'datetime': dates,
    'open': open_prices,
    'high': high_prices,
    'low': low_prices,
    'close': close_prices,
    'volume': volume
})
print("Sample DataFrame head:")
display(df.head())
print("Sample DataFrame info:")
df.info() # info() prints its summary directly and returns None, so it is not wrapped in display()
Sample DataFrame head:
| | datetime | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00 | 102.483571 | 103.152447 | 101.967560 | 102.111997 | 4467 |
| 1 | 2023-01-01 01:00:00 | 101.792249 | 102.590905 | 99.614222 | 100.056289 | 2244 |
| 2 | 2023-01-01 02:00:00 | 105.030692 | 105.963445 | 104.091869 | 104.425974 | 2451 |
| 3 | 2023-01-01 03:00:00 | 112.645841 | 112.665979 | 110.568003 | 111.089834 | 4823 |
| 4 | 2023-01-01 04:00:00 | 111.475074 | 112.861799 | 110.889420 | 112.708015 | 2166 |
Sample DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  2000 non-null   datetime64[ns]
 1   open      2000 non-null   float64
 2   high      2000 non-null   float64
 3   low       2000 non-null   float64
 4   close     2000 non-null   float64
 5   volume    2000 non-null   int64
dtypes: datetime64[ns](1), float64(4), int64(1)
memory usage: 93.9 KB
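When swapping in real data, a quick schema check before calling the pipeline may save debugging time. The sketch below is one possible approach; the helper `check_ohlcv` and the specific sanity rules are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ["datetime", "open", "high", "low", "close", "volume"]

def check_ohlcv(frame: pd.DataFrame) -> pd.DataFrame:
    """Validate column names and basic bar consistency; return a time-sorted copy."""
    missing = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    out = frame.copy()
    out["datetime"] = pd.to_datetime(out["datetime"])
    out = out.sort_values("datetime", ignore_index=True)
    # A bar's high must not sit below its open/close, nor its low above them
    body_top = out[["open", "close"]].max(axis=1)
    body_bottom = out[["open", "close"]].min(axis=1)
    bad = (out["high"] < body_top) | (out["low"] > body_bottom)
    if bad.any():
        raise ValueError(f"{int(bad.sum())} bars have inconsistent high/low values")
    return out

# Hypothetical two-bar sample, deliberately out of chronological order
sample = pd.DataFrame({
    "datetime": ["2023-01-01 01:00", "2023-01-01 00:00"],
    "open": [101.0, 100.0],
    "high": [102.0, 101.5],
    "low": [100.5, 99.0],
    "close": [101.5, 101.0],
    "volume": [1200, 1500],
})
checked = check_ohlcv(sample)
print(checked["close"].tolist())  # [101.0, 101.5]
```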
3. Meta Labeling Function Definition
The meta_labeling function processes OHLCV data to generate and filter trading signals using a two-stage approach. It first identifies primary signals (e.g., EMA crossovers) and then trains a meta-model to predict the profitability of these primary signals. The final signal is produced by gating the primary signal with the meta-model's confidence.
Core Logic
- Primary Signal Generation: Exponential Moving Averages (EMA) are calculated for fast and slow periods. The primary signal is +1 on every bar where the fast EMA is above the slow EMA and -1 otherwise, so it changes sign at each crossover and is never flat.
- Meta Label Construction: A binary 'meta label' is created to indicate whether the primary signal was profitable over a specified future horizon. A primary buy signal is profitable if the price increases, and a primary sell signal is profitable if the price decreases.
- Meta Feature Engineering: Technical indicators such as returns, Relative Strength Index (RSI), Simple Moving Average (SMA) ratio, and volume ratio are calculated to serve as features for the meta-model.
- Meta Model Training: A GradientBoostingClassifier is trained on these features, using the chronologically first 80% of rows and the meta label as the target, to predict the probability of a primary signal being profitable.
- Signal Alignment: The meta probabilities predicted for the held-out final 20% of rows are aligned back to the original DataFrame; all other rows keep a missing meta probability.
- Final Signal Generation: The primary signal is filtered based on the meta-model's predicted probability. Only signals with a meta probability of at least meta_threshold are considered valid, resulting in a final signal that accounts for both direction and predicted profitability.
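The meta label construction can be checked by hand on a toy series. The sketch below is illustrative only: `build_meta_label` is a hypothetical helper, and unlike a plain boolean cast it marks bars without a forward return as NaN, which is a deliberate variation so that unlabeled bars can be dropped.

```python
import numpy as np
import pandas as pd

def build_meta_label(close, primary_signal, horizon):
    """Return 1.0 where the primary direction matched the forward return sign, else 0.0."""
    close = pd.Series(close, dtype=float)
    primary_signal = pd.Series(primary_signal)
    future_return = close.shift(-horizon) / close - 1
    profitable = (
        ((primary_signal == 1) & (future_return > 0)) |
        ((primary_signal == -1) & (future_return < 0))
    )
    label = profitable.astype(float)
    # Assumption of this sketch: bars with no forward return are left unlabeled
    label[future_return.isna()] = np.nan
    return label

close = [100, 101, 99, 102, 98]   # toy closing prices
primary = [1, 1, -1, -1, 1]       # toy primary signals
labels = build_meta_label(close, primary, horizon=1)
print(labels.tolist())  # [1.0, 0.0, 0.0, 1.0, nan]
```

The first buy is followed by a rise (label 1), the second buy by a fall (label 0), the first sell by a rise (label 0), and the second sell by a fall (label 1). The last bar has no future price and stays unlabeled.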
def meta_labeling(
    df: pd.DataFrame,
    ema_fast: int = 12,
    ema_slow: int = 26,
    horizon: int = 5,
    meta_threshold: float = 0.55,
) -> pd.DataFrame:
    """
    Applies meta labeling by training a secondary model to filter a primary EMA signal.
    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame with 'datetime', 'open', 'high', 'low', 'close', and 'volume' columns.
    ema_fast : int
        Period for the fast Exponential Moving Average (EMA) in the primary signal.
    ema_slow : int
        Period for the slow Exponential Moving Average (EMA) in the primary signal.
    horizon : int
        Number of bars to hold a trade for profitability calculation of the meta label.
    meta_threshold : float
        Minimum meta probability required for a primary signal to be accepted.
    Returns
    -------
    pd.DataFrame
        Original DataFrame with added columns: 'ema_fast', 'ema_slow', 'primary_signal',
        'meta_label', 'return_1', 'return_5', 'rsi', 'sma_ratio', 'vol_ratio',
        'meta_proba', and 'signal'.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)
    # Primary signal: +1 while the fast EMA is above the slow EMA, -1 otherwise
    df["ema_fast"] = df["close"].ewm(span=ema_fast, adjust=False).mean()
    df["ema_slow"] = df["close"].ewm(span=ema_slow, adjust=False).mean()
    df["primary_signal"] = np.where(df["ema_fast"] > df["ema_slow"], 1, -1)
    # Meta label: was the primary signal profitable?
    future_return = df["close"].shift(-horizon) / df["close"] - 1
    # Profitable if primary direction matches forward return sign.
    # Note: the last `horizon` rows have no forward return (NaN); both comparisons
    # are False there, so those rows are labeled 0 rather than dropped.
    df["meta_label"] = (
        ((df["primary_signal"] == 1) & (future_return > 0)) |
        ((df["primary_signal"] == -1) & (future_return < 0))
    ).astype(int)
    # Meta features
    df["return_1"] = df["close"].pct_change(1)
    df["return_5"] = df["close"].pct_change(5)
    # RSI from simple rolling means of gains and losses, guarding against zero division
    delta = df["close"].diff()
    gain = delta.clip(lower=0)
    loss = -delta.clip(upper=0)
    avg_gain = gain.rolling(14).mean()
    avg_loss = loss.rolling(14).mean().abs() # Ensure avg_loss is positive
    # Calculate RS, replacing a zero average loss with a small epsilon to avoid division by zero
    rs = avg_gain / avg_loss.replace(0, 1e-9)
    df["rsi"] = 100 - (100 / (1 + rs))
    # Special cases: if no average loss, RSI is 100; if no average gain, RSI is 0
    df.loc[avg_loss == 0, "rsi"] = 100
    df.loc[avg_gain == 0, "rsi"] = 0
    df["rsi"] = df["rsi"].fillna(50) # Fill any remaining NaNs (e.g., initial period) with a neutral 50
    df["sma_ratio"] = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
    df["vol_ratio"] = df["volume"] / df["volume"].rolling(20).mean()
    feats = ["return_1", "return_5", "rsi", "sma_ratio", "vol_ratio"]
    print(f"DEBUG_PRE_DROPNA: df shape: {df.shape}")
    print("DEBUG_PRE_DROPNA: NaNs in feats and meta_label columns before dropna:")
    for col in feats + ["meta_label"]:
        print(f" {col}: {df[col].isna().sum()} NaNs")
    df_clean = df.dropna(subset=feats + ["meta_label"]).copy()
    X = df_clean[feats].values
    y = df_clean["meta_label"].values
    # Chronological split: first 80% of rows for training, last 20% for testing
    split = int(len(X) * 0.8)
    print(f"DEBUG: Length of df_clean after dropna: {len(df_clean)}")
    print(f"DEBUG: Length of X (features array): {len(X)}")
    print(f"DEBUG: Calculated split point (80% for train): {split}")
    print(f"DEBUG: Number of training samples (X[:split]): {len(X[:split])}")
    print(f"DEBUG: Number of testing samples (X[split:]): {len(X[split:])}")
    # Robustness check for insufficient data for both training and testing sets
    num_samples_X = len(X)
    num_train_samples = split
    num_test_samples = num_samples_X - split
    if num_samples_X == 0:
        print("\nWarning: After dropping NaNs, the dataset for features is empty. Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df
    if num_train_samples == 0:
        print(f"\nWarning: The training set is empty after splitting ({num_train_samples} samples). Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df
    if num_test_samples == 0:
        print(f"\nWarning: The test set is empty after splitting ({num_test_samples} samples). Cannot evaluate meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df
    # Check for sufficient classes in the training set
    unique_classes_train = np.unique(y[:split])
    if len(unique_classes_train) < 2:
        print(f"\nWarning: Training set for meta-model contains only 1 class (all {unique_classes_train[0]}s). Cannot train meta-model.")
        df["meta_proba"] = np.nan
        df["signal"] = 0
        return df
    print(f"DEBUG: Before StandardScaler fit_transform. X[:split].shape: {X[:split].shape}")
    # Fit the scaler on the training rows only, then apply it to the test rows
    sc = StandardScaler()
    X_tr = sc.fit_transform(X[:split])
    X_te = sc.transform(X[split:])
    meta_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
    meta_clf.fit(X_tr, y[:split])
    meta_proba = meta_clf.predict_proba(X_te)[:, 1]
    # Align meta probabilities back to DataFrame (test rows only; other rows stay NaN)
    test_idx = df_clean.index[split:]
    df["meta_proba"] = np.nan
    df.loc[test_idx, "meta_proba"] = meta_proba
    # Final signal: primary direction, gated by meta confidence
    df["signal"] = 0
    df.loc[(df["meta_proba"] >= meta_threshold) & (df["primary_signal"] == 1), "signal"] = 1
    df.loc[(df["meta_proba"] >= meta_threshold) & (df["primary_signal"] == -1), "signal"] = -1
    print("\nClassification Report for Meta Model (Test Set):")
    print(classification_report(y[split:], (meta_proba >= meta_threshold).astype(int)))
    return df
4. Applying Meta Labeling
Execute the meta_labeling function on the prepared DataFrame. This step generates the primary signals, meta labels, meta features, and the final filtered signals based on the meta-model's predictions. The distribution of the final signals is then displayed.
df_signals = meta_labeling(df, meta_threshold=0.55)
print("\n--- Final Signal Distribution ---")
print(df_signals["signal"].value_counts())
DEBUG_PRE_DROPNA: df shape: (2000, 15)
DEBUG_PRE_DROPNA: NaNs in feats and meta_label columns before dropna:
 return_1: 1 NaNs
 return_5: 5 NaNs
 rsi: 0 NaNs
 sma_ratio: 19 NaNs
 vol_ratio: 19 NaNs
 meta_label: 0 NaNs
DEBUG: Length of df_clean after dropna: 1981
DEBUG: Length of X (features array): 1981
DEBUG: Calculated split point (80% for train): 1584
DEBUG: Number of training samples (X[:split]): 1584
DEBUG: Number of testing samples (X[split:]): 397
Warning: Training set for meta-model contains only 1 class (all 1s). Cannot train meta-model.
--- Final Signal Distribution ---
signal
0    2000
Name: count, dtype: int64
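The recorded run stops at the single-class check, so every final signal is 0. A small diagnostic like the one below can show how labels are distributed across a chronological split before training is attempted. It is a sketch under assumptions: `label_balance` is a hypothetical helper and the toy labels are made up for the example.

```python
import numpy as np

def label_balance(labels, train_frac=0.8):
    """Count each label value in the chronological train and test portions."""
    labels = np.asarray(labels)
    split = int(len(labels) * train_frac)
    report = {}
    for name, part in (("train", labels[:split]), ("test", labels[split:])):
        values, counts = np.unique(part, return_counts=True)
        report[name] = {int(v): int(c) for v, c in zip(values, counts)}
    return report

toy_labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]  # toy data: training part holds a single class
balance = label_balance(toy_labels)
print(balance)  # {'train': {1: 8}, 'test': {0: 1, 1: 1}}
```

A training portion that contains only one key, as in this toy case, is the condition that triggers the warning. Applying the helper to the `meta_label` column of the returned DataFrame gives an approximate view only, since the function splits after dropping rows with missing features.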
5. Interpretation of Meta Labels and Thresholds
- Meta Label Construction: The profitability of a primary signal is determined by actual forward returns. This ensures the meta model learns to differentiate between primary signals that historically led to profits versus losses.
- meta_threshold: A parameter that defines the minimum confidence level (probability) from the meta model required for a primary signal to be acted upon. A meta_threshold of 0.55 indicates that only signals with at least a 55% predicted probability of being correct are considered valid. Higher values for this threshold typically reduce the number of generated signals but aim to improve the precision of the accepted signals.
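The trade-off between signal count and precision can be explored with a simple sweep. The following is a minimal sketch on made-up numbers: `threshold_sweep`, the probabilities, and the labels are all hypothetical, and real results depend on how well calibrated the meta model is.

```python
import numpy as np

def threshold_sweep(meta_proba, meta_label, thresholds):
    """For each threshold, report how many signals are accepted and their precision."""
    meta_proba = np.asarray(meta_proba, dtype=float)
    meta_label = np.asarray(meta_label, dtype=int)
    rows = []
    for t in thresholds:
        accepted = meta_proba >= t
        n = int(accepted.sum())
        # Precision = share of accepted signals that were actually profitable
        precision = float(meta_label[accepted].mean()) if n else float("nan")
        rows.append((t, n, precision))
    return rows

proba = [0.30, 0.52, 0.56, 0.61, 0.74, 0.90]  # hypothetical meta probabilities
label = [0, 0, 1, 0, 1, 1]                    # hypothetical true meta labels
for t, n, p in threshold_sweep(proba, label, [0.50, 0.55, 0.70]):
    print(f"threshold={t:.2f} accepted={n} precision={p:.2f}")
# threshold=0.50 accepted=5 precision=0.60
# threshold=0.55 accepted=4 precision=0.75
# threshold=0.70 accepted=2 precision=1.00
```

In this toy case precision rises as the threshold increases while the number of accepted signals falls. On real data the relationship is not guaranteed to be monotonic.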
6. Signal Visualization
This section visualizes the generated buy and sell signals in conjunction with the price data. The plotly library is used to create an interactive candlestick chart, overlaid with the meta-labeled buy (triangle-up) and sell (triangle-down) signals. A separate subplot displays the meta-model's probability (meta_proba), with a horizontal line indicating the meta_threshold.
buy_signals = df_signals[df_signals["signal"] == 1]
sell_signals = df_signals[df_signals["signal"] == -1]
fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    subplot_titles=["Price + Meta-Labeled Signals", "Meta Probability"],
                    row_heights=[0.65, 0.35])
fig.add_trace(go.Candlestick(x=df_signals["datetime"],
                             open=df_signals["open"], high=df_signals["high"],
                             low=df_signals["low"], close=df_signals["close"], name="Price"), row=1, col=1)
fig.add_trace(go.Scatter(x=buy_signals["datetime"], y=buy_signals["low"] * 0.999,
                         mode="markers", marker=dict(symbol="triangle-up", size=10, color="green"), name="Buy"), row=1, col=1)
fig.add_trace(go.Scatter(x=sell_signals["datetime"], y=sell_signals["high"] * 1.001,
                         mode="markers", marker=dict(symbol="triangle-down", size=10, color="red"), name="Sell"), row=1, col=1)
fig.add_trace(go.Scatter(x=df_signals["datetime"], y=df_signals["meta_proba"],
                         mode="lines", name="Meta P(correct)", line=dict(color="purple")), row=2, col=1)
fig.add_hline(y=0.55, line_dash="dot", line_color="orange", row=2, col=1)
fig.update_layout(title_text="Meta Labeling Results",
                  xaxis_rangeslider_visible=False, height=700, xaxis2_title="Datetime",
                  legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
fig.show()