ML Feature Signal Fusion
Combine technical indicator features with ML model outputs into a single trading signal using a confidence threshold and a direction-agreement rule.
ML Feature + Signal Fusion
The ML + Signal Fusion strategy combines the probabilistic output of a trained machine learning classifier with a traditional technical signal. A trading signal is generated only when the classifier's probability clears a predefined confidence threshold and the technical signal points in the same direction.
Fusion Logic
- ML Model Output: The machine learning model produces a probability estimate, P(up), representing the likelihood of an upward price movement.
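- Technical Signal: A traditional technical indicator, such as an Exponential Moving Average (EMA) cross, provides a directional signal (+1 for bullish, -1 for bearish). The implementation in Section 2 never emits a neutral 0 from the EMA cross: a tie between the two EMAs is treated as bearish.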
- Fused Signal Generation: A signal is emitted only when the ML probability (P(up), or P(down) = 1 - P(up)) reaches a specified ml_threshold AND the technical signal aligns with the ML prediction.
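The gating rule in the list above can be sketched as a small vectorized function. This is a minimal illustration rather than the notebook's implementation: the name fuse and its arguments are hypothetical, and P(down) is taken as 1 - P(up), which assumes a two-class model.

```python
import numpy as np

def fuse(p_up, ta_signal, ml_threshold=0.6):
    """Return +1, -1, or 0 per bar from P(up) and a +1/-1 technical signal.

    Hypothetical helper for illustration only.
    """
    p_up = np.asarray(p_up, dtype=float)
    ta_signal = np.asarray(ta_signal)
    fused = np.zeros(p_up.shape, dtype=int)
    # Long only when the model is confident in "up" and the indicator is bullish.
    fused[(p_up >= ml_threshold) & (ta_signal == 1)] = 1
    # Short only when P(down) = 1 - P(up) clears the threshold and the indicator is bearish.
    fused[((1 - p_up) >= ml_threshold) & (ta_signal == -1)] = -1
    return fused

# Four bars: confident+agreeing, confident+disagreeing, confident short, unconfident.
print(fuse([0.72, 0.72, 0.30, 0.55], [1, -1, -1, 1]).tolist())  # → [1, 0, -1, 0]
```

Bars with a missing probability (NaN) fail both comparisons and therefore stay at 0.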
Advantages
- The machine learning layer can capture non-linear interactions among features that a single indicator rule would miss.
- The technical analysis layer provides interpretable structural context, which makes each signal easier to explain and audit.
- Neither component can trigger a trade on its own, so a false positive from one layer must coincide with agreement from the other before a signal is emitted.
Limitations
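- The model is trained and tested on a single chronological 80/20 split of one 500-bar synthetic series. One split over one short period is not a reliable out-of-sample assessment, and the last few training labels (the final horizon bars of the training slice) are computed from closes that fall inside the test window.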
- For production environments, implementing robust validation methodologies, such as walk-forward optimization or purged cross-validation, is essential to ensure generalization and prevent overfitting.
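One way to approximate the walk-forward validation mentioned above is scikit-learn's TimeSeriesSplit with its gap parameter. The sketch below only produces the index splits; the series length, fold count, and placeholder feature matrix are assumptions for illustration, and a fixed gap is a simpler stand-in for full purged cross-validation, not an equivalent.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

horizon = 5    # forward bars used to build each label (assumed to match the label horizon)
n_bars = 500   # illustrative series length

X = np.arange(n_bars).reshape(-1, 1)  # placeholder feature matrix, one row per bar

# gap=horizon removes the last `horizon` rows before each test fold from training,
# so no training label is computed from prices inside the test window.
splitter = TimeSeriesSplit(n_splits=4, gap=horizon)

folds = []
for train_idx, test_idx in splitter.split(X):
    folds.append((train_idx, test_idx))
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on an expanding window and tests on the block that follows, so the model would be refit once per fold and its metrics averaged across folds.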
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
from plotly.subplots import make_subplots
1. Data Generation
Synthetic 1-minute OHLCV price data is generated from a geometric random walk that starts at a close of 42,000. The data exists to exercise the pipeline end to end; because every step is random, results on it say nothing about performance on real markets.
def generate_data(periods: int) -> pd.DataFrame:
    """
    Generates synthetic OHLCV price data using a geometric random walk.

    Parameters
    ----------
    periods : int
        Number of 1-minute bars to generate.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns: open, high, low, close, volume, datetime.
    """
    start_date = pd.to_datetime("2024-01-01 00:00:00+00:00")
    datetime_index = pd.date_range(start_date, periods=periods, freq="1min", tz="UTC")
    price_data = []
    last_close = 42000
    for i in range(periods):
        open_price = last_close + np.random.normal(0, last_close * 0.0005)
        close_price = open_price + np.random.normal(0, last_close * 0.005)
        body_high = max(open_price, close_price)
        body_low = min(open_price, close_price)
        high_price = max(body_high + abs(np.random.normal(0, last_close * 0.002)), open_price, close_price)
        low_price = min(body_low - abs(np.random.normal(0, last_close * 0.002)), open_price, close_price)
        if high_price < low_price:
            high_price, low_price = low_price, high_price
        price_data.append({
            "open": max(1, int(open_price)),
            "high": max(1, int(high_price)),
            "low": max(1, int(low_price)),
            "close": max(1, int(close_price)),
        })
        last_close = close_price
    df = pd.DataFrame(price_data, index=datetime_index)
    df.index.name = "datetime"
    df["volume"] = np.random.uniform(100.0, 500.0, periods)
    df["datetime"] = df.index.to_series()
    return df.reset_index(drop=True)
df = generate_data(500)
display(df.head())
| | open | high | low | close | volume | datetime |
|---|---|---|---|---|---|---|
| 0 | 42018 | 42191 | 41924 | 42160 | 207.092540 | 2024-01-01 00:00:00+00:00 |
| 1 | 42161 | 42299 | 42022 | 42114 | 342.810455 | 2024-01-01 00:01:00+00:00 |
| 2 | 42132 | 42349 | 41946 | 42230 | 196.538267 | 2024-01-01 00:02:00+00:00 |
| 3 | 42220 | 42225 | 42073 | 42086 | 341.995415 | 2024-01-01 00:03:00+00:00 |
| 4 | 42098 | 42428 | 42049 | 42395 | 106.497504 | 2024-01-01 00:04:00+00:00 |
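Before feeding generated bars to the model, it can help to confirm that they respect basic OHLC ordering. The check below is a minimal sketch; check_ohlc is a hypothetical helper that is not part of the notebook, and the sample rows are copied from the first three rows of the table above.

```python
import pandas as pd

def check_ohlc(bars: pd.DataFrame) -> pd.Series:
    """Count rows that violate basic OHLC ordering (hypothetical helper)."""
    body_high = bars[["open", "close"]].max(axis=1)
    body_low = bars[["open", "close"]].min(axis=1)
    prices = bars[["open", "high", "low", "close"]]
    return pd.Series({
        "high_below_body": int((bars["high"] < body_high).sum()),
        "low_above_body": int((bars["low"] > body_low).sum()),
        "non_positive_price": int((prices <= 0).any(axis=1).sum()),
    })

sample = pd.DataFrame({
    "open":  [42018, 42161, 42132],
    "high":  [42191, 42299, 42349],
    "low":   [41924, 42022, 41946],
    "close": [42160, 42114, 42230],
})
print(check_ohlc(sample))  # all three counts are 0 for these rows
```

The same function could be applied to the full DataFrame returned by generate_data.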
2. ML Feature and Signal Fusion Implementation
This section defines the ml_feature_signal_fusion function, which integrates machine learning probabilities with traditional technical analysis signals. The function engineers technical features, trains a Random Forest Classifier, generates out-of-sample probability predictions, computes EMA crossover signals, and finally fuses these two components to produce a consolidated trading signal.
def ml_feature_signal_fusion(
    df: pd.DataFrame,
    horizon: int = 5,
    ml_threshold: float = 0.6,
    ema_fast: int = 12,
    ema_slow: int = 26,
) -> pd.DataFrame:
    """
    Fuses ML classification probabilities with EMA cross signals.

    Core Logic
    ----------
    1. Engineers technical features and binary labels.
    2. Trains a RandomForestClassifier on the first 80% of bars.
    3. Generates out-of-sample probability predictions on the remaining 20%.
    4. Computes EMA cross signal on all bars.
    5. Fuses: emits +1 when P(up) >= ml_threshold AND EMA bullish;
       emits -1 when P(down) >= ml_threshold AND EMA bearish.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame.
    horizon : int
        Forward bars for label computation.
    ml_threshold : float
        Minimum ML probability to qualify for fusion.
    ema_fast : int
        Fast EMA period for TA signal.
    ema_slow : int
        Slow EMA period for TA signal.

    Returns
    -------
    pd.DataFrame
        Copy of the input with added columns, including ml_proba, ema_signal, signal.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)

    # ── Feature Engineering ───────────────────────────────────────────────────
    df["return_1"] = df["close"].pct_change(1)
    df["return_5"] = df["close"].pct_change(5)
    delta = df["close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"] = 100 - 100 / (1 + gain / loss.replace(0, np.nan))
    df["sma_ratio"] = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
    # Label is 1 when the close `horizon` bars ahead is higher than the current close.
    # The last `horizon` bars have no future close; the comparison is False, so they
    # get label 0. With the default settings they fall in the test slice, where
    # labels are not used.
    df["label"] = (df["close"].shift(-horizon) > df["close"]).astype(int)
    feats = ["return_1", "return_5", "rsi", "sma_ratio"]
    df_clean = df.dropna(subset=feats + ["label"]).copy()
    X = df_clean[feats].values
    y = df_clean["label"].values

    # Chronological 80/20 split; the scaler is fit on the training slice only.
    split = int(len(X) * 0.8)
    sc = StandardScaler()
    X_tr = sc.fit_transform(X[:split])
    X_te = sc.transform(X[split:])
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_tr, y[:split])
    proba = clf.predict_proba(X_te)[:, 1]  # P(up)

    # Align probabilities back to the test portion of the main DataFrame
    test_idx = df_clean.index[split:]
    df["ml_proba"] = np.nan
    df.loc[test_idx, "ml_proba"] = proba

    # ── EMA Cross Signal ──────────────────────────────────────────────────────
    df["ema_fast"] = df["close"].ewm(span=ema_fast, adjust=False).mean()
    df["ema_slow"] = df["close"].ewm(span=ema_slow, adjust=False).mean()
    df["ema_signal"] = np.where(df["ema_fast"] > df["ema_slow"], 1, -1)

    # ── Fusion ────────────────────────────────────────────────────────────────
    # Rows without an ML probability (NaN) fail both conditions and keep signal 0.
    df["signal"] = 0
    df.loc[(df["ml_proba"] >= ml_threshold) & (df["ema_signal"] == 1), "signal"] = 1
    df.loc[(1 - df["ml_proba"] >= ml_threshold) & (df["ema_signal"] == -1), "signal"] = -1
    return df
ML Probability Prediction and Fusion Threshold
- clf.predict_proba(X_te)[:, 1]: Retrieves the estimated probability of the positive class (upward movement). scikit-learn orders the probability columns by clf.classes_, which holds the label values in sorted order, so with labels 0 and 1 the column at index 1 corresponds to the 'up' label.
- Fusion Threshold (ml_threshold): Requires the ML model to express a minimum confidence level (e.g., 60%) for a signal to be considered valid. Lower thresholds increase signal recall but may reduce precision.
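Rather than relying on column position, the 'up' column can be looked up from clf.classes_. The following is a minimal sketch on placeholder data; the random features, the labels, and the variable name up_col are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # placeholder features
# Placeholder 0/1 labels loosely tied to the first feature.
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# classes_ holds the label values in sorted order; find the column for label 1
# instead of hard-coding index 1.
up_col = int(np.flatnonzero(clf.classes_ == 1)[0])
p_up = clf.predict_proba(X[:5])[:, up_col]

print(clf.classes_.tolist(), up_col)
```

If the training slice happened to contain only one class, classes_ would have a single entry and the lookup would fail with an IndexError, which is easier to diagnose than a silently misread column.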
3. Visualization of Fused Signals
This section visualizes the generated OHLCV data alongside the fused buy and sell signals. It also displays the ML probability estimates, allowing for a clear understanding of the model's confidence and how it interacts with the EMA signals to generate trade decisions.
df_signals = ml_feature_signal_fusion(df, ml_threshold=0.6)
print("Fused Signal Counts:\n", df_signals["signal"].value_counts())
buy_signals = df_signals[df_signals["signal"] == 1]
sell_signals = df_signals[df_signals["signal"] == -1]
fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                    subplot_titles=["Price + ML-TA Fused Signals", "ML Probability"],
                    row_heights=[0.65, 0.35])
fig.add_trace(go.Candlestick(x=df_signals["datetime"],
                             open=df_signals["open"], high=df_signals["high"],
                             low=df_signals["low"], close=df_signals["close"], name="Price"), row=1, col=1)
fig.add_trace(go.Scatter(x=buy_signals["datetime"], y=buy_signals["low"] * 0.999,
                         mode="markers", marker=dict(symbol="triangle-up", size=10, color="green"), name="Buy"), row=1, col=1)
fig.add_trace(go.Scatter(x=sell_signals["datetime"], y=sell_signals["high"] * 1.001,
                         mode="markers", marker=dict(symbol="triangle-down", size=10, color="red"), name="Sell"), row=1, col=1)
fig.add_trace(go.Scatter(x=df_signals["datetime"], y=df_signals["ml_proba"],
                         mode="lines", name="P(Up)", line=dict(color="blue")), row=2, col=1)
fig.add_hline(y=0.6, line_dash="dot", line_color="green", row=2, col=1)
fig.add_hline(y=0.4, line_dash="dot", line_color="red", row=2, col=1)
fig.update_layout(title_text="ML Feature + Signal Fusion",
                  xaxis_rangeslider_visible=False, height=700, xaxis2_title="Datetime")
fig.show()

Fused Signal Counts:
 signal
 0    457
 1     42
-1      1
Name: count, dtype: int64
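The counts above cover all 500 bars, including the training portion where ml_proba is NaN and no signal can fire, so the 0 bucket overstates how often the fusion rule abstained. A minimal sketch of a summary restricted to scored bars follows; summarize_scored_bars is a hypothetical helper and the demo values are made up for illustration.

```python
import numpy as np
import pandas as pd

def summarize_scored_bars(signals: pd.DataFrame) -> pd.Series:
    """Count fused signals only on bars that received an ML probability."""
    scored = signals[signals["ml_proba"].notna()]
    # Reindex so all three outcomes appear even when one never occurs.
    return scored["signal"].value_counts().reindex([1, 0, -1], fill_value=0)

# Illustrative frame: three unscored (training) bars followed by four scored bars.
demo = pd.DataFrame({
    "ml_proba":   [np.nan, np.nan, np.nan, 0.71, 0.52, 0.33, 0.64],
    "ema_signal": [1, 1, -1, 1, 1, -1, -1],
    "signal":     [0, 0, 0, 1, 0, -1, 0],
})
print(summarize_scored_bars(demo).to_dict())  # → {1: 1, 0: 2, -1: 1}
```

Calling summarize_scored_bars(df_signals) on the notebook's output would give the signal mix over the test slice only.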