ML Feature Signal Fusion

Combine the probability output of an ML classifier trained on technical indicator features with an EMA cross signal, emitting a fused trading signal only when both agree.

Tags: ML, fusion, ensemble

The ML + Signal Fusion strategy combines the probabilistic output of a trained machine learning classifier with a traditional technical signal. A trading signal is generated only when the classifier's probability clears a predefined confidence threshold and the technical signal points in the same direction.

Fusion Logic

  • ML Model Output: The machine learning model produces a probability estimate, P(up), representing the likelihood of an upward price movement.
  • Technical Signal: A traditional technical indicator, here an Exponential Moving Average (EMA) cross, provides a directional signal (+1 when the fast EMA is above the slow EMA, -1 otherwise).
  • Fused Signal Generation: A signal is emitted only when the ML probability for a direction (P(up), or P(down) = 1 - P(up)) meets or exceeds the specified ml_threshold AND the technical signal points the same way. In every other case the fused signal is 0.
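
The rule above can be sketched as a small standalone function. This is a minimal illustration, not the notebook's implementation: the name fuse and the example inputs are assumptions made for the sketch.

```python
import numpy as np

def fuse(p_up, ta_signal, ml_threshold=0.6):
    """Return +1/-1 where the ML probability and the TA signal agree, else 0.

    `fuse` is a hypothetical helper used only for illustration.
    """
    p_up = np.asarray(p_up, dtype=float)
    ta_signal = np.asarray(ta_signal)
    p_down = 1.0 - p_up
    fused = np.zeros(p_up.shape, dtype=int)
    # Comparisons against NaN are False, so bars without a probability stay at 0.
    fused[(p_up >= ml_threshold) & (ta_signal == 1)] = 1
    fused[(p_down >= ml_threshold) & (ta_signal == -1)] = -1
    return fused

example = fuse([0.70, 0.70, 0.30, 0.55, np.nan], [1, -1, -1, 1, 1])
print(example)  # [ 1  0 -1  0  0]
```

Only the first and third bars produce a signal: the second has a confident model but a conflicting EMA direction, the fourth has agreement but insufficient confidence, and the fifth has no probability at all.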

Advantages

  • The machine learning layer can capture non-linear relationships between features that a single indicator rule would miss.
  • The technical analysis layer provides interpretable structural context, which makes each signal easier to explain and audit.
  • Neither component is relied on alone, so every emitted signal has been confirmed by both.

Limitations

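  • The model is trained and evaluated on a single chronological 80/20 split of one synthetic 500-bar series, which is too little data to count as a meaningful out-of-sample assessment. Because labels look horizon bars ahead, the last training labels also overlap the start of the test period.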
  • For production environments, implementing robust validation methodologies, such as walk-forward optimization or purged cross-validation, is essential to ensure generalization and prevent overfitting.
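
A walk-forward evaluation along those lines could look like the sketch below. It uses scikit-learn's TimeSeriesSplit with gap=horizon as a simple stand-in for purging, which is an approximation rather than a full purged cross-validation. The function name walk_forward_accuracy and the random placeholder data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_accuracy(X, y, horizon=5, n_splits=4):
    """Accuracy per fold, training only on bars that precede each test fold.

    Hypothetical helper, not part of the notebook.
    """
    # gap=horizon drops the training bars whose forward-looking labels
    # would overlap the test fold.
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=horizon)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        clf = RandomForestClassifier(n_estimators=50, random_state=42)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return scores

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))      # placeholder features
y_demo = rng.integers(0, 2, size=300)   # placeholder labels
fold_scores = walk_forward_accuracy(X_demo, y_demo)
print(fold_scores)
```

With real features, a spread of per-fold scores gives a more honest picture of stability than the single split used in this notebook.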
[1]
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import plotly.graph_objects as go
from plotly.subplots import make_subplots

1. Data Generation

Synthetic 1-minute OHLCV bars are generated with a random walk whose step size scales with the previous close, starting from a price of 42,000. The data is for demonstration only. Because no random seed is set, each run produces a different series and therefore different signal counts.

[2]
def generate_data(periods: int) -> pd.DataFrame:
    """
    Generates synthetic OHLCV price data using a geometric random walk.

    Parameters
    ----------
    periods : int
        Number of 1-minute bars to generate.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns: open, high, low, close, volume, datetime.
    """
    start_date     = pd.to_datetime("2024-01-01 00:00:00+00:00")
    datetime_index = pd.date_range(start_date, periods=periods, freq="1min", tz="UTC")
    price_data = []
    last_close = 42000
    for i in range(periods):
        open_price  = last_close + np.random.normal(0, last_close * 0.0005)
        close_price = open_price + np.random.normal(0, last_close * 0.005)
        body_high   = max(open_price, close_price)
        body_low    = min(open_price, close_price)
        # Wicks extend beyond the candle body by a non-negative random amount,
        # so high >= max(open, close) and low <= min(open, close) always hold.
        high_price  = body_high + abs(np.random.normal(0, last_close * 0.002))
        low_price   = body_low  - abs(np.random.normal(0, last_close * 0.002))
        price_data.append({
            "open":  max(1, int(open_price)),
            "high":  max(1, int(high_price)),
            "low":   max(1, int(low_price)),
            "close": max(1, int(close_price)),
        })
        last_close = close_price
    df = pd.DataFrame(price_data, index=datetime_index)
    df.index.name = "datetime"
    df["volume"]   = np.random.uniform(100.0, 500.0, periods)
    df["datetime"] = df.index.to_series()
    return df.reset_index(drop=True)

df = generate_data(500)
display(df.head())
    open   high    low  close      volume                  datetime
0  42018  42191  41924  42160  207.092540 2024-01-01 00:00:00+00:00
1  42161  42299  42022  42114  342.810455 2024-01-01 00:01:00+00:00
2  42132  42349  41946  42230  196.538267 2024-01-01 00:02:00+00:00
3  42220  42225  42073  42086  341.995415 2024-01-01 00:03:00+00:00
4  42098  42428  42049  42395  106.497504 2024-01-01 00:04:00+00:00
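
As a quick sanity check on generated bars, a small helper can confirm that every high sits at or above the candle body and every low at or below it. The sketch below is an illustration only: check_ohlc is a hypothetical name, and the sample rows are copied from the output above.

```python
import pandas as pd

def check_ohlc(bars: pd.DataFrame) -> bool:
    """True if high >= max(open, close) and low <= min(open, close) on every bar.

    Hypothetical helper, not part of the notebook.
    """
    body = bars[["open", "close"]]
    high_ok = (bars["high"] >= body.max(axis=1)).all()
    low_ok = (bars["low"] <= body.min(axis=1)).all()
    return bool(high_ok and low_ok)

sample = pd.DataFrame({
    "open":  [42018, 42161, 42132],
    "high":  [42191, 42299, 42349],
    "low":   [41924, 42022, 41946],
    "close": [42160, 42114, 42230],
})
print(check_ohlc(sample))  # True

# A bar whose high is below its close violates the invariant.
broken = sample.assign(high=[42100, 42299, 42349])
print(check_ohlc(broken))  # False
```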

2. ML Feature and Signal Fusion Implementation

This section defines the ml_feature_signal_fusion function, which integrates machine learning probabilities with a traditional technical analysis signal. The function engineers technical features, trains a Random Forest classifier on the first 80% of bars, predicts probabilities for the remaining 20%, computes an EMA crossover signal on all bars, and fuses the two components into a single trading signal.

[3]
def ml_feature_signal_fusion(
    df: pd.DataFrame,
    horizon: int = 5,
    ml_threshold: float = 0.6,
    ema_fast: int = 12,
    ema_slow: int = 26,
) -> pd.DataFrame:
    """
    Fuses ML classification probabilities with EMA cross signals.

    Core Logic
    ----------
    1. Engineers technical features and binary labels.
    2. Trains a RandomForestClassifier on the first 80% of bars.
    3. Generates out-of-sample probability predictions on the remaining 20%.
    4. Computes EMA cross signal on all bars.
    5. Fuses: emits +1 when P(up) >= ml_threshold AND EMA bullish;
             emits -1 when P(down) >= ml_threshold AND EMA bearish.

    Parameters
    ----------
    df : pd.DataFrame
        OHLCV DataFrame.
    horizon : int
        Forward bars for label computation.
    ml_threshold : float
        Minimum ML probability to qualify for fusion.
    ema_fast : int
        Fast EMA period for TA signal.
    ema_slow : int
        Slow EMA period for TA signal.

    Returns
    -------
    pd.DataFrame
        Copy of the input with the engineered feature columns plus ml_proba
        (NaN outside the test portion), ema_fast, ema_slow, ema_signal, and signal.
    """
    df = df.copy().sort_values("datetime", ignore_index=True)

    # ── Feature Engineering ───────────────────────────────────────────────────
    df["return_1"]  = df["close"].pct_change(1)
    df["return_5"]  = df["close"].pct_change(5)
    delta = df["close"].diff()
    gain  = delta.clip(lower=0).rolling(14).mean()
    loss  = (-delta.clip(upper=0)).rolling(14).mean()
    df["rsi"]       = 100 - 100 / (1 + gain / loss.replace(0, np.nan))
    df["sma_ratio"] = df["close"].rolling(10).mean() / df["close"].rolling(20).mean()
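    future_close    = df["close"].shift(-horizon)
    # The last `horizon` bars have no future close; leave their label as NaN so
    # dropna() below removes them instead of silently labelling them 0 ("down").
    df["label"]     = (future_close > df["close"]).astype(float).where(future_close.notna())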

    feats = ["return_1", "return_5", "rsi", "sma_ratio"]
    df_clean = df.dropna(subset=feats + ["label"]).copy()

    X = df_clean[feats].values
    y = df_clean["label"].values
    split = int(len(X) * 0.8)

    sc = StandardScaler()
    X_tr = sc.fit_transform(X[:split])
    X_te = sc.transform(X[split:])

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_tr, y[:split])

    proba = clf.predict_proba(X_te)[:, 1]   # P(up)

    # Align probabilities back to the test portion of the main DataFrame
    test_idx = df_clean.index[split:]
    df["ml_proba"] = np.nan
    df.loc[test_idx, "ml_proba"] = proba

    # ── EMA Cross Signal ──────────────────────────────────────────────────────
    df["ema_fast"]  = df["close"].ewm(span=ema_fast, adjust=False).mean()
    df["ema_slow"]  = df["close"].ewm(span=ema_slow, adjust=False).mean()
    df["ema_signal"] = np.where(df["ema_fast"] > df["ema_slow"], 1, -1)

    # ── Fusion ────────────────────────────────────────────────────────────────
    df["signal"] = 0
    df.loc[(df["ml_proba"] >= ml_threshold)       & (df["ema_signal"] == 1),  "signal"] =  1
    df.loc[(1 - df["ml_proba"] >= ml_threshold)   & (df["ema_signal"] == -1), "signal"] = -1

    return df

ML Probability Prediction and Fusion Threshold

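  • clf.predict_proba(X_te)[:, 1]: Retrieves the estimated probability of the positive class (upward movement). The columns of predict_proba follow clf.classes_, which scikit-learn stores in ascending order, so with labels 0 (down) and 1 (up) column index 1 is P(up). This assumes both classes appear in the training data.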
  • Fusion Threshold (ml_threshold): Requires the ML model to express a minimum confidence level (e.g., 60%) for a signal to be considered valid. Lower thresholds increase signal recall but may reduce precision.
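
Rather than relying on a hard-coded column index, the P(up) column can be looked up through clf.classes_. The sketch below shows this on random placeholder data. The variable names and the synthetic features are assumptions for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))           # placeholder features
y_demo = (X_demo[:, 0] > 0).astype(int)      # placeholder labels: 1 = "up", 0 = "down"

clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)
print(clf.classes_)  # label values in ascending order

# Find the predict_proba column that belongs to label 1 instead of assuming it.
up_col = int(np.flatnonzero(clf.classes_ == 1)[0])
p_up = clf.predict_proba(X_demo)[:, up_col]
p_down = 1.0 - p_up  # binary case: the two columns sum to 1
```

If the training slice happened to contain only one class, the lookup would fail with an IndexError, which is easier to diagnose than a silently wrong column.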

3. Visualization of Fused Signals

This section visualizes the generated OHLCV data alongside the fused buy and sell signals. It also displays the ML probability estimates, allowing for a clear understanding of the model's confidence and how it interacts with the EMA signals to generate trade decisions.

[4]
df_signals = ml_feature_signal_fusion(df, ml_threshold=0.6)
print("Fused Signal Counts:\n", df_signals["signal"].value_counts())

buy_signals  = df_signals[df_signals["signal"] ==  1]
sell_signals = df_signals[df_signals["signal"] == -1]

fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
    subplot_titles=["Price + ML-TA Fused Signals", "ML Probability"],
    row_heights=[0.65, 0.35])

fig.add_trace(go.Candlestick(x=df_signals["datetime"],
    open=df_signals["open"], high=df_signals["high"],
    low=df_signals["low"], close=df_signals["close"], name="Price"), row=1, col=1)

fig.add_trace(go.Scatter(x=buy_signals["datetime"],  y=buy_signals["low"]  * 0.999,
    mode="markers", marker=dict(symbol="triangle-up",   size=10, color="green"), name="Buy"),  row=1, col=1)

fig.add_trace(go.Scatter(x=sell_signals["datetime"], y=sell_signals["high"] * 1.001,
    mode="markers", marker=dict(symbol="triangle-down", size=10, color="red"),   name="Sell"), row=1, col=1)

fig.add_trace(go.Scatter(x=df_signals["datetime"], y=df_signals["ml_proba"],
    mode="lines", name="P(Up)", line=dict(color="blue")), row=2, col=1)

fig.add_hline(y=0.6, line_dash="dot", line_color="green", row=2, col=1)
fig.add_hline(y=0.4, line_dash="dot", line_color="red",   row=2, col=1)

fig.update_layout(title_text="ML Feature + Signal Fusion",
    xaxis_rangeslider_visible=False, height=700, xaxis2_title="Datetime")

fig.show()
Fused Signal Counts:
 signal
 0    457
 1     42
-1      1
Name: count, dtype: int64