Data·Analysis·Intermediate

Correlation Analysis

Compute rolling and static correlation matrices across assets and time periods to identify regime shifts and diversification opportunities.

correlation · portfolio · statistics

Correlation Analysis Framework

This notebook defines a standardized protocol for measuring the statistical relationship between the price movements of multiple assets. It covers Pearson correlation, rolling correlation, and rank-based correlation on a representative dummy dataset containing three assets.


1. Dependency Installation

[1]
!pip install pandas numpy
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.12/dist-packages (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

2. Library Imports

[2]
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

3. Key Concepts

Correlation

Correlation measures whether two assets tend to move in the same direction, in opposite directions, or without any linear relationship to each other. It is expressed as a number between −1 and +1:

| Value | Meaning |
|---|---|
| +1.0 | Perfect positive correlation: B's returns are an exact increasing linear function of A's, so an above-average move in A always comes with a proportionally above-average move in B |
| +0.5 | Moderate positive correlation: when A rises, B tends to rise, but not always |
| 0.0 | No linear correlation: the movements of A and B show no straight-line relationship (they can still be related non-linearly) |
| −0.5 | Moderate negative correlation: when A rises, B tends to fall |
| −1.0 | Perfect negative correlation: B's returns are an exact decreasing linear function of A's, so an above-average move in A always comes with a proportionally below-average move in B |

In trading, correlation is used to assess diversification. A portfolio of two assets with +0.95 correlation offers almost no diversification benefit because both assets tend to fall together in a downturn. A portfolio of assets with low or negative correlation is usually more stable because losses in one position tend to be at least partly offset by gains in another.
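To make the diversification point concrete, the sketch below evaluates the standard two-asset portfolio variance formula at several correlation levels. The helper name `two_asset_volatility`, the equal weights, and the 2% per-period volatilities are illustrative assumptions, not values taken from this notebook's dataset.

```python
import math

def two_asset_volatility(w_a: float, w_b: float,
                         vol_a: float, vol_b: float, rho: float) -> float:
    """Volatility of a two-asset portfolio (hypothetical helper).

    variance = (w_a*vol_a)^2 + (w_b*vol_b)^2 + 2*w_a*w_b*rho*vol_a*vol_b
    """
    variance = ((w_a * vol_a) ** 2 + (w_b * vol_b) ** 2
                + 2 * w_a * w_b * rho * vol_a * vol_b)
    return math.sqrt(variance)

# Assumed inputs: equal weights, both assets with 2% volatility per period.
for rho in (0.95, 0.0, -0.5):
    vol = two_asset_volatility(0.5, 0.5, 0.02, 0.02, rho)
    print(f"rho={rho:+.2f} -> portfolio volatility {vol:.4f}")
```

With these inputs the portfolio volatility at a correlation of +0.95 stays close to the 2% of each single asset, while lower correlations reduce it noticeably.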

Pearson Correlation

Pearson correlation is the standard correlation measure. It quantifies the linear relationship between two series: the covariance of the two assets' log returns divided by the product of their standard deviations. It assumes the relationship between the two assets is approximately linear (straight-line). It is the most widely used correlation measure in financial analysis.
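As a minimal sketch of that definition, the block below computes Pearson's coefficient by hand and compares it with pandas. The function name `pearson_manual` and the return values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pearson_manual(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r from its definition (hypothetical helper).

    The 1/(n-1) factors of the covariance and the standard deviations
    cancel, so plain sums of centered products are enough.
    """
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# Illustrative return values (assumed, not taken from the dataset).
returns_a = np.array([0.010, -0.020, 0.015, -0.005, 0.020])
returns_b = np.array([0.008, -0.015, 0.020, -0.010, 0.012])

r_manual = pearson_manual(returns_a, returns_b)
r_pandas = pd.Series(returns_a).corr(pd.Series(returns_b))
print(f"manual={r_manual:.6f}  pandas={r_pandas:.6f}")
```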

Rolling Correlation

A single correlation number computed over the entire history can be misleading because the relationship between two assets changes over time. Rolling correlation computes the Pearson correlation over a sliding window of the most recent N candles, producing a time series of correlation values. This reveals whether two assets are becoming more or less correlated over time, which is critical information for risk management.

Spearman Rank Correlation

Spearman correlation measures whether two assets move in the same rank order rather than requiring a strictly linear relationship. Instead of using raw return values, it ranks the returns from smallest to largest and computes the Pearson correlation of the ranks. Spearman is more robust to extreme values (large spikes or crashes) that can distort Pearson correlation. A Spearman value close to Pearson indicates the relationship is well described by a straight line; a large difference between the two suggests non-linear behavior or the influence of outliers.
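The rank-based idea can be sketched on a small example in which one series is a monotonic but non-linear function of the other. The data below are assumed for illustration only; `DataFrame.corr` is used for the Spearman value because that is the call this notebook relies on.

```python
import pandas as pd

# Illustrative data (assumed): y is a monotonic but non-linear function of x.
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
frame["y"] = frame["x"] ** 3

pearson = frame.corr(method="pearson").loc["x", "y"]
spearman = frame.corr(method="spearman").loc["x", "y"]

# Spearman is Pearson applied to the ranks of each series.
spearman_via_ranks = frame["x"].rank().corr(frame["y"].rank())

print(f"pearson={pearson:.4f}")
print(f"spearman={spearman:.4f}  (via ranks: {spearman_via_ranks:.4f})")
```

Because the rank order of `x` and `y` is identical, Spearman reports a perfect relationship while Pearson falls below 1, which is the kind of gap that hints at non-linearity.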

Correlation Matrix

A correlation matrix shows the pairwise correlation between every combination of assets in a single table. Each cell contains the correlation between the row asset and the column asset. The diagonal is always 1.0 (every asset has perfect correlation with itself). The matrix is symmetric: the correlation of A with B equals the correlation of B with A.
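Because of the symmetry and the unit diagonal, only the cells above the diagonal carry unique information. The sketch below checks both properties and lists the unique pairs; the asset names A, B, C and their return values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative returns for three hypothetical assets (assumed values).
returns = pd.DataFrame({
    "A": [0.010, -0.020, 0.015, -0.005, 0.020],
    "B": [0.008, -0.015, 0.020, -0.010, 0.012],
    "C": [-0.004, 0.010, -0.002, 0.006, -0.009],
})
matrix = returns.corr()

# Structural properties described above.
assert np.allclose(matrix.values, matrix.values.T)   # symmetric
assert np.allclose(np.diag(matrix.values), 1.0)      # unit diagonal

# Upper triangle only; k=1 skips the diagonal itself.
rows, cols = np.triu_indices(len(matrix), k=1)
pairs = pd.Series(
    matrix.values[rows, cols],
    index=[f"{matrix.index[i]}-{matrix.columns[j]}" for i, j in zip(rows, cols)],
)
print(pairs.sort_values())
```

Sorting the unique pairs is one simple way to spot the least correlated combinations when looking for diversification candidates.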

Log Returns for Correlation

Throughout this notebook, correlation is computed on log returns rather than raw prices. Raw prices of two assets that both trend upward over time will show high correlation simply because both numbers are increasing; this is a spurious artifact of the shared trend, not evidence of a genuine relationship. Log returns remove the trend in price levels and isolate the period-by-period movements, which indicate whether the assets respond to the same market conditions.
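The trend artifact can be reproduced with synthetic data. In the sketch below, two return streams are generated independently and share nothing except a positive drift; the seed, drift, noise level, and column names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
n = 1000

# Two independent return streams with the same positive drift (assumed parameters).
returns = pd.DataFrame({
    "asset_x": 0.002 + 0.002 * rng.standard_normal(n),
    "asset_y": 0.002 + 0.002 * rng.standard_normal(n),
})

# Rebuild price levels from the cumulative log returns.
prices = 100 * np.exp(returns.cumsum())

price_corr = prices["asset_x"].corr(prices["asset_y"])
return_corr = returns["asset_x"].corr(returns["asset_y"])
print(f"correlation of price levels: {price_corr:.3f}")
print(f"correlation of log returns:  {return_corr:.3f}")
```

The price levels appear strongly correlated because both series drift upward, while the returns, which were generated independently, show a correlation near zero.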


4. Dummy Dataset

[3]
raw_data = {
    "datetime": [
        "2024-01-01 00:00:00+00:00",
        "2024-01-01 00:01:00+00:00",
        "2024-01-01 00:02:00+00:00",
        "2024-01-01 00:03:00+00:00",
        "2024-01-01 00:04:00+00:00",
        "2024-01-01 00:05:00+00:00",
        "2024-01-01 00:06:00+00:00",
        "2024-01-01 00:07:00+00:00",
        "2024-01-01 00:08:00+00:00",
        "2024-01-01 00:09:00+00:00",
    ],
    # BTC — base asset
    "close_BTC":  [42200, 42150, 42300, 42250, 42400,
                   42350, 42500, 42450, 42600, 42550],
    # ETH — typically positively correlated with BTC
    "close_ETH":  [2200, 2190, 2210, 2205, 2220,
                   2215, 2230, 2225, 2240, 2235],
    # SOL — included as a third asset for matrix demonstration
    "close_SOL":  [95.0, 94.5, 95.5, 95.2, 95.8,
                   95.5, 96.0, 95.7, 96.3, 96.0],
}

df = pd.DataFrame(raw_data)
df["datetime"] = pd.to_datetime(df["datetime"], utc=True)
df = df.set_index("datetime")

print("--- Raw Close Prices ---")
display(df)
--- Raw Close Prices ---
close_BTC close_ETH close_SOL
datetime
2024-01-01 00:00:00+00:00 42200 2200 95.0
2024-01-01 00:01:00+00:00 42150 2190 94.5
2024-01-01 00:02:00+00:00 42300 2210 95.5
2024-01-01 00:03:00+00:00 42250 2205 95.2
2024-01-01 00:04:00+00:00 42400 2220 95.8
2024-01-01 00:05:00+00:00 42350 2215 95.5
2024-01-01 00:06:00+00:00 42500 2230 96.0
2024-01-01 00:07:00+00:00 42450 2225 95.7
2024-01-01 00:08:00+00:00 42600 2240 96.3
2024-01-01 00:09:00+00:00 42550 2235 96.0

Code Logic

  • Three close price series are included: BTC, ETH, and SOL. ETH is designed to move closely with BTC to demonstrate high positive correlation; SOL provides a third asset for the full matrix.
  • set_index("datetime"): The datetime column is set as the index so that pandas correlation and rolling operations align rows by time automatically.
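The effect of a datetime index on alignment can be sketched with two series whose timestamps only partly overlap. The values and variable names below are assumed for illustration; the sketch relies on `Series.corr` pairing observations by index label.

```python
import pandas as pd

# Two illustrative series (assumed values) with partly overlapping timestamps.
idx_a = pd.date_range("2024-01-01 00:00", periods=5, freq="min", tz="UTC")
idx_b = pd.date_range("2024-01-01 00:02", periods=5, freq="min", tz="UTC")
series_a = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx_a)
series_b = pd.Series([30.0, 20.0, 25.0, 10.0, 5.0], index=idx_b)

# Values are paired by timestamp, so only 00:02 through 00:04 are used.
aligned_corr = series_a.corr(series_b)

# Pairing by position ignores the timestamps and mixes different minutes.
positional_corr = series_a.reset_index(drop=True).corr(
    series_b.reset_index(drop=True)
)

print(f"aligned by timestamp: {aligned_corr:.4f}")
print(f"paired by position:   {positional_corr:.4f}")
```

The two results differ, which is why keeping the timestamp as the index matters whenever the asset series do not cover exactly the same candles.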

5. Correlation Analysis Function

[4]
def analyze_correlation(df: pd.DataFrame, rolling_window: int = 5) -> dict:
    """
    Compute pairwise correlation metrics between asset close price series.

    Args:
        df             (pd.DataFrame): DataFrame with one close price column per asset,
                                       indexed by UTC datetime.
        rolling_window (int):          Number of candles for rolling correlation window.

    Returns:
        dict: Contains the following DataFrames:
              - "log_returns"         : Log return series for each asset.
              - "pearson_matrix"      : Full-history Pearson correlation matrix.
              - "spearman_matrix"     : Full-history Spearman rank correlation matrix.
              - "rolling_correlation" : Rolling Pearson correlation (first pair only).
    """
    # --- Log Returns ---
    log_returns = np.log(df / df.shift(1)).dropna()

    # --- Pearson Correlation Matrix ---
    pearson_matrix  = log_returns.corr(method="pearson")

    # --- Spearman Rank Correlation Matrix ---
    spearman_matrix = log_returns.corr(method="spearman")

    # --- Rolling Correlation (first two columns) ---
    cols = log_returns.columns.tolist()
    col_a, col_b = cols[0], cols[1]

    rolling_corr = (
        log_returns[col_a]
        .rolling(window=rolling_window)
        .corr(log_returns[col_b])
        .rename(f"rolling_corr_{col_a}_vs_{col_b}")
        .reset_index()
    )

    return {
        "log_returns":         log_returns.reset_index(),
        "pearson_matrix":      pearson_matrix,
        "spearman_matrix":     spearman_matrix,
        "rolling_correlation": rolling_corr,
    }

Code Logic

Log returns

  • np.log(df / df.shift(1)): Computes log returns for all asset columns simultaneously in one vectorized operation. df.shift(1) shifts all columns down by one row, so dividing produces the ratio of each price to its previous value. See Section 3 for the rationale for using log returns rather than raw prices.
  • .dropna(): Removes the first row, which contains NaN for all assets because no prior price exists for the ratio.
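One practical property of log returns is that they add up across periods. The sketch below checks this on the BTC closes from Section 4 and also compares log returns with simple percentage returns; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# BTC closes copied from the dummy dataset in Section 4.
close_btc = pd.Series([42200, 42150, 42300, 42250, 42400,
                       42350, 42500, 42450, 42600, 42550], dtype=float)

log_returns = np.log(close_btc / close_btc.shift(1)).dropna()
simple_returns = close_btc.pct_change().dropna()

# The sum of log returns equals the log of (last price / first price).
total_log_return = log_returns.sum()
direct_log_return = np.log(close_btc.iloc[-1] / close_btc.iloc[0])
print(f"sum of log returns: {total_log_return:.6f}")
print(f"log(last / first):  {direct_log_return:.6f}")

# For small moves, log and simple returns are nearly identical.
max_gap = (log_returns - simple_returns).abs().max()
print(f"largest |log - simple| gap: {max_gap:.2e}")
```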

Pearson correlation matrix

  • log_returns.corr(method="pearson"): Computes the pairwise Pearson correlation between every combination of asset log return columns and returns the result as a symmetric matrix. See Section 3 for full definition.

Spearman rank correlation matrix

  • log_returns.corr(method="spearman"): Same structure as Pearson but operates on the rank order of returns rather than their raw values. A large difference between the Pearson and Spearman values for the same pair indicates the relationship is non-linear or driven by extreme outliers. See Section 3 for full definition.

Rolling correlation

  • log_returns[col_a].rolling(window=rolling_window).corr(log_returns[col_b]): Computes the Pearson correlation between two assets over a sliding window of rolling_window candles. At each point in time, only the most recent N log returns are used — the result reveals how the correlation between the two assets evolves over time rather than summarizing their entire history in a single number. See Section 3 for full definition.
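  • cols[0], cols[1]: The first two asset columns are selected by position, so the function works with any column naming convention without hardcoding asset names. Only this first pair gets a rolling series; the input therefore needs at least two columns, and any other pair has to be computed separately.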
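The same rolling call can be extended to every pair of columns. This is a minimal sketch under the assumption that a helper like `rolling_pairwise_corr` (a hypothetical name, not part of the notebook's function) is acceptable; it reuses the closes from Section 4.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def rolling_pairwise_corr(log_returns: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling Pearson correlation for every column pair (hypothetical helper)."""
    series = {
        f"{a}_vs_{b}": log_returns[a].rolling(window=window).corr(log_returns[b])
        for a, b in combinations(log_returns.columns, 2)
    }
    return pd.DataFrame(series)

# Closes copied from the dummy dataset in Section 4.
prices = pd.DataFrame({
    "close_BTC": [42200, 42150, 42300, 42250, 42400,
                  42350, 42500, 42450, 42600, 42550],
    "close_ETH": [2200, 2190, 2210, 2205, 2220,
                  2215, 2230, 2225, 2240, 2235],
    "close_SOL": [95.0, 94.5, 95.5, 95.2, 95.8,
                  95.5, 96.0, 95.7, 96.3, 96.0],
})
log_returns = np.log(prices / prices.shift(1)).dropna()

rolling_all = rolling_pairwise_corr(log_returns, window=5)
print(rolling_all.round(4))
```

The first `window - 1` rows are NaN for every pair because a full window of returns is not yet available.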

6. Execution

[5]
ROLLING_WINDOW = 5

results = analyze_correlation(df, rolling_window=ROLLING_WINDOW)

7. Output and Inspection

[8]
print("--- Log Returns ---")
display(results["log_returns"].round(6))

print("\n--- Pearson Correlation Matrix ---")
display(results["pearson_matrix"].round(4))

print("\n--- Spearman Rank Correlation Matrix ---")
display(results["spearman_matrix"].round(4))

print("\n--- Rolling Correlation (BTC vs ETH) ---")
display(results["rolling_correlation"].round(4))
--- Log Returns ---
datetime close_BTC close_ETH close_SOL
0 2024-01-01 00:01:00+00:00 -0.001186 -0.004556 -0.005277
1 2024-01-01 00:02:00+00:00 0.003552 0.009091 0.010526
2 2024-01-01 00:03:00+00:00 -0.001183 -0.002265 -0.003146
3 2024-01-01 00:04:00+00:00 0.003544 0.006780 0.006283
4 2024-01-01 00:05:00+00:00 -0.001180 -0.002255 -0.003136
5 2024-01-01 00:06:00+00:00 0.003536 0.006749 0.005222
6 2024-01-01 00:07:00+00:00 -0.001177 -0.002245 -0.003130
7 2024-01-01 00:08:00+00:00 0.003527 0.006719 0.006250
8 2024-01-01 00:09:00+00:00 -0.001174 -0.002235 -0.003120

--- Pearson Correlation Matrix ---
close_BTC close_ETH close_SOL
close_BTC 1.0000 0.9822 0.9624
close_ETH 0.9822 1.0000 0.9923
close_SOL 0.9624 0.9923 1.0000

--- Spearman Rank Correlation Matrix ---
close_BTC close_ETH close_SOL
close_BTC 1.0000 1.0000 0.9833
close_ETH 1.0000 1.0000 0.9833
close_SOL 0.9833 0.9833 1.0000

--- Rolling Correlation (BTC vs ETH) ---
datetime rolling_corr_close_BTC_vs_close_ETH
0 2024-01-01 00:01:00+00:00 NaN
1 2024-01-01 00:02:00+00:00 NaN
2 2024-01-01 00:03:00+00:00 NaN
3 2024-01-01 00:04:00+00:00 NaN
4 2024-01-01 00:05:00+00:00 0.9795
5 2024-01-01 00:06:00+00:00 0.9850
6 2024-01-01 00:07:00+00:00 1.0000
7 2024-01-01 00:08:00+00:00 1.0000
8 2024-01-01 00:09:00+00:00 1.0000

8. Interpretation Guide

[7]
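def interpret_correlation(value: float) -> str:
    """Translate a correlation coefficient into a plain-language strength label."""
    # NaN (e.g. from a constant series) and exactly 0 have no direction to report.
    if pd.isna(value):
        return "undefined correlation (NaN)"
    if value == 0:
        return "no linear correlation (0.0000)"
    abs_val = abs(value)
    direction = "positive" if value > 0 else "negative"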

    if abs_val >= 0.9:
        strength = "very strong"
    elif abs_val >= 0.7:
        strength = "strong"
    elif abs_val >= 0.5:
        strength = "moderate"
    elif abs_val >= 0.3:
        strength = "weak"
    else:
        strength = "negligible"

    return f"{strength} {direction} correlation ({value:.4f})"


matrix = results["pearson_matrix"]
cols   = matrix.columns.tolist()

print("--- Pairwise Interpretation ---")
print(f"BTC vs ETH : {interpret_correlation(matrix.loc[cols[0], cols[1]])}")
print(f"BTC vs SOL : {interpret_correlation(matrix.loc[cols[0], cols[2]])}")
print(f"ETH vs SOL : {interpret_correlation(matrix.loc[cols[1], cols[2]])}")

print("\n--- Schema Summary ---")
results["log_returns"].info()
--- Pairwise Interpretation ---
BTC vs ETH : very strong positive correlation (0.9822)
BTC vs SOL : very strong positive correlation (0.9624)
ETH vs SOL : very strong positive correlation (0.9923)

--- Schema Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype              
---  ------     --------------  -----              
 0   datetime   9 non-null      datetime64[ns, UTC]
 1   close_BTC  9 non-null      float64            
 2   close_ETH  9 non-null      float64            
 3   close_SOL  9 non-null      float64            
dtypes: datetime64[ns, UTC](1), float64(3)
memory usage: 420.0 bytes

Code Logic

  • interpret_correlation: Translates a raw correlation coefficient into a plain-language strength label by applying rule-of-thumb thresholds to its absolute value: ≥0.9 very strong, ≥0.7 strong, ≥0.5 moderate, ≥0.3 weak, below 0.3 negligible. These cutoffs are a common convention rather than a formal standard, and other fields draw the lines differently.
  • matrix.loc[cols[0], cols[1]]: Retrieves the pairwise correlation value from the matrix by row and column label. The labels come from cols, which follows the column order of the input DataFrame, so the hardcoded names in the print statements (BTC, ETH, SOL) are correct only for that order.
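The same strength thresholds can also be applied to a whole matrix in one step. The sketch below is one possible approach using `np.digitize`; the names `THRESHOLDS`, `LABELS`, and `label_matrix` are hypothetical, and the 2x2 matrix is an assumed example added so that more than one label appears.

```python
import numpy as np
import pandas as pd

THRESHOLDS = [0.3, 0.5, 0.7, 0.9]
LABELS = np.array(["negligible", "weak", "moderate", "strong", "very strong"])

def label_matrix(matrix: pd.DataFrame) -> pd.DataFrame:
    """Label every cell of a correlation matrix by strength (hypothetical helper)."""
    # For each |r|, np.digitize counts how many thresholds it meets or exceeds.
    bucket = np.digitize(matrix.abs().to_numpy(), THRESHOLDS)
    labels = pd.DataFrame(LABELS[bucket], index=matrix.index, columns=matrix.columns)
    # Keep undefined correlations as NaN instead of assigning them a label.
    return labels.where(matrix.notna())

# Pearson values copied from the Section 7 output.
names = ["close_BTC", "close_ETH", "close_SOL"]
pearson = pd.DataFrame(
    [[1.0000, 0.9822, 0.9624],
     [0.9822, 1.0000, 0.9923],
     [0.9624, 0.9923, 1.0000]],
    index=names, columns=names,
)
print(label_matrix(pearson))

# Assumed 2x2 example with a moderate negative correlation.
mixed = pd.DataFrame([[1.0, -0.6], [-0.6, 1.0]], index=["P", "Q"], columns=["P", "Q"])
print(label_matrix(mixed))
```

This version reports strength only; the sign of each correlation would still need to be read from the original matrix.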