Data·Cleaning·Beginner

Handle Missing Data

Detect, classify, and fill gaps in price data — forward-fill, interpolation, and gap-flagging strategies for OHLCV series.

missing datagapsimputation

Missing Value Handling Framework

This notebook defines a standardized protocol for detecting and resolving missing values in OHLCV time-series data. It covers null detection, timestamp-to-datetime conversion, forward-fill imputation for price columns, and volume-specific zero-fill strategy on a representative dummy dataset.


1. Dependency Installation

[1]
!pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

2. Library Imports

[2]
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

Code Logic

  • pandas: Provides the DataFrame structure and null-handling operations (fillna, ffill, bfill, isna).
  • numpy: Supplies np.nan for explicit null injection into the dummy dataset.

3. Missing Data in OHLCV Context

Missing values in financial time-series data arise from two distinct causes, each requiring a different resolution strategy:

CauseExampleCorrect Resolution
No trades in the periodLow-liquidity asset at 3 AMCarry forward last known price; set volume to zero
Transmission failureAPI timeout, partial responseImpute from adjacent candles
Corrupt placeholderExchange writes NaN for a bad tickImpute or drop depending on surrounding context

Dropping all rows with nulls is too aggressive — it removes valid candle windows that simply had no trading activity. Imputation strategies must be applied selectively by column type.


4. Imputation Strategy Definitions

Forward Fill (ffill) Forward fill propagates the most recent non-null value forward in time to fill subsequent null cells. In OHLCV context: if the price at 00:03 is unknown, the last known price at 00:02 is carried forward. This is the standard convention in financial data pipelines because, in the absence of any trade, the last traded price is the best available estimate of the current price.

Backward Fill (bfill) Backward fill propagates the next available non-null value backward in time to fill preceding null cells. Applied only to resolve leading nulls at the very start of a series where no prior value exists for forward propagation. In practice, bfill is a last-resort fallback applied after ffill has already resolved the majority of missing values.

Zero Fill Volume is not a price — it measures quantity of trades executed. A missing volume record indicates that no trades occurred in that candle window, not that the data was lost in transmission. Carrying forward the previous volume would incorrectly imply trading activity. Zero fill is therefore the correct and standard resolution for missing volume.

Linear Interpolation Linear interpolation estimates a missing value by computing a straight line between the two nearest known values and placing the missing point on that line. For example, if the price at 00:02 is 42200 and at 00:04 is 42400, interpolation places 00:03 at 42300. This is appropriate for slowly moving continuous signals but is generally avoided for OHLCV data because price does not move linearly between candles — markets jump. Forward fill is preferred because it makes no assumption about the direction of price movement.


5. Dummy Dataset

[3]
raw_data = {
    "timestamp": [
        1704067200000,
        1704067260000,
        1704067320000,
        1704067380000,
        1704067440000,
        1704067500000,
    ],
    "open":   [42100.0, np.nan,  42200.0, np.nan,  42500.0, 42550.0],
    "high":   [42300.0, np.nan,  42400.0, 42350.0, 42700.0, 42800.0],
    "low":    [41900.0, np.nan,  42100.0, 42000.0, 42400.0, 42450.0],
    "close":  [42200.0, np.nan,  42350.0, 42300.0, 42600.0, 42700.0],
    "volume": [10.5,    np.nan,  8.2,     np.nan,  12.3,    9.1    ],
}

df_raw = pd.DataFrame(raw_data)

print("--- Raw OHLCV Data (with nulls) ---")
display(df_raw)

print("\n--- Null Count Per Column ---")
print(df_raw.isnull().sum())
--- Raw OHLCV Data (with nulls) ---
timestamp open high low close volume
0 1704067200000 42100.0 42300.0 41900.0 42200.0 10.5
1 1704067260000 NaN NaN NaN NaN NaN
2 1704067320000 42200.0 42400.0 42100.0 42350.0 8.2
3 1704067380000 NaN 42350.0 42000.0 42300.0 NaN
4 1704067440000 42500.0 42700.0 42400.0 42600.0 12.3
5 1704067500000 42550.0 42800.0 42450.0 42700.0 9.1

--- Null Count Per Column ---
timestamp    0
open         2
high         1
low          1
close        1
volume       2
dtype: int64

Code Logic

  • Row 1 contains fully null OHLCV values, simulating a missing candle period with no recorded trades.
  • Row 3 contains partial nulls (open, volume) to simulate a partially transmitted API response.
  • df_raw.isnull().sum(): Produces a per-column null count audit prior to imputation.

6. Missing Value Handler Function

[4]
def handle_missing(df: pd.DataFrame) -> pd.DataFrame:

    # --- Timestamp → Datetime Conversion ---
    df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
    df = df.drop(columns=["timestamp"])

    # --- Sort Chronologically ---
    df = df.sort_values("datetime", ignore_index=True)

    # --- Forward Fill: Price Columns ---
    price_cols = ["open", "high", "low", "close"]
    df[price_cols] = df[price_cols].ffill()

    # --- Backward Fill: Resolve Leading Nulls ---
    df[price_cols] = df[price_cols].bfill()

    # --- Zero Fill: Volume ---
    df["volume"] = df["volume"].fillna(0.0)

    # --- Final Null Safety Drop ---
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])

    return df[["datetime", "open", "high", "low", "close", "volume"]]

Code Logic

Timestamp to datetime

  • pd.to_datetime(df["timestamp"], unit="ms", utc=True): Converts Unix millisecond integers to UTC-aware datetime objects. Chronological sorting — which forward fill depends on — operates correctly on datetime objects but produces unreliable results if applied to raw integer timestamps that happen to be unsorted.
  • df.drop(columns=["timestamp"]): Removes the raw integer column after conversion.

Sort

  • sort_values("datetime"): Chronological ordering is a hard requirement before any fill operation. ffill propagates values in row order — if the DataFrame is unsorted, it will propagate values in the wrong temporal direction.

Forward fill — price columns

  • df[price_cols].ffill(): Propagates the most recent valid price forward into null cells. See Section 4 for full definition. Applied first because it resolves the majority of missing values in a typical OHLCV series.

Backward fill — leading nulls

  • df[price_cols].bfill(): Applied after ffill exclusively to resolve nulls at the very start of the series where no prior row exists for forward propagation. See Section 4 for full definition.

Zero fill — volume

  • df["volume"].fillna(0.0): Missing volume is imputed as zero. See Section 4 for full definition and rationale.

Final safety drop

  • dropna(subset=[...]): Removes any row where imputation was not possible — for example, an entirely null series slice with no adjacent values to fill from.

7. Execution

[5]
df_filled = handle_missing(df_raw)

print("--- Imputed OHLCV Data ---")
display(df_filled)

print("\n--- Null Count After Imputation ---")
print(df_filled.isnull().sum())

print("\n--- Schema Summary ---")
df_filled.info()
--- Imputed OHLCV Data ---
datetime open high low close volume
0 2024-01-01 00:00:00+00:00 42100.0 42300.0 41900.0 42200.0 10.5
1 2024-01-01 00:01:00+00:00 42100.0 42300.0 41900.0 42200.0 0.0
2 2024-01-01 00:02:00+00:00 42200.0 42400.0 42100.0 42350.0 8.2
3 2024-01-01 00:03:00+00:00 42200.0 42350.0 42000.0 42300.0 0.0
4 2024-01-01 00:04:00+00:00 42500.0 42700.0 42400.0 42600.0 12.3
5 2024-01-01 00:05:00+00:00 42550.0 42800.0 42450.0 42700.0 9.1

--- Null Count After Imputation ---
datetime    0
open        0
high        0
low         0
close       0
volume      0
dtype: int64

--- Schema Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   datetime  6 non-null      datetime64[ns, UTC]
 1   open      6 non-null      float64            
 2   high      6 non-null      float64            
 3   low       6 non-null      float64            
 4   close     6 non-null      float64            
 5   volume    6 non-null      float64            
dtypes: datetime64[ns, UTC](1), float64(5)
memory usage: 420.0 bytes
[ ]