Handle Missing Data
Detect, classify, and fill gaps in price data — forward-fill, interpolation, and gap-flagging strategies for OHLCV series.
Missing Value Handling Framework
This notebook defines a standardized protocol for detecting and resolving missing values in OHLCV time-series data. It covers null detection, timestamp-to-datetime conversion, forward-fill imputation for price columns, and volume-specific zero-fill strategy on a representative dummy dataset.
1. Dependency Installation
!pip install pandasRequirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2) Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2) Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2) Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
2. Library Imports
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as npCode Logic
pandas: Provides the DataFrame structure and null-handling operations (fillna,ffill,bfill,isna).numpy: Suppliesnp.nanfor explicit null injection into the dummy dataset.
3. Missing Data in OHLCV Context
Missing values in financial time-series data arise from two distinct causes, each requiring a different resolution strategy:
| Cause | Example | Correct Resolution |
|---|---|---|
| No trades in the period | Low-liquidity asset at 3 AM | Carry forward last known price; set volume to zero |
| Transmission failure | API timeout, partial response | Impute from adjacent candles |
| Corrupt placeholder | Exchange writes NaN for a bad tick | Impute or drop depending on surrounding context |
Dropping all rows with nulls is too aggressive — it removes valid candle windows that simply had no trading activity. Imputation strategies must be applied selectively by column type.
4. Imputation Strategy Definitions
Forward Fill (ffill)
Forward fill propagates the most recent non-null value forward in time
to fill subsequent null cells. In OHLCV context: if the price at 00:03
is unknown, the last known price at 00:02 is carried forward. This is
the standard convention in financial data pipelines because, in the
absence of any trade, the last traded price is the best available
estimate of the current price.
Backward Fill (bfill)
Backward fill propagates the next available non-null value backward in
time to fill preceding null cells. Applied only to resolve leading nulls
at the very start of a series where no prior value exists for forward
propagation. In practice, bfill is a last-resort fallback applied
after ffill has already resolved the majority of missing values.
Zero Fill Volume is not a price — it measures quantity of trades executed. A missing volume record indicates that no trades occurred in that candle window, not that the data was lost in transmission. Carrying forward the previous volume would incorrectly imply trading activity. Zero fill is therefore the correct and standard resolution for missing volume.
Linear Interpolation Linear interpolation estimates a missing value by computing a straight line between the two nearest known values and placing the missing point on that line. For example, if the price at 00:02 is 42200 and at 00:04 is 42400, interpolation places 00:03 at 42300. This is appropriate for slowly moving continuous signals but is generally avoided for OHLCV data because price does not move linearly between candles — markets jump. Forward fill is preferred because it makes no assumption about the direction of price movement.
5. Dummy Dataset
raw_data = {
"timestamp": [
1704067200000,
1704067260000,
1704067320000,
1704067380000,
1704067440000,
1704067500000,
],
"open": [42100.0, np.nan, 42200.0, np.nan, 42500.0, 42550.0],
"high": [42300.0, np.nan, 42400.0, 42350.0, 42700.0, 42800.0],
"low": [41900.0, np.nan, 42100.0, 42000.0, 42400.0, 42450.0],
"close": [42200.0, np.nan, 42350.0, 42300.0, 42600.0, 42700.0],
"volume": [10.5, np.nan, 8.2, np.nan, 12.3, 9.1 ],
}
df_raw = pd.DataFrame(raw_data)
print("--- Raw OHLCV Data (with nulls) ---")
display(df_raw)
print("\n--- Null Count Per Column ---")
print(df_raw.isnull().sum())--- Raw OHLCV Data (with nulls) ---
| timestamp | open | high | low | close | volume | |
|---|---|---|---|---|---|---|
| 0 | 1704067200000 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 10.5 |
| 1 | 1704067260000 | NaN | NaN | NaN | NaN | NaN |
| 2 | 1704067320000 | 42200.0 | 42400.0 | 42100.0 | 42350.0 | 8.2 |
| 3 | 1704067380000 | NaN | 42350.0 | 42000.0 | 42300.0 | NaN |
| 4 | 1704067440000 | 42500.0 | 42700.0 | 42400.0 | 42600.0 | 12.3 |
| 5 | 1704067500000 | 42550.0 | 42800.0 | 42450.0 | 42700.0 | 9.1 |
--- Null Count Per Column --- timestamp 0 open 2 high 1 low 1 close 1 volume 2 dtype: int64
Code Logic
- Row 1 contains fully null OHLCV values, simulating a missing candle period with no recorded trades.
- Row 3 contains partial nulls (
open,volume) to simulate a partially transmitted API response. df_raw.isnull().sum(): Produces a per-column null count audit prior to imputation.
6. Missing Value Handler Function
def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
# --- Timestamp → Datetime Conversion ---
df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
df = df.drop(columns=["timestamp"])
# --- Sort Chronologically ---
df = df.sort_values("datetime", ignore_index=True)
# --- Forward Fill: Price Columns ---
price_cols = ["open", "high", "low", "close"]
df[price_cols] = df[price_cols].ffill()
# --- Backward Fill: Resolve Leading Nulls ---
df[price_cols] = df[price_cols].bfill()
# --- Zero Fill: Volume ---
df["volume"] = df["volume"].fillna(0.0)
# --- Final Null Safety Drop ---
df = df.dropna(subset=["open", "high", "low", "close", "volume"])
return df[["datetime", "open", "high", "low", "close", "volume"]]Code Logic
Timestamp to datetime
pd.to_datetime(df["timestamp"], unit="ms", utc=True): Converts Unix millisecond integers to UTC-aware datetime objects. Chronological sorting — which forward fill depends on — operates correctly on datetime objects but produces unreliable results if applied to raw integer timestamps that happen to be unsorted.df.drop(columns=["timestamp"]): Removes the raw integer column after conversion.
Sort
sort_values("datetime"): Chronological ordering is a hard requirement before any fill operation.ffillpropagates values in row order — if the DataFrame is unsorted, it will propagate values in the wrong temporal direction.
Forward fill — price columns
df[price_cols].ffill(): Propagates the most recent valid price forward into null cells. See Section 4 for full definition. Applied first because it resolves the majority of missing values in a typical OHLCV series.
Backward fill — leading nulls
df[price_cols].bfill(): Applied afterffillexclusively to resolve nulls at the very start of the series where no prior row exists for forward propagation. See Section 4 for full definition.
Zero fill — volume
df["volume"].fillna(0.0): Missing volume is imputed as zero. See Section 4 for full definition and rationale.
Final safety drop
dropna(subset=[...]): Removes any row where imputation was not possible — for example, an entirely null series slice with no adjacent values to fill from.
7. Execution
df_filled = handle_missing(df_raw)
print("--- Imputed OHLCV Data ---")
display(df_filled)
print("\n--- Null Count After Imputation ---")
print(df_filled.isnull().sum())
print("\n--- Schema Summary ---")
df_filled.info()--- Imputed OHLCV Data ---
| datetime | open | high | low | close | volume | |
|---|---|---|---|---|---|---|
| 0 | 2024-01-01 00:00:00+00:00 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 10.5 |
| 1 | 2024-01-01 00:01:00+00:00 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 0.0 |
| 2 | 2024-01-01 00:02:00+00:00 | 42200.0 | 42400.0 | 42100.0 | 42350.0 | 8.2 |
| 3 | 2024-01-01 00:03:00+00:00 | 42200.0 | 42350.0 | 42000.0 | 42300.0 | 0.0 |
| 4 | 2024-01-01 00:04:00+00:00 | 42500.0 | 42700.0 | 42400.0 | 42600.0 | 12.3 |
| 5 | 2024-01-01 00:05:00+00:00 | 42550.0 | 42800.0 | 42450.0 | 42700.0 | 9.1 |
--- Null Count After Imputation --- datetime 0 open 0 high 0 low 0 close 0 volume 0 dtype: int64 --- Schema Summary --- <class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datetime 6 non-null datetime64[ns, UTC] 1 open 6 non-null float64 2 high 6 non-null float64 3 low 6 non-null float64 4 close 6 non-null float64 5 volume 6 non-null float64 dtypes: datetime64[ns, UTC](1), float64(5) memory usage: 420.0 bytes