Clean OHLCV Data

Full OHLCV cleaning pipeline: outlier removal, zero-volume detection, HLOC consistency checks, and duplicate bar elimination.

cleaningoutliersvalidation

OHLCV Data Cleaning Framework

This notebook defines a standardized protocol for cleaning raw OHLCV candlestick data. It covers duplicate removal, negative value filtering, zero-price detection, column type enforcement, and timestamp-to-datetime conversion on a representative dummy dataset.

1. Dependency Installation

[1]

!pip install pandas

Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

2. Library Imports

[2]

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

Code Logic

pandas: Provides the DataFrame structure and vectorized cleaning operations.
numpy: Supplies np.nan for explicit null injection into the dummy dataset.

3. What Is OHLCV Data?

OHLCV stands for Open, High, Low, Close, Volume — the five fields that describe a single candlestick bar in financial time-series data.

| Field | Definition | |---| | open | Price at which the first trade executed in the candle window | | high | Highest traded price within the candle window | | low | Lowest traded price within the candle window | | close | Price at which the last trade executed in the candle window | | volume | Total quantity of the asset traded within the candle window | | datetime | UTC timestamp marking the start of the candle window |

Raw OHLCV data received from exchange APIs frequently contains corruption artifacts: duplicate rows from overlapping API requests, negative prices from encoding errors, zero-filled placeholder candles for periods with no trades, and null values from partially transmitted responses. Cleaning is a mandatory pre-processing step before any quantitative computation.

4. Dummy Dataset

[3]

raw_data = {
    "timestamp": [
        1704067200000, 1704067200000,  # duplicate rows
        1704067260000,
        1704067320000,
        1704067380000,
        1704067440000,
    ],
    "open":   [42100.0, 42100.0,  42200.0, -100.0,   0.0,     42500.0],
    "high":   [42300.0, 42300.0,  42400.0,  42350.0, 0.0,     42700.0],
    "low":    [41900.0, 41900.0,  42100.0,  42000.0, 0.0,     42400.0],
    "close":  [42200.0, 42200.0,  42350.0,  np.nan,  0.0,     42600.0],
    "volume": [10.5,    10.5,     8.2,      7.1,     np.nan,  12.3   ],
}

df_raw = pd.DataFrame(raw_data)

print("--- Raw OHLCV Data ---")
display(df_raw)

--- Raw OHLCV Data ---

	timestamp	open	high	low	close	volume
0	1704067200000	42100.0	42300.0	41900.0	42200.0	10.5
1	1704067200000	42100.0	42300.0	41900.0	42200.0	10.5
2	1704067260000	42200.0	42400.0	42100.0	42350.0	8.2
3	1704067320000	-100.0	42350.0	42000.0	NaN	7.1
4	1704067380000	0.0	0.0	0.0	0.0	NaN
5	1704067440000	42500.0	42700.0	42400.0	42600.0	12.3

Code Logic

Row 0 and Row 1 are exact duplicates, including timestamp, to simulate repeated API responses.
Row 3 contains a negative open value to simulate a data corruption artifact.
Row 4 contains zero prices and a null volume to simulate a missing candle written as zeros.
Row 3 contains a null close to simulate a partially formed candle.

5. Cleaning Function

[4]

def clean_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # --- Type Enforcement ---
    df["timestamp"] = df["timestamp"].astype("int64")
    df[["open", "high", "low", "close", "volume"]] = (
        df[["open", "high", "low", "close", "volume"]].astype("float64")
    )

    # --- Timestamp → Datetime Conversion ---
    # Unix millisecond timestamps are not human-readable and cannot be
    # used directly in time-series operations. Converting to UTC datetime
    # enables resampling, timezone handling, and readable inspection.
    df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
    df = df.drop(columns=["timestamp"])

    # --- Duplicate Removal ---
    df = df.drop_duplicates()
    df = df.drop_duplicates(subset=["datetime"], keep="first")

    # --- Null Removal ---
    df = df.dropna(subset=["open", "high", "low", "close", "volume"])

    # --- Zero and Negative Price Removal ---
    price_cols = ["open", "high", "low", "close"]
    df = df[(df[price_cols] > 0).all(axis=1)]

    # --- Negative Volume Removal ---
    df = df[df["volume"] >= 0]

    # --- Structural Integrity: High >= Low ---
    df = df[df["high"] >= df["low"]]

    # --- Sort and Reset ---
    df = df.sort_values("datetime", ignore_index=True)

    return df[["datetime", "open", "high", "low", "close", "volume"]]

Code Logic

Type enforcement

.astype("int64") / .astype("float64"): Enforces typed columns before any filtering to avoid silent comparison errors on mixed-type data received as strings from the API.

Timestamp to datetime

A Unix millisecond timestamp is an integer representing the number of milliseconds elapsed since 1 January 1970 00:00:00 UTC (the Unix epoch). Exchange APIs universally return timestamps in this format because integers are unambiguous across timezones and locales. However, integers cannot be used directly for time-based operations such as resampling or plotting.
pd.to_datetime(df["timestamp"], unit="ms", utc=True): Converts the integer millisecond value to a human-readable UTC-aware datetime object. unit="ms" tells pandas the integer is in milliseconds. utc=True attaches the UTC timezone, preventing ambiguous naive datetime comparisons downstream.
df.drop(columns=["timestamp"]): Removes the raw integer column after conversion — the datetime column supersedes it for all downstream operations.

Duplicate removal

drop_duplicates(): Removes fully identical rows resulting from repeated API calls or overlapping data ingestion windows.
drop_duplicates(subset=["datetime"], keep="first"): Removes rows sharing the same datetime where values may differ — both cannot represent the same candle window, so only the first occurrence is retained.

Null removal

dropna(subset=[...]): Discards any row missing a required OHLCV value. Partial candles cannot be used for downstream computation — an undefined close price makes the bar arithmetically incomplete.

Zero and negative price removal

(df[price_cols] > 0).all(axis=1): Vectorized row-level mask. A price of zero or below is physically impossible for a traded asset — both conditions indicate corrupt or placeholder data written by the exchange during a period of zero activity.

Volume

df["volume"] >= 0: Permits zero volume (valid for illiquid periods where no trades executed) while rejecting negative volume as a definitive data error.

Structural integrity

df["high"] >= df["low"]: Enforces the fundamental OHLCV invariant — the highest price in a candle window cannot be less than the lowest price. Violation indicates data corruption.

Sort

sort_values("datetime", ignore_index=True): Restores chronological order after row removal and resets the integer index.

6. Execution

[5]

df_clean = clean_ohlcv(df_raw)

print("--- Cleaned OHLCV Data ---")
display(df_clean)

print("\n--- Schema Summary ---")
df_clean.info()

--- Cleaned OHLCV Data ---

	datetime	open	high	low	close	volume
0	2024-01-01 00:00:00+00:00	42100.0	42300.0	41900.0	42200.0	10.5
1	2024-01-01 00:01:00+00:00	42200.0	42400.0	42100.0	42350.0	8.2
2	2024-01-01 00:04:00+00:00	42500.0	42700.0	42400.0	42600.0	12.3


--- Schema Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   datetime  3 non-null      datetime64[ns, UTC]
 1   open      3 non-null      float64            
 2   high      3 non-null      float64            
 3   low       3 non-null      float64            
 4   close     3 non-null      float64            
 5   volume    3 non-null      float64            
dtypes: datetime64[ns, UTC](1), float64(5)
memory usage: 276.0 bytes

[ ]