Clean OHLCV Data
Full OHLCV cleaning pipeline: outlier removal, zero-volume detection, HLOC consistency checks, and duplicate bar elimination.
OHLCV Data Cleaning Framework
This notebook defines a standardized protocol for cleaning raw OHLCV candlestick data. It covers duplicate removal, negative value filtering, zero-price detection, column type enforcement, and timestamp-to-datetime conversion on a representative dummy dataset.
1. Dependency Installation
!pip install pandasRequirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2) Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2) Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0) Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2) Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
2. Library Imports
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as npCode Logic
pandas: Provides the DataFrame structure and vectorized cleaning operations.numpy: Suppliesnp.nanfor explicit null injection into the dummy dataset.
3. What Is OHLCV Data?
OHLCV stands for Open, High, Low, Close, Volume — the five fields that describe a single candlestick bar in financial time-series data.
| Field | Definition |
|---|
| open | Price at which the first trade executed in the candle window |
| high | Highest traded price within the candle window |
| low | Lowest traded price within the candle window |
| close | Price at which the last trade executed in the candle window |
| volume | Total quantity of the asset traded within the candle window |
| datetime | UTC timestamp marking the start of the candle window |
Raw OHLCV data received from exchange APIs frequently contains corruption artifacts: duplicate rows from overlapping API requests, negative prices from encoding errors, zero-filled placeholder candles for periods with no trades, and null values from partially transmitted responses. Cleaning is a mandatory pre-processing step before any quantitative computation.
4. Dummy Dataset
raw_data = {
"timestamp": [
1704067200000, 1704067200000, # duplicate rows
1704067260000,
1704067320000,
1704067380000,
1704067440000,
],
"open": [42100.0, 42100.0, 42200.0, -100.0, 0.0, 42500.0],
"high": [42300.0, 42300.0, 42400.0, 42350.0, 0.0, 42700.0],
"low": [41900.0, 41900.0, 42100.0, 42000.0, 0.0, 42400.0],
"close": [42200.0, 42200.0, 42350.0, np.nan, 0.0, 42600.0],
"volume": [10.5, 10.5, 8.2, 7.1, np.nan, 12.3 ],
}
df_raw = pd.DataFrame(raw_data)
print("--- Raw OHLCV Data ---")
display(df_raw)--- Raw OHLCV Data ---
| timestamp | open | high | low | close | volume | |
|---|---|---|---|---|---|---|
| 0 | 1704067200000 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 10.5 |
| 1 | 1704067200000 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 10.5 |
| 2 | 1704067260000 | 42200.0 | 42400.0 | 42100.0 | 42350.0 | 8.2 |
| 3 | 1704067320000 | -100.0 | 42350.0 | 42000.0 | NaN | 7.1 |
| 4 | 1704067380000 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 5 | 1704067440000 | 42500.0 | 42700.0 | 42400.0 | 42600.0 | 12.3 |
Code Logic
- Row 0 and Row 1 are exact duplicates, including timestamp, to simulate repeated API responses.
- Row 3 contains a negative
openvalue to simulate a data corruption artifact. - Row 4 contains zero prices and a null
volumeto simulate a missing candle written as zeros. - Row 3 contains a null
closeto simulate a partially formed candle.
5. Cleaning Function
def clean_ohlcv(df: pd.DataFrame) -> pd.DataFrame:
df = df.copy()
# --- Type Enforcement ---
df["timestamp"] = df["timestamp"].astype("int64")
df[["open", "high", "low", "close", "volume"]] = (
df[["open", "high", "low", "close", "volume"]].astype("float64")
)
# --- Timestamp → Datetime Conversion ---
# Unix millisecond timestamps are not human-readable and cannot be
# used directly in time-series operations. Converting to UTC datetime
# enables resampling, timezone handling, and readable inspection.
df["datetime"] = pd.to_datetime(df["timestamp"], unit="ms", utc=True)
df = df.drop(columns=["timestamp"])
# --- Duplicate Removal ---
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["datetime"], keep="first")
# --- Null Removal ---
df = df.dropna(subset=["open", "high", "low", "close", "volume"])
# --- Zero and Negative Price Removal ---
price_cols = ["open", "high", "low", "close"]
df = df[(df[price_cols] > 0).all(axis=1)]
# --- Negative Volume Removal ---
df = df[df["volume"] >= 0]
# --- Structural Integrity: High >= Low ---
df = df[df["high"] >= df["low"]]
# --- Sort and Reset ---
df = df.sort_values("datetime", ignore_index=True)
return df[["datetime", "open", "high", "low", "close", "volume"]]Code Logic
Type enforcement
.astype("int64") / .astype("float64"): Enforces typed columns before any filtering to avoid silent comparison errors on mixed-type data received as strings from the API.
Timestamp to datetime
- A Unix millisecond timestamp is an integer representing the number of milliseconds elapsed since 1 January 1970 00:00:00 UTC (the Unix epoch). Exchange APIs universally return timestamps in this format because integers are unambiguous across timezones and locales. However, integers cannot be used directly for time-based operations such as resampling or plotting.
pd.to_datetime(df["timestamp"], unit="ms", utc=True): Converts the integer millisecond value to a human-readable UTC-aware datetime object.unit="ms"tells pandas the integer is in milliseconds.utc=Trueattaches the UTC timezone, preventing ambiguous naive datetime comparisons downstream.df.drop(columns=["timestamp"]): Removes the raw integer column after conversion — thedatetimecolumn supersedes it for all downstream operations.
Duplicate removal
drop_duplicates(): Removes fully identical rows resulting from repeated API calls or overlapping data ingestion windows.drop_duplicates(subset=["datetime"], keep="first"): Removes rows sharing the same datetime where values may differ — both cannot represent the same candle window, so only the first occurrence is retained.
Null removal
dropna(subset=[...]): Discards any row missing a required OHLCV value. Partial candles cannot be used for downstream computation — an undefinedcloseprice makes the bar arithmetically incomplete.
Zero and negative price removal
(df[price_cols] > 0).all(axis=1): Vectorized row-level mask. A price of zero or below is physically impossible for a traded asset — both conditions indicate corrupt or placeholder data written by the exchange during a period of zero activity.
Volume
df["volume"] >= 0: Permits zero volume (valid for illiquid periods where no trades executed) while rejecting negative volume as a definitive data error.
Structural integrity
df["high"] >= df["low"]: Enforces the fundamental OHLCV invariant — the highest price in a candle window cannot be less than the lowest price. Violation indicates data corruption.
Sort
sort_values("datetime", ignore_index=True): Restores chronological order after row removal and resets the integer index.
6. Execution
df_clean = clean_ohlcv(df_raw)
print("--- Cleaned OHLCV Data ---")
display(df_clean)
print("\n--- Schema Summary ---")
df_clean.info()--- Cleaned OHLCV Data ---
| datetime | open | high | low | close | volume | |
|---|---|---|---|---|---|---|
| 0 | 2024-01-01 00:00:00+00:00 | 42100.0 | 42300.0 | 41900.0 | 42200.0 | 10.5 |
| 1 | 2024-01-01 00:01:00+00:00 | 42200.0 | 42400.0 | 42100.0 | 42350.0 | 8.2 |
| 2 | 2024-01-01 00:04:00+00:00 | 42500.0 | 42700.0 | 42400.0 | 42600.0 | 12.3 |
--- Schema Summary --- <class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 datetime 3 non-null datetime64[ns, UTC] 1 open 3 non-null float64 2 high 3 non-null float64 3 low 3 non-null float64 4 close 3 non-null float64 5 volume 3 non-null float64 dtypes: datetime64[ns, UTC](1), float64(5) memory usage: 276.0 bytes