Notebooks/Timezone Standardisation
Data·Cleaning·Beginner

Timezone Standardisation

Convert timestamps across exchange timezones to UTC, handle DST edge cases, and align multi-source data to a common index.

timezonesUTCdatetime

Timezone Standardization Framework

This notebook defines a standardized protocol for normalizing OHLCV timestamp columns to UTC. It covers naive timestamp localization, timezone-aware conversion, and UTC datetime derivation on a representative dummy dataset containing mixed timezone sources.


1. Dependency Installation

[10]
!pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

2. Library Imports

[11]
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

3. Why Timezone Standardization Matters

Every exchange operates its API in its own timezone or returns timestamps with local offsets. When data from multiple exchanges is merged without normalization, the same real-world moment in time appears as different values, causing misaligned joins, incorrect resampling, and silently corrupted backtests.

UTC (Coordinated Universal Time) is the international standard reference timezone — it has no offset, no daylight saving adjustment, and no regional variation. All financial data pipelines standardize to UTC because it is the only timezone that remains constant and unambiguous across all exchanges, jurisdictions, and calendar events.

Timezone-naive vs timezone-aware datetimes A naive datetime has no timezone attached — 2024-01-01 00:00:00 could mean midnight in any timezone. A timezone-aware datetime carries an explicit offset — 2024-01-01 00:00:00+05:30 means midnight Indian Standard Time, which equals 2024-12-31 18:30:00 UTC. Naive datetimes must be localized (assigned a timezone) before they can be converted to UTC. Aware datetimes are directly converted.


4. Dummy Dataset

[12]
raw_data = {
    "exchange": ["Bybit",                "Kraken",              "OKX",                  "Binance",             "Coinbase"            ],
    "datetime_str": [
        "2024-01-01 00:00:00",           # UTC naive
        "2024-01-01 05:30:00+05:30",     # IST  (UTC+5:30)
        "2024-01-01 09:00:00+09:00",     # JST  (UTC+9)
        "2023-12-31 19:00:00-05:00",     # EST  (UTC-5) - Corrected to map to 2024-01-01 00:00:00 UTC
        "2024-01-01 08:00:00+08:00",     # SGT  (UTC+8)
    ],
    "open":   [42100.0, 42200.0, 42150.0, 42300.0, 42250.0],
    "high":   [42300.0, 42400.0, 42350.0, 42500.0, 42450.0],
    "low":    [41900.0, 42000.0, 41950.0, 42100.0, 42050.0],
    "close":  [42200.0, 42150.0, 42300.0, 42250.0, 42400.0],
    "volume": [10.5,    8.2,     9.1,     11.3,    7.6    ],
}

df_raw = pd.DataFrame(raw_data)

print("--- Raw Data (Mixed Timezones) ---")
display(df_raw[["exchange", "datetime_str"]])
--- Raw Data (Mixed Timezones) ---
exchange datetime_str
0 Bybit 2024-01-01 00:00:00
1 Kraken 2024-01-01 05:30:00+05:30
2 OKX 2024-01-01 09:00:00+09:00
3 Binance 2023-12-31 19:00:00-05:00
4 Coinbase 2024-01-01 08:00:00+08:00

Code Logic

  • All five rows represent the same absolute moment in time: 2024-01-01 00:00:00 UTC. After normalization all datetime values must be identical — this is used as the verification condition in Section 6.
  • Row 0 is timezone-naive (Bybit returns UTC strings without an offset marker); all other rows carry explicit UTC offsets.

5. Timezone Standardization Function

[14]
def standardize_timezone(df: pd.DataFrame, datetime_col: str = "datetime_str") -> pd.DataFrame:

    # Parse all datetime strings to UTC-aware datetime
    # utc=True coerces all offsets to UTC; naive strings are treated as UTC
    df["datetime"] = pd.to_datetime(df[datetime_col], utc=True, format='ISO8601')

    # Explicitly convert to UTC to guarantee uniform tzinfo across all rows
    df["datetime"] = df["datetime"].dt.tz_convert("UTC")

    # Drop the raw source column
    df = df.drop(columns=[datetime_col])

    return df[["datetime", "exchange", "open", "high", "low", "close", "volume"]]

Code Logic

  • pd.to_datetime(df[datetime_col], utc=True): Parses a mixed column of timezone-naive and timezone-aware ISO 8601 strings in a single vectorized pass. utc=True applies two rules simultaneously — aware strings are converted to UTC by applying their offset arithmetic; naive strings (no offset marker) are assumed to be UTC and localized directly.
  • dt.tz_convert("UTC"): Applies an explicit final conversion pass to guarantee that every row carries a uniform UTC tzinfo object regardless of how it was parsed. Without this step, edge-case aware strings in non-UTC zones may retain their original offset internally even after utc=True parsing.
  • df.drop(columns=[datetime_col]): Removes the raw source string column, which is fully superseded by the normalized datetime column.

6. Execution

[15]
df_utc = standardize_timezone(df_raw, datetime_col="datetime_str")

print("--- Standardized UTC Datetimes ---")
display(df_utc[["exchange", "datetime"]])

print("\n--- Verification: All datetimes must be equal (same UTC moment) ---")
n_unique = df_utc["datetime"].nunique()
print(f"{n_unique} unique datetime value(s) — {'PASS' if n_unique == 1 else 'FAIL'}")

print("\n--- Schema Summary ---")
df_utc.info()
--- Standardized UTC Datetimes ---
exchange datetime
0 Bybit 2024-01-01 00:00:00+00:00
1 Kraken 2024-01-01 00:00:00+00:00
2 OKX 2024-01-01 00:00:00+00:00
3 Binance 2024-01-01 00:00:00+00:00
4 Coinbase 2024-01-01 00:00:00+00:00

--- Verification: All datetimes must be equal (same UTC moment) ---
1 unique datetime value(s) — PASS

--- Schema Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   datetime  5 non-null      datetime64[ns, UTC]
 1   exchange  5 non-null      object             
 2   open      5 non-null      float64            
 3   high      5 non-null      float64            
 4   low       5 non-null      float64            
 5   close     5 non-null      float64            
 6   volume    5 non-null      float64            
dtypes: datetime64[ns, UTC](1), float64(5), object(1)
memory usage: 412.0+ bytes
[ ]