Data·Cleaning·Beginner

Symbol Format Handling

Normalise trading pair formats across exchanges — handle BTC/USDT, BTCUSDT, BTC-USDT variants in a unified pipeline.

symbolsnormalisationexchange

Symbol Format Standardization Framework

This notebook defines a standardized protocol for normalizing trading symbol identifiers across exchange-specific formats into a unified canonical schema. It covers delimiter normalization, quote currency inference, instrument type tagging, and case enforcement on a representative dummy dataset.


1. Dependency Installation

[ ]
!pip install pandas
Requirement already satisfied: pandas in /usr/local/lib/python3.12/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.26.0 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.12/dist-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.12/dist-packages (from pandas) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.12/dist-packages (from pandas) (2026.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.12/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

2. Library Imports

[ ]
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import re

Code Logic

  • pandas: Provides the DataFrame structure for tabular symbol mapping operations.
  • re: Supplies regex pattern matching for delimiter-agnostic symbol parsing.

3. Why Symbol Normalization Matters

Every exchange uses its own naming convention for the same underlying instrument. The Bitcoin/USDT perpetual futures contract appears as:

ExchangeRaw SymbolProblem
BybitBTCUSDTNo delimiter — base and quote are concatenated
KrakenXBT/USDSlash delimiter; XBT is a non-standard alias for BTC
OKXBTC-USDT-SWAPHyphen delimiter with instrument type suffix
BinanceETHUSDTSame concatenation format as Bybit
CoinbaseBTC-USDHyphen delimiter, clean format
BitfinextBTCUSDLowercase t prefix; no delimiter
DeribitBTC-31JAN25-50000-COptions contract with expiry, strike, and type

Without normalization, a database join or strategy lookup on BTCUSDT and XBT/USD will fail to match even though both refer to the same instrument. Canonical normalization resolves all variants to a single standard form: BTC-USDT.


4. Dummy Dataset

[ ]
raw_data = {
    "exchange": ["Bybit",     "Kraken",    "OKX",            "Binance",  "Coinbase",  "Bitfinex",  "Deribit"              ],
    "symbol":   ["BTCUSDT",   "XBT/USD",   "BTC-USDT-SWAP",  "ETHUSDT",  "BTC-USD",   "tBTCUSD",   "BTC-31JAN25-50000-C" ],
    "open":     [42100.0,     42150.0,     42200.0,           3200.0,     42180.0,     42170.0,     42160.0               ],
    "high":     [42300.0,     42350.0,     42400.0,           3250.0,     42380.0,     42370.0,     42360.0               ],
    "low":      [41900.0,     41950.0,     42000.0,           3150.0,     41980.0,     41970.0,     41960.0               ],
    "close":    [42200.0,     42250.0,     42300.0,           3220.0,     42280.0,     42270.0,     42260.0               ],
    "volume":   [10.5,        8.2,         9.1,               150.3,      7.6,         6.4,         3.2                   ]
}

df_raw = pd.DataFrame(raw_data)

print("--- Raw Symbol Formats ---")
display(df_raw[["exchange", "symbol"]])
--- Raw Symbol Formats ---
exchange symbol
0 Bybit BTCUSDT
1 Kraken XBT/USD
2 OKX BTC-USDT-SWAP
3 Binance ETHUSDT
4 Coinbase BTC-USD
5 Bitfinex tBTCUSD
6 Deribit BTC-31JAN25-50000-C

5. Symbol Normalization Function

[ ]
ALIAS_MAP = {
    "XBT":  "BTC",
    "XETH": "ETH",
    "XXBT": "BTC",
    "ZUSD": "USD",
}

QUOTE_CURRENCIES = ["USDT", "USDC", "USD", "BTC", "ETH", "BUSD", "EUR"]

def normalize_symbol(raw_symbol: str, exchange: str) -> dict:
    symbol = raw_symbol.upper().strip()

    # Strip Bitfinex leading 't' prefix
    if exchange.lower() == "bitfinex" and symbol.startswith("T"):
        symbol = symbol[1:]

    # Detect Deribit options/futures by expiry date pattern
    if re.search(r"-\d{2}[A-Z]{3}\d{2}-", symbol):
        parts = symbol.split("-")
        base  = ALIAS_MAP.get(parts[0], parts[0])
        return {
            "canonical": symbol,
            "base":      base,
            "quote":     "USD",
            "type":      "option" if symbol.endswith(("-C", "-P")) else "future",
        }

    # Normalize slash delimiter to hyphen
    symbol = symbol.replace("/", "-")

    # Strip instrument type suffixes
    for suffix in ["-SWAP", "-PERP", "-PERPETUAL"]:
        if symbol.endswith(suffix):
            symbol = symbol[: -len(suffix)]
            break

    # Split on hyphen delimiter
    if "-" in symbol:
        parts = symbol.split("-", 1)
        base  = ALIAS_MAP.get(parts[0], parts[0])
        quote = ALIAS_MAP.get(parts[1], parts[1])
        return {"canonical": f"{base}-{quote}", "base": base, "quote": quote, "type": "spot"}

    # Split delimiter-free symbols using known quote currency list
    for q in QUOTE_CURRENCIES:
        if symbol.endswith(q):
            base  = ALIAS_MAP.get(symbol[: -len(q)], symbol[: -len(q)])
            quote = ALIAS_MAP.get(q, q)
            return {"canonical": f"{base}-{quote}", "base": base, "quote": quote, "type": "spot"}

    return {"canonical": symbol, "base": symbol, "quote": "UNKNOWN", "type": "unknown"}


def handle_symbols(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    normalized = df.apply(
        lambda row: normalize_symbol(row["symbol"], row["exchange"]), axis=1
    )

    df["canonical"] = normalized.apply(lambda x: x["canonical"])
    df["base"]      = normalized.apply(lambda x: x["base"])
    df["quote"]     = normalized.apply(lambda x: x["quote"])
    df["type"]      = normalized.apply(lambda x: x["type"])

    return df[["exchange", "symbol", "canonical", "base", "quote", "type",
               "open", "high", "low", "close", "volume"]]

Code Logic

  • ALIAS_MAP: Maps exchange-specific base currency aliases to canonical equivalents. Kraken uses XBT as its internal code for Bitcoin — a legacy identifier from before the BTC ticker became universal. This table resolves it and similar aliases to the standard form.
  • QUOTE_CURRENCIES: Ordered list of known quote currencies for delimiter-free symbol splitting. Order is critical — USDT must appear before USD to prevent BTCUSDT being incorrectly split into base BTCUS and quote T.
  • symbol.startswith("T") with Bitfinex guard: Strips the Bitfinex trading pair prefix t (uppercased to T). Bitfinex prepends t to all spot trading pairs and f to all funding symbols — the guard ensures this strip is not applied to other exchanges where a leading T would be part of the base currency.
  • re.search(r"-\d{2}[A-Z]{3}\d{2}-", symbol): Detects Deribit derivative contracts. The regex matches a segment like -31JAN25- — a two-digit day, three-letter month abbreviation, and two-digit year — that appears in Deribit expiry-dated instruments. Regular spot or perpetual symbols never contain this pattern.
  • symbol.replace("/", "-"): Normalizes Kraken-style slash delimiters to the uniform hyphen separator used throughout the pipeline.
  • Suffix stripping loop: Removes OKX-style instrument type suffixes before base/quote extraction. Without this step, -SWAP would be treated as the quote currency.
  • symbol.split("-", 1): Splits on the first hyphen only (maxsplit=1). Without the limit, BTC-USDT-EXTRA would produce three parts instead of two.
  • df.apply(lambda row: ..., axis=1): Applies row-level normalization, passing both symbol and exchange — exchange identity is required for the Bitfinex prefix rule and any future exchange-specific branching.

6. Execution

[ ]
df_normalized = handle_symbols(df_raw)

print("--- Normalized Symbol Mapping ---")
display(df_normalized[["exchange", "symbol", "canonical", "base", "quote", "type"]])

print("\n--- Full Output with OHLCV ---")
display(df_normalized)

print("\n--- Schema Summary ---")
df_normalized.info()
--- Normalized Symbol Mapping ---
exchange symbol canonical base quote type
0 Bybit BTCUSDT BTC-USDT BTC USDT spot
1 Kraken XBT/USD BTC-USD BTC USD spot
2 OKX BTC-USDT-SWAP BTC-USDT BTC USDT spot
3 Binance ETHUSDT ETH-USDT ETH USDT spot
4 Coinbase BTC-USD BTC-USD BTC USD spot
5 Bitfinex tBTCUSD BTC-USD BTC USD spot
6 Deribit BTC-31JAN25-50000-C BTC-31JAN25-50000-C BTC USD option

--- Full Output with OHLCV ---
exchange symbol canonical base quote type open high low close volume
0 Bybit BTCUSDT BTC-USDT BTC USDT spot 42100.0 42300.0 41900.0 42200.0 10.5
1 Kraken XBT/USD BTC-USD BTC USD spot 42150.0 42350.0 41950.0 42250.0 8.2
2 OKX BTC-USDT-SWAP BTC-USDT BTC USDT spot 42200.0 42400.0 42000.0 42300.0 9.1
3 Binance ETHUSDT ETH-USDT ETH USDT spot 3200.0 3250.0 3150.0 3220.0 150.3
4 Coinbase BTC-USD BTC-USD BTC USD spot 42180.0 42380.0 41980.0 42280.0 7.6
5 Bitfinex tBTCUSD BTC-USD BTC USD spot 42170.0 42370.0 41970.0 42270.0 6.4
6 Deribit BTC-31JAN25-50000-C BTC-31JAN25-50000-C BTC USD option 42160.0 42360.0 41960.0 42260.0 3.2

--- Schema Summary ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   exchange   7 non-null      object 
 1   symbol     7 non-null      object 
 2   canonical  7 non-null      object 
 3   base       7 non-null      object 
 4   quote      7 non-null      object 
 5   type       7 non-null      object 
 6   open       7 non-null      float64
 7   high       7 non-null      float64
 8   low        7 non-null      float64
 9   close      7 non-null      float64
 10  volume     7 non-null      float64
dtypes: float64(5), object(6)
memory usage: 748.0+ bytes

7. Conclusion

This notebook demonstrates a robust framework for normalizing cryptocurrency trading symbols across various exchanges. By standardizing diverse naming conventions into a unified canonical format, it addresses critical challenges in data integration, ensuring consistent and reliable analysis across different platforms. The implemented normalize_symbol function effectively handles common variations such as delimiter differences, exchange-specific prefixes, instrument type suffixes, and alias mapping, providing a clean and actionable representation of trading instruments. This standardization is crucial for accurate data aggregation, backtesting, and live trading strategies, preventing mismatches and enabling seamless data interoperability.