December 16, 2024 · Data Engineering

Kallos Data: Cryptocurrency Market Intelligence Pipeline

Production-grade data engineering pipeline for cryptocurrency market intelligence. Automates daily OHLCV data collection, computes 30+ technical indicators, generates trading signals, and maintains a market-cap weighted crypto index with comprehensive error handling and database management.

Data Engineering · Python · PostgreSQL · ETL · CoinGecko API · Technical Analysis · Market Data

The Foundation of Every Trading System

Machine learning models and portfolio optimizers are only as good as their data. Unreliable prices, missing values, calculation errors in technical indicators—any of these quietly sabotage downstream performance while appearing to work. Kallos Data was built to eliminate these failure modes by implementing production-grade data engineering from the start.

This ETL pipeline serves as the data backbone for the entire Kallos trading system, pulling cryptocurrency market data from the CoinGecko API, computing 30+ technical indicators with proper warmup periods, and maintaining a dynamically-rebalanced market-cap weighted index. The infrastructure emphasizes reliability through comprehensive error handling, data quality validation, and idempotent operations that safely handle failures and retries.

The system processes over 15,000 cryptocurrencies daily, calculating features that feed machine learning models and portfolio optimization algorithms. Downstream systems depend on this infrastructure’s temporal consistency, numerical accuracy, and operational reliability—requirements that shaped every design decision.

Production ETL Architecture

The pipeline implements a modular design where specialized processors handle distinct data types. The coin list processor maintains the investable universe. The market data processor fetches daily OHLCV (open, high, low, close, volume) data with batch optimization and parallel API requests. The technical indicators processor computes momentum, trend, volatility, and volume metrics. Each processor inherits from base classes that provide database connections, logging, and error handling.
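A minimal sketch of that pattern, assuming an asyncpg connection pool; the class and method names (`BaseProcessor`, `MarketDataProcessor`, `run`, `process`) are illustrative rather than the repository's actual interfaces:

```python
import logging
from abc import ABC, abstractmethod

import asyncpg


class BaseProcessor(ABC):
    """Shared plumbing for every processor: database pool, logger, error handling."""

    def __init__(self, pool: asyncpg.Pool):
        self.pool = pool
        self.logger = logging.getLogger(self.__class__.__name__)

    async def run(self) -> None:
        """Template method: log any failure with full context, then let the caller decide."""
        try:
            await self.process()
        except Exception:
            self.logger.exception("processor failed")
            raise

    @abstractmethod
    async def process(self) -> None:
        ...


class MarketDataProcessor(BaseProcessor):
    async def process(self) -> None:
        # Fetch OHLCV for the active universe and upsert it (details elided in this sketch).
        ...
```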

The CoinGecko API client implements comprehensive resilience features: token bucket rate limiting respects API quotas, exponential backoff handles transient failures, and selective retry logic distinguishes between recoverable errors (429, 5xx status codes) and permanent failures (404, 401). The system logs every API call with structured JSON containing timing, status codes, and error details for monitoring and debugging.
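The following sketch shows how those three pieces fit together in aiohttp, under the assumption of a simple token bucket and per-attempt exponential backoff; the `TokenBucket` and `fetch_json` names and the exact status-code sets are illustrative, not the client's real API:

```python
import asyncio
import time
from typing import Optional

import aiohttp

RETRYABLE = {429, 500, 502, 503, 504}   # transient: back off and retry
PERMANENT = {401, 404}                   # permanent: fail fast, no retry


class TokenBucket:
    """Token-bucket limiter: roughly `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = float(capacity), time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep((1 - self.tokens) / self.rate)   # wait for the next token


async def fetch_json(session: aiohttp.ClientSession, url: str, bucket: TokenBucket,
                     params: Optional[dict] = None, max_retries: int = 5) -> dict:
    """GET with rate limiting, exponential backoff on transient errors, fail-fast on permanent ones."""
    for attempt in range(max_retries):
        await bucket.acquire()
        async with session.get(url, params=params) as resp:
            if resp.status == 200:
                return await resp.json()
            if resp.status in PERMANENT:
                raise RuntimeError(f"permanent failure {resp.status} for {url}")
            if resp.status in RETRYABLE:
                await asyncio.sleep(2 ** attempt)                # 1s, 2s, 4s, ...
                continue
            resp.raise_for_status()
    raise RuntimeError(f"retries exhausted for {url}")
```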

Database operations use PostgreSQL with async connections via asyncpg for non-blocking I/O. Upsert operations leverage ON CONFLICT DO UPDATE for idempotent inserts—the pipeline can safely re-run without corrupting data. Batch processing limits memory usage by handling 50 coins at a time with explicit garbage collection between batches. Connection pooling reuses database connections for efficiency.
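In asyncpg terms, the idempotent batch write looks roughly like this; the table and column names are assumptions based on the schema described later, and the row-tuple layout is illustrative:

```python
import gc
from typing import Sequence, Tuple

import asyncpg

UPSERT_SQL = """
    INSERT INTO daily_market_data (coin_id, date, open, high, low, close, volume)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
    ON CONFLICT (coin_id, date)
    DO UPDATE SET open = EXCLUDED.open, high = EXCLUDED.high, low = EXCLUDED.low,
                  close = EXCLUDED.close, volume = EXCLUDED.volume;
"""


async def upsert_batches(pool: asyncpg.Pool, rows: Sequence[Tuple], batch_size: int = 50) -> None:
    """Idempotent upsert in fixed-size batches; re-running never duplicates or corrupts rows."""
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        async with pool.acquire() as conn:            # pooled connection, reused across batches
            async with conn.transaction():            # all-or-nothing per batch
                await conn.executemany(UPSERT_SQL, batch)
        gc.collect()                                  # release per-batch buffers before the next chunk
```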

Technical Indicators with Proper Context

The system computes over 30 technical indicators across trend (EMA, ADX), momentum (RSI, MACD), volatility (Bollinger Bands, ATR), and volume (MFI, CMF) categories. Beyond standard indicators, the pipeline generates 10 composite signals that capture relationships between base indicators: EMA differentials quantify trend strength, RSI deviations measure momentum extremes, MACD histograms show momentum acceleration, and volatility ratios compare current to historical dispersion.
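As a rough illustration of how such composites can be derived with pandas-ta, assuming a DataFrame with `open`/`high`/`low`/`close` columns; the output column names and window lengths are hypothetical, not the pipeline's actual feature set:

```python
import pandas as pd
import pandas_ta as ta


def composite_signals(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a few composite features from base indicators (column names are illustrative)."""
    out = pd.DataFrame(index=df.index)

    # Trend strength: fast/slow EMA differential, normalised by price.
    ema_fast = ta.ema(df["close"], length=12)
    ema_slow = ta.ema(df["close"], length=26)
    out["ema_diff_pct"] = (ema_fast - ema_slow) / df["close"]

    # Momentum extreme: RSI deviation from its neutral midpoint of 50.
    out["rsi_deviation"] = ta.rsi(df["close"], length=14) - 50

    # Momentum acceleration: MACD histogram (MACD line minus signal line).
    macd = ta.macd(df["close"])
    out["macd_hist"] = macd.filter(like="MACDh").iloc[:, 0]

    # Volatility regime: short-window ATR relative to its longer-run average.
    atr = ta.atr(df["high"], df["low"], df["close"], length=14)
    out["vol_ratio"] = atr / atr.rolling(90).mean()

    return out
```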

Critical to reliability: every indicator calculation includes a warmup period. When calculating indicators for the past 60 days, the system loads 120 days of raw data, computes indicators across the full period, then trims the first 60 days to eliminate unstable initial values. This prevents artifacts from insufficient lookback windows that would bias model training or signal generation.
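A minimal sketch of the warmup-and-trim idea, assuming daily OHLCV with a DatetimeIndex, oldest rows first; the specific indicators shown are placeholders:

```python
import pandas as pd
import pandas_ta as ta


def indicators_with_warmup(raw: pd.DataFrame, target_days: int = 60, warmup_days: int = 60) -> pd.DataFrame:
    """Compute indicators over target + warmup rows, then drop the unstable warmup prefix."""
    window = raw.tail(target_days + warmup_days)       # e.g. 120 days loaded when 60 are wanted
    feats = pd.DataFrame(index=window.index)
    feats["rsi_14"] = ta.rsi(window["close"], length=14)
    feats["ema_26"] = ta.ema(window["close"], length=26)
    return feats.tail(target_days)                     # keep only the rows past the warmup period
```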

Data quality validation runs at multiple stages. OHLC relationships get verified (high ≥ low, high ≥ open/close, low ≤ open/close). Indicator values check against theoretical bounds (RSI 0-100, Stochastic 0-100, ATR > 0). Chronological ordering ensures temporal consistency. The system logs validation failures with context for diagnosis but continues processing valid data rather than failing entirely on edge cases.
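A compact sketch of those checks as a row mask, assuming a DatetimeIndex and an optional `rsi_14` column; the helper name `validate_ohlc` and exact rules are illustrative:

```python
import pandas as pd


def validate_ohlc(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows passing basic OHLC and bound checks; callers log and skip the rest."""
    ok = (
        (df["high"] >= df["low"])
        & (df["high"] >= df[["open", "close"]].max(axis=1))
        & (df["low"] <= df[["open", "close"]].min(axis=1))
        & (df[["open", "high", "low", "close"]] > 0).all(axis=1)
    )
    if "rsi_14" in df:
        ok &= df["rsi_14"].between(0, 100) | df["rsi_14"].isna()   # theoretical bounds
    ok &= df.index.to_series().diff().dt.days.fillna(1) > 0        # strictly increasing dates
    return ok
```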

Market-Cap Weighted Index Construction

The pipeline maintains a dynamically-rebalanced cryptocurrency index that serves as the performance benchmark for trading strategies. Every month, the system selects the top 20 cryptocurrencies by market capitalization, applies quality filters (3+ years history, 90%+ data coverage), calculates market-cap proportional weights, then applies iterative capping to prevent single-asset dominance beyond 35%.
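The capping step can be sketched as an iterative redistribution, assuming the market caps have already been filtered to the eligible top 20; `capped_weights` is an illustrative helper, not the repository's function:

```python
import pandas as pd


def capped_weights(market_caps: pd.Series, cap: float = 0.35) -> pd.Series:
    """Market-cap proportional weights with iterative capping so no constituent exceeds `cap`."""
    weights = market_caps / market_caps.sum()
    for _ in range(len(weights)):                     # converges in at most N passes
        over = weights > cap
        if not over.any():
            break
        excess = (weights[over] - cap).sum()          # weight removed from capped names
        weights[over] = cap
        under = ~over
        weights[under] += excess * weights[under] / weights[under].sum()  # redistribute pro rata
    return weights
```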

Between monthly rebalances, weights drift naturally with price performance—if Bitcoin outperforms, its weight increases until the next rebalance resets to target weights. This mirrors real index behavior and minimizes turnover costs. The weight drift calculation tracks constituent weights daily, computing index returns as the weighted sum of constituent returns using previous-day weights.
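In pseudocode terms, the drift-and-return calculation between rebalances might look like the following, assuming daily simple returns with one column per constituent aligned to the rebalance weights:

```python
import pandas as pd


def index_returns(constituent_returns: pd.DataFrame, rebalance_weights: pd.Series) -> pd.Series:
    """Daily index return as the weighted sum of constituent returns, letting weights drift."""
    weights = rebalance_weights.copy()
    out = []
    for _, rets in constituent_returns.iterrows():
        out.append((weights * rets).sum())            # previous-day weights applied to today's returns
        weights = weights * (1 + rets)                # drift with performance...
        weights = weights / weights.sum()             # ...then renormalise to sum to one
    return pd.Series(out, index=constituent_returns.index)
```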

This index provides the passive benchmark that machine learning strategies must beat after transaction costs. During the 2023-2024 study period, the index achieved higher returns than actively optimized portfolios, revealing that concentrated exposure to Bitcoin and Ethereum captured bull market momentum more effectively than diversification—a finding that directly informed research conclusions about ML strategy performance.

Database Design for Time-Series Queries

The schema separates concerns across specialized tables: coins stores asset metadata, daily_market_data holds OHLCV records, daily_technical_indicators contains computed features, and index_monthly_constituents tracks index composition. Composite indexes optimize time-series queries (coin_id, date DESC), unique constraints prevent duplicates, and foreign keys enforce referential integrity.
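For flavour, a stripped-down version of two of those tables as they might be created from Python via asyncpg; the exact column sets, precisions, and index names are assumptions consistent with the description above, not the full production schema:

```python
import asyncpg

SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS coins (
    coin_id TEXT PRIMARY KEY,
    symbol  TEXT,
    name    TEXT
);
CREATE TABLE IF NOT EXISTS daily_market_data (
    coin_id  TEXT NOT NULL REFERENCES coins (coin_id),
    date     DATE NOT NULL,
    open     NUMERIC(20, 8),
    high     NUMERIC(20, 8),
    low      NUMERIC(20, 8),
    close    NUMERIC(20, 8),
    volume   NUMERIC(20, 8),
    UNIQUE (coin_id, date)
);
CREATE INDEX IF NOT EXISTS idx_market_data_coin_date
    ON daily_market_data (coin_id, date DESC);
"""


async def init_schema(dsn: str) -> None:
    """Create the core tables and the composite time-series index if they do not exist."""
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute(SCHEMA_SQL)
    finally:
        await conn.close()
```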

Numeric precision uses 8 decimal places for prices and 6 for percentages, handling cryptocurrency’s wide value range from Bitcoin at $60,000 to micro-cap tokens at $0.000001. The asyncpg driver provides connection pooling, transaction management, and prepared statements for efficiency. Async operations allow concurrent database writes during batch processing without blocking.

The complete schema includes materialized views for common queries and stored procedures for complex aggregations. Database indexes support both analytical queries (full table scans with date filters) and operational queries (single asset lookups). This dual optimization enables both research backtesting and real-time model serving.

Operational Reliability Through Design

The pipeline runs daily via cron scheduling, executing a fixed workflow: update the coin universe weekly, fetch daily market data for active assets, calculate technical indicators in parallel, generate trading signals based on indicator combinations, and update the crypto index on month-end. JSON structured logging captures every step with timestamps, durations, coin IDs, and success/failure status.
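A sketch of that orchestration loop with structured JSON logging, assuming the real processors are passed in as async callables; the step names and logging fields are illustrative:

```python
import json
import logging
import time
from datetime import date, timedelta

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("kallos.etl")


async def run_step(name: str, coro) -> None:
    """Run one pipeline stage and emit a structured JSON log line either way."""
    start = time.monotonic()
    try:
        await coro
        payload = {"step": name, "status": "success"}
    except Exception as exc:                          # a failed stage is logged; later stages still run
        payload = {"step": name, "status": "failure", "error": str(exc)}
    payload["duration_s"] = round(time.monotonic() - start, 2)
    log.info(json.dumps(payload))


async def daily_run(steps: dict) -> None:
    """`steps` maps stage names to async callables supplied by the real processors."""
    today = date.today()
    if today.weekday() == 0:                          # weekly universe refresh
        await run_step("update_coin_universe", steps["update_coin_universe"]())
    await run_step("fetch_market_data", steps["fetch_market_data"]())
    await run_step("compute_indicators", steps["compute_indicators"]())
    await run_step("generate_signals", steps["generate_signals"]())
    if (today + timedelta(days=1)).day == 1:          # month-end index update
        await run_step("update_index", steps["update_index"]())
```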

Error handling distinguishes between recoverable and fatal failures. Missing data for a single asset logs a warning while the remaining assets continue processing. API rate limit violations trigger exponential backoff and retry. Database constraint violations indicate data quality issues and abort the batch. Every failure is logged with context (which coin, what operation, what error), enabling rapid diagnosis.

The modular architecture enables testing components in isolation. Processors expose interfaces that accept data rather than always fetching from APIs, allowing unit tests with mock data. The database manager implements transaction rollback on errors, preventing partial writes. Idempotent operations mean failed runs can simply re-execute without manual cleanup.
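Because processors accept plain DataFrames, edge cases can be exercised without touching the API or the database; this hypothetical test reuses the `validate_ohlc` sketch from above and is not from the repository's actual test suite:

```python
import pandas as pd


def test_validate_ohlc_flags_inverted_bars():
    """An inverted bar (high < low) should be rejected; a well-formed bar should pass."""
    df = pd.DataFrame(
        {"open": [10.0, 10.0], "high": [12.0, 9.0], "low": [9.0, 11.0], "close": [11.0, 10.0]},
        index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
    )
    mask = validate_ohlc(df)          # helper sketched earlier in this write-up
    assert mask.tolist() == [True, False]
```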

Lessons from Production Operation

Building reliable financial data infrastructure revealed several insights. First, API reliability varies significantly—the pipeline handles 2-3% failure rates on individual coin requests gracefully through retries and logging. Second, data validation catches surprising edge cases: negative prices from API bugs, OHLC violations from exchange data errors, week-long gaps in illiquid assets. Third, memory management matters—processing 15,000 assets requires explicit batching and garbage collection to prevent OOM errors.

The most important lesson: downstream systems amplify upstream errors. A single day of bad indicator calculations can corrupt ML model training for weeks. Missing values in covariance matrices cause portfolio optimization failures. The data layer’s reliability directly determines system-wide robustness.

Technologies

Data Collection: Python 3.8+, aiohttp, CoinGecko Pro API

Processing: pandas, pandas-ta (130+ indicators), NumPy

Storage: PostgreSQL 13+, asyncpg

Infrastructure: JSON structured logging, dotenv configuration management

Explore the Pipeline

This production-grade ETL infrastructure demonstrates best practices for financial data engineering: comprehensive error handling, multi-layer validation, scalable architecture, and seamless integration with downstream ML and portfolio systems. The complete implementation including the API client, processor modules, and database schema is available on GitHub.

View Repository →


Part of the Kallos trading system research, providing the reliable data foundation that enables machine learning forecasting and portfolio optimization with confidence in data quality and temporal consistency.