Kallos Portfolios: Cryptocurrency Portfolio Optimization Framework
Advanced portfolio optimization framework comparing GRU-based machine learning forecasts against traditional mean-variance and market-cap-weighted strategies. Features quarterly model rotation, rigorous statistical testing, and production-ready backtesting infrastructure.
Does Machine Learning Actually Improve Portfolio Construction?
This is the question that quantitative researchers debate endlessly: can neural network forecasts translate into better investment decisions? Academic papers showcase impressive prediction accuracy, but in portfolio management, what matters is risk-adjusted returns after transaction costs. Kallos Portfolios was built to answer this question rigorously.
The framework constructs and compares three cryptocurrency portfolios over 18 months: one using GRU neural network forecasts, one using traditional historical averages, and one passively tracking market-cap weights. All three employ identical optimization constraints, isolating the value of machine learning predictions from other portfolio construction decisions. Statistical hypothesis testing determines whether observed performance differences could have occurred by chance.
The surprising finding? Better forecasts don’t automatically mean better portfolios. Despite the GRU models achieving 20% lower prediction error, the resulting portfolios only slightly outperformed traditional approaches—and that difference wasn’t statistically significant. This result matters more than unambiguous success would have, revealing exactly where forecasting improvements get lost in translation to trading performance.
The Three-Strategy Comparison
The research design controls for everything except the return forecasting method. Strategy One uses quarterly-trained GRU models that predict 30-day forward returns. The models automatically rotate each quarter to prevent overfitting and adapt to regime changes—Q4 2022 models handle January-March 2023 rebalancing, Q1 2023 models handle April-June, and so on. This operational realism reflects how production systems must continuously retrain to maintain performance.
Strategy Two implements textbook mean-variance optimization using 252-day historical returns. This serves as the fair comparison baseline—identical constraints, identical optimization algorithm, but substituting historical averages for neural network forecasts. Strategy Three simply weights assets by market capitalization, representing the passive “do nothing” approach. All three strategies rebalance monthly, face identical transaction costs (10 basis points), and respect the same position limits.
The portfolio optimizer maximizes Sharpe ratio subject to realistic constraints: no position exceeds 35% of capital, at least three assets must be held, and weights sum to one (fully invested, no leverage). The covariance matrix gets estimated from the same 252-day window across all strategies, ensuring differences arise solely from return forecasts. This controlled design isolates the machine learning contribution.
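As a rough illustration, the shared optimization step could be expressed with PyPortfolioOpt along these lines; the function name and data handling are illustrative, and the minimum-holdings rule is noted rather than encoded:

```python
# Sketch: shared max-Sharpe optimization under the constraint set described above.
# `mu` is whatever the active strategy produced: GRU forecasts, trailing means,
# or market-cap-implied returns.
import pandas as pd
from pypfopt import EfficientFrontier, risk_models

def optimize_weights(prices_252d: pd.DataFrame, mu: pd.Series) -> dict:
    # Covariance estimated from the same 252-day price window for every strategy.
    cov = risk_models.sample_cov(prices_252d)

    # Long-only, fully invested, no position above 35% of capital.
    ef = EfficientFrontier(mu, cov, weight_bounds=(0.0, 0.35))
    ef.max_sharpe(risk_free_rate=0.0)

    # Drop dust weights; the framework additionally enforces at least three holdings.
    return ef.clean_weights(cutoff=1e-4)
```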
Production Architecture for Real-World Deployment
The system architecture reflects production engineering principles, not academic convenience. All three strategies inherit from a base class that implements the common backtesting workflow—loading rebalancing dates, retrieving the investable universe, calculating period returns, and aggregating results. Subclasses override a single method: get_expected_returns_for_date(). The GRU strategy loads quarterly models and generates predictions. The historical strategy calculates trailing averages. The market-cap strategy retrieves capitalization weights.
This inheritance-based design eliminates code duplication while maintaining flexibility. Adding a new strategy requires implementing one method. The base class handles database connections, performance calculation, transaction cost modeling, and result persistence. When the GRU predictor loads models, it caches them to avoid redundant disk I/O during monthly rebalancing. Database operations use asyncpg for non-blocking concurrent access.
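A condensed sketch of that hierarchy (only get_expected_returns_for_date comes from the description above; the other method names and placeholder bodies are illustrative):

```python
from abc import ABC, abstractmethod
import pandas as pd

class BaseStrategy(ABC):
    """Owns the common backtest workflow; subclasses supply only the forecasts."""

    def __init__(self, rebalance_dates: list[pd.Timestamp]):
        self.rebalance_dates = rebalance_dates

    @abstractmethod
    def get_expected_returns_for_date(self, date: pd.Timestamp) -> pd.Series:
        """One expected-return figure per investable asset for this rebalance date."""

    def run_backtest(self) -> list[dict]:
        results = []
        for date in self.rebalance_dates:
            mu = self.get_expected_returns_for_date(date)
            weights = self._optimize(mu, date)              # shared max-Sharpe step
            results.append(self._evaluate(weights, date))   # returns net of 10 bps costs
        return results

    def _optimize(self, mu: pd.Series, date: pd.Timestamp) -> pd.Series:
        ...  # identical constraints and covariance window for every strategy

    def _evaluate(self, weights: pd.Series, date: pd.Timestamp) -> dict:
        ...  # period performance, turnover, transaction costs

class HistoricalStrategy(BaseStrategy):
    def get_expected_returns_for_date(self, date: pd.Timestamp) -> pd.Series:
        ...  # trailing 252-day mean return per asset
```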
The optimizer includes sophisticated preprocessing for cryptocurrency data quality issues. Extreme returns get capped at ±500% to prevent numerical instability in the covariance matrix. Weekend gaps get forward-filled. The system verifies the covariance matrix is positive semi-definite, adding diagonal regularization if negative eigenvalues appear. Standard optimization libraries assume clean data; cryptocurrency markets violate those assumptions, requiring domain-specific handling.
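In code, that preprocessing amounts to a few guarded steps; the sketch below uses the thresholds described above, with illustrative function names:

```python
import numpy as np
import pandas as pd

def clean_returns(returns: pd.DataFrame) -> pd.DataFrame:
    """Cap extreme observations and fill gaps before covariance estimation."""
    capped = returns.clip(lower=-5.0, upper=5.0)  # cap returns at ±500%
    return capped.ffill()                         # forward-fill weekend gaps

def ensure_psd(cov: np.ndarray, epsilon: float = 1e-8) -> np.ndarray:
    """Add diagonal regularization if the covariance matrix has negative eigenvalues."""
    min_eig = np.linalg.eigvalsh(cov).min()
    if min_eig < 0:
        cov = cov + (abs(min_eig) + epsilon) * np.eye(cov.shape[0])
    return cov
```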
Rigorous Statistical Validation
Performance evaluation goes far beyond comparing Sharpe ratios. The framework implements four hypothesis tests for pairwise strategy comparison: paired t-tests for mean return differences, F-tests for variance equality, Kolmogorov-Smirnov tests for distribution similarity, and stochastic dominance tests for preference ordering. Every claimed performance difference must survive statistical scrutiny at p < 0.05 significance.
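A compact sketch of the pairwise testing with SciPy; the stochastic dominance test follows the same pattern and is omitted here, and the effect-size line computes Cohen's d on the paired differences:

```python
import numpy as np
from scipy import stats

def compare_strategies(returns_a: np.ndarray, returns_b: np.ndarray) -> dict:
    """Pairwise hypothesis tests on two aligned period-return series."""
    # Paired t-test: is the mean return difference distinguishable from zero?
    t_stat, t_p = stats.ttest_rel(returns_a, returns_b)

    # F-test for equality of variances (two-sided, via the variance ratio).
    f_stat = np.var(returns_a, ddof=1) / np.var(returns_b, ddof=1)
    df = len(returns_a) - 1
    f_p = 2 * min(stats.f.cdf(f_stat, df, df), stats.f.sf(f_stat, df, df))

    # Kolmogorov-Smirnov: do the two return distributions differ in shape?
    ks_stat, ks_p = stats.ks_2samp(returns_a, returns_b)

    # Cohen's d on the paired differences (effect size).
    diff = returns_a - returns_b
    cohens_d = diff.mean() / diff.std(ddof=1)

    return {
        "t_test": (t_stat, t_p),
        "f_test": (f_stat, f_p),
        "ks_test": (ks_stat, ks_p),
        "cohens_d": cohens_d,
    }
```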
The t-test revealed the critical finding: while GRU portfolios achieved 4% higher Sharpe ratios than historical optimization (1.12 vs 1.08), the hypothesis test returned p = 0.18. A difference of that size would show up by chance roughly 18% of the time even if the two strategies were truly equivalent, far above the 5% threshold required to claim genuine outperformance. Effect size analysis showed Cohen's d = 0.22, indicating a small practical difference even if it were statistically significant.
More surprisingly, both GRU and historical strategies underperformed the passive market-cap benchmark (Sharpe ratio 1.21) during the 2023-2024 study period. Tests confirmed this difference as statistically significant (p = 0.007). The benchmark's concentrated exposure to Bitcoin and Ethereum captured momentum during the bull market more effectively than diversified optimized portfolios. This finding challenges the assumption that sophisticated optimization always beats passive indexing.
Quarterly Model Rotation and Temporal Integrity
The GRU prediction pipeline demonstrates critical operational realism often missing from academic research. Models train on quarterly windows and automatically deploy based on rebalancing dates. January rebalancing uses the Q4 2022 model. April uses Q1 2023. This rotation prevents the performance degradation observed when static models face shifting market dynamics.
During inference, the predictor loads historical features from the database, normalizes them using the fitted scaler from training, and generates 90-day forecasts from the GRU. The system extracts the 30-day forward prediction and feeds it into portfolio optimization. Model and scaler pairs get cached after first load, avoiding redundant disk I/O. The entire pipeline respects temporal boundaries—forecasts for May 2023 use only data available through April 2023.
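The rotation logic reduces to mapping each rebalance date onto the most recently completed training quarter and caching each loaded model/scaler pair; a minimal sketch, assuming a file layout and loader names that are illustrative:

```python
from functools import lru_cache
import joblib
import pandas as pd
import torch

def model_quarter_for(date: pd.Timestamp) -> str:
    """Map a rebalance date to the most recently completed training quarter."""
    prev_q_end = date - pd.offsets.QuarterEnd(1)   # e.g. 2023-01-15 -> 2022-12-31
    return f"{prev_q_end.year}Q{prev_q_end.quarter}"

@lru_cache(maxsize=8)
def load_model_and_scaler(quarter: str):
    """Load each quarterly GRU/scaler pair from disk once, then serve it from cache."""
    model = torch.load(f"models/gru_{quarter}.pt")         # hypothetical path layout
    scaler = joblib.load(f"models/scaler_{quarter}.pkl")   # scaler fitted on training data only
    model.eval()
    return model, scaler
```

Under this mapping, January 2023 resolves to the 2022Q4 model and April 2023 to 2023Q1, matching the rotation schedule described above.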
This temporal discipline revealed an important operational cost: models need quarterly retraining to maintain effectiveness. A single train/test split would have concealed this requirement, producing artificially stable results. The walk-forward design surfaces the real computational and infrastructure costs of keeping models current in production.
Where Forecasting Improvements Get Lost
The gap between 20% better predictions and statistically insignificant portfolio improvements teaches several lessons. Transaction costs provide the simplest explanation: higher forecast accuracy led to more frequent rebalancing, consuming the advantage in trading fees. The GRU strategy’s 23% higher turnover translated to 4.6% annual costs versus 3.2% for the benchmark.
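As a back-of-the-envelope check on this effect, the cost of a single rebalance is simply the traded weight times the per-trade fee; a minimal sketch, assuming the 10 basis point figure above:

```python
import pandas as pd

def rebalance_cost(old_w: pd.Series, new_w: pd.Series, fee_bps: float = 10.0) -> float:
    """Cost of one rebalance: total traded weight times the per-trade fee."""
    turnover = new_w.sub(old_w, fill_value=0.0).abs().sum()
    return turnover * fee_bps / 10_000
```

Compounded over twelve monthly rebalances, even modest differences in turnover accumulate into the annual cost gap reported above.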
More subtle is objective misalignment: the models were trained to minimize RMSE and maximize directional accuracy, but portfolios optimize Sharpe ratio, a different goal entirely. Better point forecasts don't automatically translate into better risk-adjusted returns when those forecasts feed into a mean-variance optimizer that is just as sensitive to covariance estimates as to expected returns. Future research should explore training models directly on portfolio objectives rather than prediction metrics.
The strong bull market during the study period also obscured risk management benefits. When nearly everything rises, concentration beats diversification and momentum strategies shine. The GRU system’s 6.6 percentage point lower maximum drawdown (32.1% vs 38.7%) matters more in bear markets or high-volatility regimes. Different market conditions might reveal stronger ML advantages.
Professional Tooling and Reporting
The framework integrates with vectorbt for realistic backtesting including slippage and transaction costs, PyPortfolioOpt for mean-variance optimization with advanced constraints, and QuantStats for professional tearsheet generation. Each strategy produces comprehensive HTML reports with cumulative returns charts, rolling Sharpe ratios, drawdown analysis, and monthly return heatmaps.
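Per-strategy report generation could look like the following with QuantStats (function name and output location are illustrative):

```python
import pandas as pd
import quantstats as qs

def write_tearsheet(strategy_returns: pd.Series, benchmark_returns: pd.Series, name: str) -> None:
    """Render an HTML tearsheet: cumulative returns, rolling Sharpe, drawdowns, monthly heatmap."""
    qs.reports.html(
        strategy_returns,
        benchmark=benchmark_returns,
        output=f"reports/{name}.html",   # hypothetical output path
        title=name,
    )
```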
The comparative analysis outputs a unified JSON document containing all performance metrics and hypothesis test results—t-statistics, p-values, confidence intervals, effect sizes, and dominance relationships. This structured output enables systematic analysis across multiple backtest runs or parameter configurations. Database integration uses async operations with connection pooling for efficient data access.
Technologies
Core Stack: Python 3.8+, PyTorch, PostgreSQL, asyncpg
Optimization & Backtesting: PyPortfolioOpt, CVXPY, vectorbt, QuantStats
Scientific Computing: pandas, NumPy, SciPy, scikit-learn
Explore the Framework
This framework demonstrates end-to-end quantitative portfolio management—from machine learning predictions through optimization, rigorous backtesting, and statistical validation. The complete implementation, including the inheritance-based simulator architecture, statistical testing suite, and async database operations, is available on GitHub.
Part of the Kallos trading system research, exploring whether deep learning improves cryptocurrency portfolio construction through controlled experimentation and rigorous hypothesis testing.