TY  - JOUR
T1  - All That Glitters Is Not Gold: Comparing Backtest and Out-of-Sample Performance on a Large Cohort of Trading Algorithms
JF  - The Journal of Investing
SP  - 69
LP  - 80
DO  - 10.3905/joi.2016.25.3.069
VL  - 25
IS  - 3
AU  - Thomas Wiecki
AU  - Andrew Campbell
AU  - Justin Lent
AU  - Jessica Stauth
Y1  - 2016/08/31
UR  - https://pm-research.com/content/25/3/69.abstract
N2  - When automated trading strategies are developed and evaluated using backtests on historical pricing data, there exists a tendency to overfit to the past. Using a unique dataset of 888 algorithmic trading strategies developed and backtested on the Quantopian platform, with at least six months of out-of-sample performance, this article studies the prevalence and impact of backtest overfitting. Specifically, the authors find that commonly reported backtest evaluation metrics, such as the Sharpe ratio, offer little value in predicting out-of-sample performance (R2 < 0.025). In contrast, higher-order moments, such as volatility and maximum drawdown, as well as portfolio construction features (e.g., hedging), show significant predictive value of relevance to quantitative finance practitioners. Moreover, in line with prior theoretical considerations, the authors find empirical evidence of overfitting - the more backtesting a quant has done for a strategy, the larger the discrepancy between backtest and out-of-sample performance. Finally, they show that by training nonlinear, machine-learning classifiers on a variety of features that describe backtest behavior, out-of-sample performance can be predicted with much greater accuracy (R2 = 0.17) on hold-out data than when using linear, univariate features. A portfolio constructed by using predictions on hold-out data performed significantly better out-of-sample than one constructed from algorithms with the highest backtest Sharpe ratios.
KW  - Statistical methods
KW  - Portfolio construction
KW  - Portfolio theory
ER  - 