Today, we dive into a story that has defined data science for forty years: the Forecasting Paradox. It is the counterintuitive empirical observation that complex, sophisticated models consistently failed to beat simpler, more robust statistical methods at predicting the future. We track the paradox from its canonical proof through the massive M-Competitions, revealing how the debate over whether complexity buys accuracy was settled, and then immediately overturned, based entirely on the structure of the data.

Part I: The Sad Conclusion: When Simplicity Became Law

For decades, the intuitive assumption was that a complex model, capable of capturing intricate patterns, should inherently be a better predictor. The empirical evidence gathered by pioneers like Spyros Makridakis through the M-Competitions proved the opposite.

The M3-Competition (published in 2000) provided the foundational, canonical proof of the Forecasting Paradox. Testing 24 methods across 3,003 diverse real-world time series, the results delivered what was called the "sad conclusion": statistically sophisticated and complex methods, including early Artificial Neural Network (ANN) models, were consistently outperformed by simple approaches.

The ultimate victor of M3 was the relatively obscure Theta method. This decompositional approach, which breaks the series into simple trend and curvature lines, cemented the "simplicity mantra" as the dominant, evidence-based paradigm for nearly twenty years. The finding was later demystified by Rob J. Hyndman and Billah, who proved the Theta method is mathematically equivalent to Simple Exponential Smoothing (SES) with drift: a clever parameterization of one of the simplest methods available.

Why Simplicity Won: The Trap of Overfitting

The paradox held because of a crucial distinction: "goodness-of-fit" versus "forecasting accuracy". Complex models, with many parameters, inevitably fell into the central sin of overfitting. They learned not just the underlying true structure (the signal) but also the random historical fluctuations (the noise), mistaking error for a repeatable pattern. When forecasting, such a model projects this learned noise into the future, producing erratic and inaccurate predictions.

Simple models like exponential smoothing, by contrast, are mathematically incapable of modeling complex noise; they act as robust "noise filters". A 2018 PLOS ONE paper reaffirmed this, demonstrating that modern Machine Learning methods were "dominated" by traditional statistical benchmarks when tested on M3 data, often achieving better in-sample fit but worse out-of-sample accuracy.
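To make the goodness-of-fit versus forecasting-accuracy gap concrete, here is a small, hypothetical Python sketch (not from the episode or the PLOS ONE study): a flexible polynomial and a hand-rolled simple exponential smoother are fit to the same synthetic noisy trend, and their in-sample and out-of-sample errors are compared. All data, parameters, and names are illustrative.

```python
# Hypothetical illustration of in-sample fit vs. out-of-sample forecasting accuracy.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series: a mild trend (signal) plus noise.
n = 60
t = np.arange(n)
y = 0.3 * t + rng.normal(scale=3.0, size=n)
t_train, t_test = t[:48], t[48:]
y_train, y_test = y[:48], y[48:]

# "Complex" model: a high-order polynomial, flexible enough to chase the noise.
poly = np.polynomial.Polynomial.fit(t_train, y_train, deg=10)
fit_complex, fc_complex = poly(t_train), poly(t_test)

# "Simple" model: simple exponential smoothing (SES), written out explicitly.
alpha = 0.3
level = y_train[0]
fit_ses = []
for obs in y_train:
    fit_ses.append(level)                  # one-step-ahead in-sample forecast
    level = alpha * obs + (1 - alpha) * level
fc_ses = np.full_like(y_test, level)       # SES forecasts are flat at the last level

def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))

print(f"in-sample MAE      complex: {mae(y_train, fit_complex):.2f}  SES: {mae(y_train, np.array(fit_ses)):.2f}")
print(f"out-of-sample MAE  complex: {mae(y_test, fc_complex):.2f}  SES: {mae(y_test, fc_ses):.2f}")
```

The flexible model typically posts the better in-sample fit and the far worse out-of-sample error: exactly the overfitting pattern the M-Competitions kept finding.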
Part II: The Hybrid Breakthrough (M4)

The M4-Competition (2018), instigated by Makridakis to rigorously test modern ML, represented a massive escalation, using 100,000 independent time series.

The M4 results confirmed the persistence of the paradox: all six pure Machine Learning methods submitted performed poorly, failing to beat the simple statistical "Comb" benchmark (a simple average of three exponential smoothing methods).

However, the competition was won by Slawek Smyl's novel Exponential Smoothing-Recurrent Neural Network (ES-RNN) hybrid model. This marked a pivotal moment, signaling the end of the "forecasting winter" in which simple models reigned. The ES-RNN's success proved that the path forward was not simplicity or complexity, but a synthesis of the two. The model used classical statistical Exponential Smoothing to pre-process the data (handling seasonality and level) and hand a cleaner, more stationary residual series to the Recurrent Neural Network for complex modeling.

This refined the historical understanding: pure complex models still failed by overfitting the noise, but robust, noise-filtered hybrid models could finally surpass the simple benchmarks.
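As a schematic of the hybrid idea only, here is a hedged Python sketch (not Smyl's actual ES-RNN code, and assuming strictly positive data): exponential-smoothing-style components strip seasonality and level, leaving the kind of normalized residual series a recurrent network would then model. The function name and parameters are illustrative.

```python
# Schematic of the ES-style pre-processing step in a hybrid forecaster (illustrative only).
import numpy as np

def es_style_preprocess(y, season_length=12, alpha=0.3):
    """Return (residual, level, seasonal) where residual_t = y_t / (level_t * seasonal_t)."""
    y = np.asarray(y, dtype=float)

    # Crude multiplicative seasonal indices: each season position's mean, normalized to average 1.
    seasonal = np.array([y[i::season_length].mean() for i in range(season_length)])
    seasonal = seasonal / seasonal.mean()
    seasonal_t = seasonal[np.arange(len(y)) % season_length]

    # Simple exponential smoothing of the deseasonalized series gives a local level.
    deseasonalized = y / seasonal_t
    level = np.empty_like(deseasonalized)
    level[0] = deseasonalized[0]
    for t in range(1, len(y)):
        level[t] = alpha * deseasonalized[t] + (1 - alpha) * level[t - 1]

    residual = y / (level * seasonal_t)    # the cleaner, roughly stationary series handed to the RNN
    return residual, level, seasonal_t
```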
Part III: The Overturn: Complexity Wins in the Global Era (M5)

Just two years later, the M5-Competition (2020) delivered a complete and shocking reversal. The dataset was fundamentally different: 42,840 hierarchical, interrelated sales series from Walmart, critically including explanatory variables such as prices and promotions for the first time.

In this environment, the paradox was definitively broken. Pure Machine Learning methods dominated the leaderboard, with the winning entry an ensemble of LightGBM (Light Gradient Boosting Machine) models. These pure ML methods significantly outperformed all statistical benchmarks and their combinations.

The Triumph of the Global Model

The shift was enabled by a change in paradigm from "local" to "global" modeling.

* Local Models (The Old Way): Traditional statistical methods treated each of the 42,840 series independently, building a separate, "blind" model for each.
* Global Models (The New Way): The M5 winners trained LightGBM models on all 42,840 series simultaneously, pooling information instead of isolating each series.

This allowed for cross-learning: the complex model learned global patterns (for example, the general effect of a promotion, or price elasticity) and generalized that knowledge across different products and stores. The M5 problem was high-signal and multivariate, so the simple, local statistical models were too "dumb" to exploit the rich external features and cross-sectional information, allowing the complex global ML models to dominate.
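Here is a minimal, hypothetical sketch of the global-modeling idea with LightGBM. The long-format table, the column names (series_id, price, promo, lag_7, ...), and the hyperparameters are illustrative assumptions, not the M5 winner's pipeline.

```python
# Hypothetical "global" model: one LightGBM regressor trained across many series at once,
# so it can learn shared effects such as price and promotion (cross-learning).
import pandas as pd
import lightgbm as lgb

FEATURES = ["series_id", "price", "promo", "dayofweek", "lag_7", "lag_28"]

def train_global_model(df: pd.DataFrame) -> lgb.LGBMRegressor:
    """df is long-format: one row per (series, date) with FEATURES plus a 'sales' target."""
    df = df.copy()
    # Encoding the series identifier as a categorical feature is what lets a single model
    # share structure across thousands of products and stores.
    df["series_id"] = df["series_id"].astype("category")
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(df[FEATURES], df["sales"])
    return model

# Usage sketch: the same fitted model then forecasts any of the series,
# e.g. predictions = model.predict(future_rows[FEATURES])
```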
The Takeaway: A Context-Dependent Framework

The 40-year debate is resolved not as a binary choice, but as a context-dependent framework:

* The paradox is TRUE for local, univariate data: if you are forecasting a single, isolated time series with limited features (the M3/M4 context), simple statistical methods or robust statistical-ML hybrids (like ES-RNN) are the state-of-the-art choice. Pure ML will likely overfit and fail.
* The paradox is FALSE for global, multivariate data: if you have thousands of related series and rich external features (the M5 context), a global ML model (like LightGBM) is the essential tool for exploiting all available signals.

Fredrik's Pro-Tip: The Metric Wars

The evolution of the M-Competitions was not just about the models, but also about the rules of engagement. For decades, comparisons were built on metrics like the Mean Absolute Percentage Error (MAPE) and sMAPE. These metrics are now understood to be flawed: they explode or become undefined near zero values and are asymmetric, disproportionately punishing complex, volatile models. The magnitude of the M1-M3 paradox was likely an artifact of these flawed metrics. Rob J. Hyndman addressed this by introducing the Mean Absolute Scaled Error (MASE), a scale-free, symmetric, and robust metric. The later competitions adopted scaled errors: M4 ranked entries with OWA, which combines MASE with sMAPE, and M5 used the related WRMSSE, ensuring a fairer evaluation and creating the level playing field on which ML models could be accurately tested and eventually triumph in M5.
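For reference, here is a minimal Python sketch of MASE as commonly defined (Hyndman & Koehler, 2006); the function name and signature are illustrative.

```python
# Minimal sketch of the Mean Absolute Scaled Error (MASE).
import numpy as np

def mase(y_train, y_test, y_pred, m=1):
    """Forecast errors scaled by the in-sample MAE of a naive forecast
    (seasonal-naive when m > 1), so the score is scale-free:
    MASE < 1 beats the naive benchmark, MASE > 1 does not."""
    y_train, y_test, y_pred = (np.asarray(a, dtype=float) for a in (y_train, y_test, y_pred))
    naive_in_sample_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return float(np.mean(np.abs(y_test - y_pred)) / naive_in_sample_mae)
```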
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit frahlg.substack.com