Executive Summary
Neural network training proceeds through distinct regimes whose geometric properties can be detected and exploited. This paper introduces Leap+Verify, a speculative weight prediction framework that identifies three training regimes — chaotic, transition, and stable — using activation-space cosine similarity as a Lyapunov proxy. By adapting its prediction strategy to the current regime, the framework achieves significant training acceleration without sacrificing convergence guarantees.
The central empirical finding is a universal momentum catastrophe: Adam-style momentum extrapolation produces weight predictions whose error is 100 to 10,000 times larger than that of a naive finite-difference baseline, across all regimes and model scales. This result rules out the most intuitive approach to speculative weight prediction and motivates the use of finite-difference predictors, which achieve 67-90% strict acceptance rates at 7B scale.
Validation spans two model families at three scales (GPT-2 124M, Qwen 2.5-1.5B, and Qwen 2.5-7B), demonstrating that the regime detection mechanism and predictor performance generalize across architectures. The framework establishes activation geometry as a practical signal for training-time optimization, connecting speculative execution concepts from systems engineering to the dynamics of gradient descent.
Key Contributions and Methodology
The paper makes three primary contributions. First, it establishes that training dynamics partition naturally into three regimes detectable via a single scalar metric: the cosine similarity between consecutive activation fingerprints. This metric serves as a Lyapunov proxy, providing a computationally cheap signal for regime classification without requiring full spectral analysis of the loss landscape.
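The detector can be sketched in a few lines. This is an illustrative reading of the metric, not the paper's implementation: the "fingerprint" is assumed to be a flattened activation vector from a fixed probe batch, and the regime thresholds in `classify_regime` are hypothetical placeholders, not values reported in the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation fingerprints."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def classify_regime(sim, stable_thresh=0.99, chaotic_thresh=0.9):
    """Map the Lyapunov-proxy value to a regime label.

    Thresholds are illustrative placeholders; the paper's actual
    boundaries would be calibrated from the training runs.
    """
    if sim >= stable_thresh:
        return "stable"      # geometry nearly frozen between steps
    if sim < chaotic_thresh:
        return "chaotic"     # fingerprints decorrelate step to step
    return "transition"      # geometry changing but still correlated
```

The appeal of this signal is cost: one dot product and two norms per step, versus the eigenvalue computations a direct curvature measurement would require.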
Second, it documents the universal momentum catastrophe. Across every model scale and training phase tested, extrapolating weights using Adam's first-moment estimates produces predictions that diverge catastrophically from the true trajectory. This finding has implications beyond the speculative prediction setting — it suggests that momentum-based intuitions about weight-space trajectories are fundamentally misleading.
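The intuition behind the finding can be seen on a toy trajectory. The sketch below is not the paper's experiment: it uses an artificial, locally linear weight path, and `m_hat` is a random stand-in for Adam's bias-corrected first moment rather than real optimizer state. It only illustrates why a two-point finite difference tracks a smooth trajectory exactly while stepping along a stale moment estimate need not.

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=8)                    # constant per-step weight drift
w_hist = [t * v for t in range(3)]        # w_0, w_1, w_2 on a linear path
w_true = 3 * v                            # the true next-step weights, w_3

# Finite-difference predictor: extrapolate the last two weight snapshots.
fd_pred = w_hist[-1] + (w_hist[-1] - w_hist[-2])

# Momentum-style predictor: step along a (stale) first-moment estimate.
lr = 1e-3
m_hat = rng.normal(size=8)                # stand-in for m / (1 - beta1**t)
mom_pred = w_hist[-1] - lr * m_hat

fd_err = np.linalg.norm(fd_pred - w_true)     # ~0 on a linear trajectory
mom_err = np.linalg.norm(mom_pred - w_true)   # order ||v||: it barely moves
```

On this toy path the finite-difference leap is exact, while the momentum step lands near the current weights and misses the drift entirely; the paper's contribution is showing that a gap of this character persists, at 100-10,000x, on real training trajectories.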
Third, the paper develops and validates regime-adaptive finite-difference predictors. These predictors use recent weight history rather than optimizer state to forecast future weights, achieving acceptance rates sufficient for practical training acceleration. The verification mechanism ensures that any rejected prediction is simply replaced by the standard gradient step, preserving convergence guarantees.
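A minimal sketch of the accept/reject logic, under stated assumptions: `leap_verify_step` and its strict relative-error test are hypothetical names and criteria, not the paper's exact rule, and here the verifying gradient step is computed directly for clarity, whereas a real system would have to amortize or overlap that cost for the leap to yield a speedup.

```python
import numpy as np

def leap_verify_step(w_hist, grad_step, tol=1e-2):
    """One illustrative Leap+Verify update.

    w_hist:    list of recent weight vectors, oldest first (>= 2 entries).
    grad_step: callable mapping current weights -> true next-step weights.
    Returns (next_weights, accepted).
    """
    # Finite-difference leap from the last two weight snapshots.
    leap = w_hist[-1] + (w_hist[-1] - w_hist[-2])
    # Verification: the standard gradient step (computed directly here).
    w_true = grad_step(w_hist[-1])
    rel_err = np.linalg.norm(leap - w_true) / (np.linalg.norm(w_true) + 1e-12)
    accepted = rel_err < tol          # strict acceptance check
    # On rejection, fall back to the true step: convergence is untouched.
    return (leap if accepted else w_true), accepted
```

For example, on a smoothly drifting trajectory the leap matches the true step and is accepted; if the dynamics turn abruptly, the check fails and the update degrades gracefully to plain gradient descent, which is what preserves the convergence guarantees.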
Key Findings
- Three training regimes: Chaotic (high activation variance, low cosine similarity), transition (rapid geometry change), and stable (cosine similarity approaching 1.0) phases are consistently detectable across all model scales
- Universal momentum catastrophe: Adam extrapolation error is 100-10,000x worse than naive finite-difference prediction across all regimes, scales, and architectures
- Strict acceptance at scale: Finite-difference predictors achieve 67-90% strict acceptance rates at 7B parameters, with acceptance rates increasing in the stable regime
- Architecture invariance: Regime boundaries and predictor performance generalize across GPT-2 and Qwen model families without architecture-specific tuning
- Lyapunov proxy validity: Activation-space cosine similarity correlates strongly with loss-landscape curvature measures, validating its use as a cheap regime detector
Key References
Speculative Decoding with Big Little Models. International Conference on Learning Representations.
Adam: A Method for Stochastic Optimization. International Conference on Learning Representations.
The Large Learning Rate Phase of Neural Network Training. arXiv preprint.
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. International Conference on Learning Representations.
The Break-Even Point on Optimization Trajectories of Deep Neural Networks. International Conference on Learning Representations.