Research · Per Ardua

Training Trajectory Structure: Regime Detection, Cross-Seed Convergence, and Speculative Weight Prediction

Two papers consolidated -- regime detection, ensemble collapse, and speculative prediction unified


Absorbs: AI-1: Leap+Verify -- Regime-Adaptive Speculative Weight Prediction for Accelerating Neural Network Training; AI-2: Training Once Is Enough -- Activation Fingerprint Convergence Reveals Ensemble Collapse in Language Model Training

Executive Summary

Training a neural network is not a smooth descent. The optimization trajectory alternates between stable basins where gradients are predictable and chaotic transitions where they are not. This consolidated paper presents evidence that these regimes are detectable in real time using activation fingerprints, synchronized across independent training runs, and exploitable for speculative weight prediction.

Using 100 fixed probe inputs to capture activation snapshots at 40 checkpoints across GPT-2 124M (5 seeds), Qwen 2.5-1.5B (4-5 seeds), and Qwen 2.5-7B (5 seeds), three results are established. First, a three-regime taxonomy (chaotic, transition, stable) based on thresholded cosine similarity generalizes across two orders of magnitude in model scale. Second, independently initialized seeds converge to nearly identical final states with phase boundaries synchronized to within +/-50 gradient steps. Third, speculative weight prediction using finite-difference extrapolation achieves 60-90% strict acceptance at 7B in stable regimes, while momentum-based prediction fails catastrophically at all scales with 100-10,000x loss inflation.

Three-Regime Taxonomy (from former AI-1)

Training dynamics partition into three regimes detectable via a single scalar metric: the cosine similarity between consecutive activation fingerprints. Chaotic regimes show rapid representational change (consecutive-checkpoint similarity of 0.976 at 7B), stable regimes show near-stationarity (0.9999 at 7B), and transition regimes fall between the two. The opposing trends -- chaotic similarity decreasing with scale, stable similarity increasing -- are consistent with a funnel-shaped loss landscape: larger models explore more broadly during chaos but converge more tightly once settled.
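The detector can be sketched as follows. The thresholds below are illustrative placeholders -- the paper notes its calibrated values are not reported -- and the fingerprint construction (flattening probe activations into one vector) is an assumption about the method, not a quote of it.

```python
import numpy as np

# Hypothetical thresholds; the paper's calibrated values are not reported.
CHAOTIC_MAX = 0.99   # similarity below this -> chaotic
STABLE_MIN = 0.999   # similarity at or above this -> stable

def fingerprint(activations: np.ndarray) -> np.ndarray:
    """Flatten probe activations (n_probes x hidden_dim) into one vector."""
    return activations.ravel()

def classify_regime(prev_fp: np.ndarray, curr_fp: np.ndarray) -> str:
    """Label the interval between two consecutive checkpoint fingerprints."""
    cos = float(prev_fp @ curr_fp /
                (np.linalg.norm(prev_fp) * np.linalg.norm(curr_fp)))
    if cos < CHAOTIC_MAX:
        return "chaotic"
    if cos >= STABLE_MIN:
        return "stable"
    return "transition"
```

Because the metric is a single scalar per checkpoint pair, it can run online during training at negligible cost relative to a gradient step.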

Cross-Seed Convergence (from former AI-2)

Independently initialized seeds converge to the same final validation loss, with a coefficient of variation (CV) of 0.41% (124M), 2.43% (1.5B), and 1.53% (7B). Phase boundaries are synchronized across seeds -- all five GPT-2 seeds enter transition at exactly step 100 (zero variance). Cross-seed regime agreement reaches 96.9% at 1.5B and 93.0% at 7B. At 7B, a catastrophic outlier (seed 44, with a 2x loss spike) is fully reabsorbed by step 500, demonstrating the strength of the landscape's attractor basin.
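The agreement statistic is simple to state precisely: the fraction of matched checkpoints at which two seeds carry the same regime label. A minimal sketch (illustrative, not the paper's code):

```python
def regime_agreement(labels_a: list, labels_b: list) -> float:
    """Fraction of checkpoints where two seeds' regime labels match.

    labels_a, labels_b: per-checkpoint regime labels (e.g. "chaotic",
    "transition", "stable") from two independently seeded runs.
    """
    assert len(labels_a) == len(labels_b), "runs must share checkpoint grid"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```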

Universal Momentum Catastrophe

Momentum-based weight prediction -- extrapolating Adam's exponential moving averages -- fails catastrophically at every scale tested. At K=5, predicted losses are 122-227x higher than actual. At K=100, inflation reaches 2,678-10,764x. The displacement grows linearly with K but the region of validity does not. This finding has implications for weight nowcasting methods (WNN, NiNo, XGrad): optimizer-state extrapolation is fundamentally unsuitable for speculative weight prediction.
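The failing baseline amounts to freezing Adam's current update direction and jumping K steps along it at once. The sketch below is a generic rendering under that reading; the function name, the use of bias-corrected moments, and the hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def adam_extrapolate(w: np.ndarray, m: np.ndarray, v: np.ndarray,
                     k: int, lr: float = 1e-4, eps: float = 1e-8) -> np.ndarray:
    """Jump K steps along the frozen Adam direction (the failing baseline).

    w: current weights; m, v: Adam first/second moment estimates
    (assumed already bias-corrected for simplicity). The displacement
    grows linearly with k, but the neighborhood in which the frozen
    direction remains valid does not -- the mismatch the paper measures
    as 100-10,000x loss inflation at large k.
    """
    return w - k * lr * m / (np.sqrt(v) + eps)
```

The linear-in-K displacement makes the failure mode predictable: each additional speculated step carries the weights further from the region where the cached moments describe the local geometry.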

Regime-Dependent Prediction Success

Finite-difference predictors succeed where momentum fails, but only in favorable regimes. At 7B in stable regime, linear prediction achieves 60-90% strict acceptance with no degradation at prediction depths up to 100 steps. Linear outperforms quadratic (lower cross-seed variance). Acceptance rates increase with model scale within comparable regimes. Adaptive acceptance (within one standard deviation) is 100% across all K values and seeds at 7B.
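The linear predictor continues the most recent weight delta for K steps; a candidate is then kept or discarded by an acceptance test on the resulting loss. The extrapolation formula below follows directly from "finite-difference"; the strict-acceptance rule (predicted loss within a small tolerance of the true loss) is a hypothetical stand-in, since the paper's exact criterion is not reproduced here.

```python
import numpy as np

def linear_predict(w_prev: np.ndarray, w_curr: np.ndarray, k: int) -> np.ndarray:
    """First-order finite-difference extrapolation: continue the last
    checkpoint-to-checkpoint weight delta for k further steps."""
    return w_curr + k * (w_curr - w_prev)

def strict_accept(loss_pred: float, loss_actual: float, tol: float = 0.01) -> bool:
    """Hypothetical strict-acceptance rule: keep the speculated weights
    only if their loss is within tol (relative) of the true loss."""
    return loss_pred <= loss_actual * (1 + tol)
```

In the stable regime the delta barely changes between checkpoints, which is why a first-order model suffices there and why its lower variance beats the quadratic variant across seeds.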

Key Findings

  • Three-regime taxonomy: Chaotic, transition, and stable regimes detected via activation fingerprint cosine similarity, validated across 124M-7B
  • Cross-seed convergence: Validation loss CV 0.41-2.43%, phase boundaries synchronized to +/-50 steps
  • Universal momentum catastrophe: Adam extrapolation produces 100-10,000x loss inflation at all scales -- qualitative, not marginal, failure
  • Regime-dependent prediction: Linear extrapolation achieves 60-90% strict acceptance at 7B in stable regime
  • Scale improves prediction: Acceptance rates increase with model scale within comparable regimes
  • No wall-clock speedup claimed: Acceptance rates demonstrate exploitable structure; whether prediction is cheaper than K gradient steps remains open

Limitations

  • All experiments use 2000 training steps on WikiText-103 -- toy scale relative to production training
  • Scaling claims rest on three data points (124M, 1.5B, 7B) from two architecture families; no claim of extrapolation to frontier scale
  • Regime thresholds calibrated on GPT-2 and carried forward without recalibration; specific values not reported
  • Cross-seed fingerprint comparison at matched checkpoints not directly measured
  • No wall-clock speedup measurement

Key References

Waterland, A., et al. (2014)

ASC: Automatically Scalable Computation. ASPLOS '14.

Garipov, T., et al. (2018)

Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs. NeurIPS.

Ainsworth, S. K., et al. (2023)

Git Re-Basin: Merging Models Modulo Permutation Symmetries. ICLR.

Frankle, J. & Carbin, M. (2019)

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR.

Cohen, J. M., et al. (2021)

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. ICLR.
