Executive Summary
The deep learning community routinely trains multiple instances of the same architecture with different random seeds to construct ensembles, estimate uncertainty, or validate robustness. This paper demonstrates that for large language models, this practice yields rapidly diminishing returns. Activation fingerprint analysis reveals that independently trained models converge to near-identical internal representations, with cross-seed cosine similarity exceeding 0.999 in stable training regimes.
The convergence is not merely in output behavior but in the geometry of internal activations. Models trained from different initializations traverse different early trajectories but synchronize their phase boundaries — the transitions between chaotic, transition, and stable regimes — within plus or minus 50 gradient steps. By the time training enters the stable regime, the models are functionally identical in activation space.
The practical implications are substantial. Final validation loss coefficients of variation measure 0.41% at 124M parameters, 2.43% at 1.5B, and 1.53% at 7B: under 2.5% at every scale, and narrowing again from 1.5B to 7B, consistent with larger models having fewer viable basins in the loss landscape. At 7B parameters, 60-80% of ensemble training compute can be eliminated by recognizing that additional seeds provide negligible diversity in the quantities that matter for downstream performance.
Key Contributions and Methodology
The paper introduces activation fingerprint convergence analysis as a tool for detecting ensemble collapse. Rather than comparing models only at the output level — where different predictions can mask identical internal representations — the method tracks layer-wise activation geometry throughout training. This reveals convergence patterns invisible to standard evaluation metrics.
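The layer-wise comparison described above can be sketched as follows. This is a minimal illustration, not the paper's exact recipe: the flattening of activation tensors and the per-layer cosine metric are assumptions, and `activation_fingerprint_similarity` is a hypothetical helper name. It assumes each seed's activations have already been collected on the same fixed probe batch.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened activation tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def activation_fingerprint_similarity(acts_seed_a, acts_seed_b):
    """Layer-wise cosine similarity between two seeds' activations.

    Each argument is a list of arrays, one per layer, computed on the
    same probe batch (e.g. flattened [batch, seq, hidden] tensors).
    """
    return [cosine_similarity(a, b) for a, b in zip(acts_seed_a, acts_seed_b)]

# Toy data standing in for two seeds late in training (stable regime):
# seed B's activations differ from seed A's only by tiny noise.
rng = np.random.default_rng(0)
acts_a = [rng.normal(size=(8, 128)) for _ in range(4)]          # 4 layers
acts_b = [layer + 1e-3 * rng.normal(size=layer.shape) for layer in acts_a]

sims = activation_fingerprint_similarity(acts_a, acts_b)
print(min(sims))  # close to 1.0 once representations have converged
```

Comparing per layer rather than on concatenated activations keeps the fingerprint sensitive to a single divergent layer that averaging would wash out.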
The experimental design trains multiple seeds across three model scales (GPT-2 124M, Qwen 2.5-1.5B, Qwen 2.5-7B) and tracks cross-seed activation similarity at every checkpoint. Phase boundary synchronization is measured by identifying regime transitions independently for each seed and computing the temporal spread. The coefficient of variation in final validation loss provides a scale-dependent measure of effective ensemble diversity.
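The two summary statistics from the paragraph above reduce to a few lines. A minimal sketch, assuming each seed's regime-transition step has already been identified (the change-point detection itself is not shown), with made-up per-seed numbers for illustration:

```python
import numpy as np

def boundary_spread(transition_steps):
    """Temporal spread of a regime boundary across seeds, in gradient steps."""
    steps = np.asarray(transition_steps, dtype=float)
    return float(steps.max() - steps.min())

def coefficient_of_variation(final_losses):
    """CV of final validation loss across seeds, as a percentage."""
    losses = np.asarray(final_losses, dtype=float)
    return 100.0 * losses.std(ddof=1) / losses.mean()

# Hypothetical measurements, one entry per seed:
chaotic_to_transition = [1180, 1210, 1225, 1195]   # gradient steps
print(boundary_spread(chaotic_to_transition))       # -> 45.0, within +/-50

final_val_loss = [2.912, 2.921, 2.905, 2.918]
print(round(coefficient_of_variation(final_val_loss), 2))  # -> 0.24
```

The sample standard deviation (`ddof=1`) is the natural choice here, since only a handful of seeds are available at each scale.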
The methodology distinguishes between output-level agreement (which can be high even when internal representations differ) and geometric convergence (which indicates that models have found the same solution in activation space). This distinction is critical: output-level ensembles provide value only when component models achieve diversity in their internal representations, not merely in their surface predictions.
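The output-versus-geometry distinction can be made concrete with a toy example: permuting the hidden units of a linear two-layer model (and permuting the next layer's weights to match) leaves every prediction unchanged while moving the representation elsewhere in activation space. This is an illustrative construction, not the paper's measurement protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny linear "model": hidden = x @ W, logits = hidden @ V
x = rng.normal(size=(16, 32))          # probe batch
W = rng.normal(size=(32, 64))
V = rng.normal(size=(64, 10))

# Model B computes the same function with its hidden units permuted.
perm = rng.permutation(64)
W_b, V_b = W[:, perm], V[perm, :]

hidden_a, hidden_b = x @ W, x @ W_b
logits_a, logits_b = hidden_a @ V, hidden_b @ V_b

# Output level: the two models agree on every prediction...
agreement = float(np.mean(logits_a.argmax(1) == logits_b.argmax(1)))

# ...but their hidden representations are far apart in activation space.
cos = float(hidden_a.ravel() @ hidden_b.ravel()
            / (np.linalg.norm(hidden_a) * np.linalg.norm(hidden_b)))
print(agreement)            # -> 1.0
print(abs(cos) < 0.5)       # cosine well below 1 despite identical outputs
```

The converse case, which the paper's method targets, is seed ensembles whose outputs look superficially diverse while their activation geometry has already collapsed to a single solution.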
Key Findings
- Activation convergence: Cross-seed cosine similarity exceeds 0.999 in stable training regimes across all model scales tested
- Phase synchronization: Regime boundaries (chaotic-to-transition, transition-to-stable) synchronize across seeds within ±50 gradient steps
- Validation loss convergence: Final validation loss CV of 0.41% (124M), 2.43% (1.5B), and 1.53% (7B), small at every scale and narrowing from 1.5B to 7B
- Scale-dependent collapse: Larger models exhibit faster and tighter convergence, consistent with fewer viable loss landscape basins at scale
- Compute savings: 60-80% of ensemble training compute at 7B can be eliminated without meaningful loss of ensemble diversity
Key References
Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. Advances in Neural Information Processing Systems.
Fort, S., Hu, H., and Lakshminarayanan, B. (2019). Deep Ensembles: A Loss Landscape Perspective. arXiv preprint.
Neyshabur, B., Sedghi, H., and Zhang, C. (2020). What Is Being Transferred in Transfer Learning? Advances in Neural Information Processing Systems.
Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2020). Linear Mode Connectivity and the Lottery Ticket Hypothesis. International Conference on Machine Learning.
Entezari, R., Sedghi, H., Saukh, O., and Neyshabur, B. (2022). The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks. International Conference on Learning Representations.