Entangled Directions: Concept-Pure Discrimination Geometry Masks Universal Activation Entanglement

This paper has been superseded

This paper has been consolidated into Universal Entanglement in Transformer Activation Space (DOI: 10.5281/zenodo.19409951). The content below is retained for reference.

Executive Summary

Linear probing assumes that concept-separability in the classifier implies concept-separability in the activations. This paper shows that assumption is false. Using multi-concept ridge regression with SVD decomposition on Qwen 2.5-7B, we find that directions can be concept-pure for discrimination (V-matrix purity greater than 0.96) while simultaneously carrying all concepts in their activations. The V-matrix shows what the classifier uses each direction for; the damage matrix shows what each direction actually carries.
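The V-matrix analysis described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `ridge_svd_purity`, the regularization strength `lam`, the mean-centering step, and the definition of purity as the largest squared entry of each right singular vector are all assumptions made for the example, since the summary does not state them.

```python
import numpy as np

def ridge_svd_purity(X, Y, lam=1.0):
    """Fit one ridge probe for all concepts, then decompose it with SVD.

    X: (n, d) activations. Y: (n, k) concept labels as 0/1 columns.
    Returns U (d, r), S (r,), Vt (r, k) and a per-direction purity score.
    """
    d = X.shape[1]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Closed-form multi-output ridge solution: W has shape (d, k).
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    # Columns of U are directions in activation space; row j of Vt says
    # how direction j is distributed across the k concepts.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # ASSUMED purity definition: rows of Vt have unit norm, so the largest
    # squared entry is the share of direction j used for a single concept.
    purity = (Vt ** 2).max(axis=1)
    return U, S, Vt, purity
```

Under this assumed definition, a purity near 1 means the classifier uses a direction almost entirely for one concept. The score says nothing about what the activations along that direction carry.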

The damage matrix is constructed by projecting out each SVD direction in turn and measuring the leave-one-out classification accuracy loss for every concept. Across all direction-concept pairs, the minimum cross-concept damage is 38.8%, so no direction can be removed without damaging every concept. This discrimination-activation dissociation means that Iterative Null-Space Projection (INLP) and related methods that identify concept-pure directions for intervention are operating on the wrong geometry: they find directions that discriminate one concept, then assume those directions carry only that concept.
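A damage matrix of this kind could be computed along the following lines. This is a hedged sketch under several assumptions that the summary does not confirm: "leave-one-out" is read as removing one direction at a time, accuracy is measured on a simple held-out split, the probe is a ridge regressor refit after each projection and thresholded at 0.5, and damage is reported as an absolute accuracy drop. The helper names `fit_ridge`, `concept_accuracy`, and `damage_matrix` are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge fit on mean-centered data."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc = X - mu_x
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (Y - mu_y))
    return W, mu_x, mu_y

def concept_accuracy(X_tr, Y_tr, X_te, Y_te, lam=1.0):
    """Held-out accuracy for each concept, shape (k,)."""
    W, mu_x, mu_y = fit_ridge(X_tr, Y_tr, lam)
    pred = ((X_te - mu_x) @ W + mu_y) > 0.5
    return (pred == (Y_te > 0.5)).mean(axis=0)

def damage_matrix(X_tr, Y_tr, X_te, Y_te, directions, lam=1.0):
    """directions: (d, r) with unit-norm columns.

    Returns an (r, k) array whose entry [j, c] is the absolute drop in
    accuracy on concept c after projecting direction j out of the data.
    """
    base = concept_accuracy(X_tr, Y_tr, X_te, Y_te, lam)
    rows = []
    for j in range(directions.shape[1]):
        u = directions[:, [j]]            # (d, 1)
        P_tr = X_tr - (X_tr @ u) @ u.T    # remove the component along u
        P_te = X_te - (X_te @ u) @ u.T
        # ASSUMPTION: the probe is refit on the projected activations.
        rows.append(base - concept_accuracy(P_tr, Y_tr, P_te, Y_te, lam))
    return np.array(rows)
```

Reading the result row by row gives the quantity the summary refers to: if every entry in every row is well above zero, no single direction can be removed without hurting all concepts.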

This establishes a fundamental limitation on direction-based concept editing. The geometry that supports classification is not the geometry that carries information. Any intervention that removes a concept-pure direction will damage all concepts, not just the targeted one.

Key Findings

V-matrix purity greater than 0.96: SVD directions are concept-pure for discrimination — the classifier uses each direction for a single concept
Minimum cross-concept damage 38.8%: Every direction, when removed, damages every concept — activations are universally entangled
Discrimination-activation dissociation: What a direction classifies and what it carries are different geometries
Damage matrix as measurement tool: Leave-one-out projection reveals the true information content of each direction
INLP limitation established: Concept-pure directions for discrimination are not concept-pure for information — intervention on them causes collateral damage

Key References

Ravfogel et al. (2020) — INLP: Iterative Null-Space Projection for concept erasure
McEntire (2026) — The Concentration Barrier (AI-11): effective dimensionality bounds on selectivity
McEntire (2026) — Universal Entanglement in the Informative Subspace (AI-26): cross-model empirical validation
McEntire (2026) — The Entanglement Theorem (AI-27): formal proof that entanglement is geometric
