Research · Per Ardua

Entangled Directions: Concept-Pure Discrimination Geometry Masks Universal Activation Entanglement

Every direction, even a concept-pure one, carries every concept

AI-25 · Activation Geometry

Executive Summary

Linear probing assumes that concept-separability in the classifier implies concept-separability in the activations. This paper shows that assumption is false. Using multi-concept ridge regression with SVD decomposition on Qwen 2.5-7B, we find that directions can be concept-pure for discrimination (V-matrix purity greater than 0.96) while simultaneously carrying all concepts in their activations. The V-matrix shows what the classifier uses each direction for; the damage matrix shows what each direction actually carries.
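As a rough illustration of the discrimination side of this analysis, the following sketch fits a multi-concept ridge regression and decomposes its weight matrix with an SVD. Everything here is an assumption made for illustration: the activations are synthetic rather than taken from Qwen 2.5-7B, the names `ridge_weights` and `v_purity` are hypothetical, and purity is taken to be the largest squared loading in each column of V, which may differ from the definition used in the full paper.

```python
import numpy as np

def ridge_weights(X, Y, lam=1.0):
    """Closed-form multi-output ridge: W = (X^T X + lam * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def v_purity(V):
    """Assumed purity measure: largest squared loading in each column of V.

    Columns of V are unit vectors over concepts, so the value lies in
    [1/k, 1]; a value of 1 means the direction is read out for one concept only.
    """
    return (V ** 2).max(axis=0)

# Synthetic stand-in for activations (assumption: not real model data).
rng = np.random.default_rng(0)
n, d, k = 200, 16, 3                        # examples, activation dim, concepts
Y = rng.choice([-1.0, 1.0], size=(n, k))    # independent binary concept labels
concept_dirs = rng.normal(size=(k, d))      # hypothetical concept encodings
X = Y @ concept_dirs + rng.normal(size=(n, d))

W = ridge_weights(X, Y)                     # shape (d, k)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T                                    # column i: concept loadings of direction i
purity = v_purity(V)
print("purity per direction:", purity)
```

The columns of `U` are the directions in activation space, and the matching columns of `V` describe how the classifier mixes each direction across concepts. Synthetic data of this kind is not expected to reproduce the purity values reported above.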

The damage matrix is constructed by projecting out each SVD direction in turn and measuring the resulting loss in leave-one-out classification accuracy for every concept. Across all direction-concept pairs, the minimum cross-concept damage is 38.8%, so no direction can be removed without damaging every concept. This discrimination-activation dissociation means that INLP and related methods that identify concept-pure directions for intervention operate on the wrong geometry: they find directions that discriminate one concept, then assume those directions carry only that concept.
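The damage-matrix construction described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the data are synthetic, the classifier is a ridge regression on ±1 labels, the directions are estimated once on the full data rather than inside each fold, and the helper names `loo_accuracy` and `damage_matrix` are hypothetical. Leave-one-out predictions use the exact ridge identity based on the hat matrix, so no explicit refitting loop is needed.

```python
import numpy as np

def loo_accuracy(X, Y, lam=1.0):
    """Per-concept leave-one-out accuracy of a ridge classifier on +/-1 labels.

    Uses the exact ridge identity
        yhat_loo_i = (yhat_i - h_ii * y_i) / (1 - h_ii),
    where H = X (X^T X + lam * I)^-1 X^T is the hat matrix.
    """
    d = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
    h = np.diag(H)[:, None]
    Y_loo = (H @ Y - h * Y) / (1.0 - h)
    return (np.sign(Y_loo) == Y).mean(axis=0)

def damage_matrix(X, Y, directions, lam=1.0):
    """D[i, j] = LOO accuracy lost on concept j after projecting out direction i."""
    base = loo_accuracy(X, Y, lam)
    rows = []
    for u in directions.T:
        u = u / np.linalg.norm(u)
        X_proj = X - np.outer(X @ u, u)     # remove the component along u
        rows.append(base - loo_accuracy(X_proj, Y, lam))
    return np.array(rows)

# Synthetic stand-in for activations (assumption: not real model data).
rng = np.random.default_rng(0)
n, d, k = 200, 16, 3
Y = rng.choice([-1.0, 1.0], size=(n, k))
concept_dirs = rng.normal(size=(k, d))      # hypothetical concept encodings
X = Y @ concept_dirs + rng.normal(size=(n, d))

# Directions from the SVD of the multi-concept ridge weights.
W = np.linalg.solve(X.T @ X + np.eye(d), X.T @ Y)
U, s, Vt = np.linalg.svd(W, full_matrices=False)

D = damage_matrix(X, Y, U)                  # rows: directions, columns: concepts
print(np.round(D, 3))
```

Reading the result row by row shows what each direction carries: a direction that held only one concept would give a row with a single large entry, whereas universal entanglement would appear as large entries in every column. The synthetic setup here is only meant to show the mechanics and is not expected to reproduce the 38.8% figure.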

This establishes a fundamental limitation on direction-based concept editing. The geometry that supports classification is not the geometry that carries information. Any intervention that removes a concept-pure direction will damage all concepts, not just the targeted one.

Key Findings

  • V-matrix purity greater than 0.96: SVD directions are concept-pure for discrimination — the classifier uses each direction for a single concept
  • Minimum cross-concept damage 38.8%: Every direction, when removed, damages every concept — activations are universally entangled
  • Discrimination-activation dissociation: What a direction classifies and what it carries are different geometries
  • Damage matrix as measurement tool: Leave-one-out projection reveals the true information content of each direction
  • INLP limitation established: Concept-pure directions for discrimination are not concept-pure for information — intervention on them causes collateral damage

Key References

  • Ravfogel et al. (2020) — INLP: Iterative Null-Space Projection for concept erasure
  • McEntire (2026) — The Concentration Barrier (AI-11): effective dimensionality bounds on selectivity
  • McEntire (2026) — Universal Entanglement in the Informative Subspace (AI-26): cross-model empirical validation
  • McEntire (2026) — The Entanglement Theorem (AI-27): formal proof that entanglement is geometric
