Executive Summary
Model distillation — training a smaller model to replicate a larger model's behavior using its outputs — poses a growing security challenge for API-based AI providers. Traditional detection approaches attempt to identify distillation from the input side, classifying individual queries as legitimate or extractive. The Online Model Extraction Detection (OMED) impossibility theorem establishes that this approach cannot succeed: any individual query that a distillation attacker would make is also a query a legitimate user would make, making per-query classification provably impossible.
This paper sidesteps the OMED impossibility by monitoring the model's activation space rather than its inputs. Distillation attacks, regardless of their query strategy, produce distinctive geometric signatures in the victim model's activation manifold that legitimate usage patterns do not. By tracking these geometric invariants, the framework achieves AUC 1.000 (perfect separation between the attack and legitimate-usage score distributions) on systematic distillation attacks at model scales of 1.5B parameters and above.
The work is motivated in part by the February 2026 Anthropic disclosure, which documented over 16 million exchanges from approximately 24,000 fraudulent accounts engaged in systematic model extraction. The framework developed here provides continuous Proof-of-Humanity scoring as an alternative to binary authentication, creating an economic deterrent through a coverage-clustering tradeoff: attackers must choose between broad capability extraction (which produces detectable geometric signatures) and narrow extraction (which limits the utility of the distilled model).
Key Contributions and Methodology
The paper makes three contributions. First, it formalizes the distinction between input-side and activation-side detection, showing that the OMED impossibility applies only to the former. Activation-side detection exploits the fact that a model's internal geometry responds differently to systematic capability probing than to organic usage, even when the individual queries are identical.
Second, it develops a suite of topological and geometric metrics — including persistent homology summaries, activation coverage maps, and manifold curvature estimates — that collectively form a distillation fingerprint. These metrics are computed continuously over sliding windows of queries, producing a time-evolving Proof-of-Humanity score that degrades as extraction patterns accumulate.
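The sliding-window scoring described above can be sketched in miniature. The snippet below is illustrative only: it replaces the paper's persistent homology and curvature metrics with two cheap stand-ins (a variance-based coverage proxy and a nearest-neighbor spacing proxy), and the `poh_scores` combination rule is a hypothetical choice, not the paper's actual scoring function.

```python
import numpy as np

def window_fingerprint(acts):
    """Toy geometric summary of one window of activation vectors, shape (n, d).

    Stand-ins for the paper's metrics: total variance as a coverage proxy,
    mean nearest-neighbor distance as a clustering proxy.
    """
    centered = acts - acts.mean(axis=0)
    coverage = float(centered.var(axis=0).sum())
    d = np.linalg.norm(acts[:, None] - acts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # ignore self-distances
    nn_spacing = float(d.min(axis=1).mean())
    return coverage, nn_spacing

def poh_scores(activation_stream, window=32, stride=8):
    """Time-evolving Proof-of-Humanity score over sliding windows.

    Lower scores flag extraction-like behavior (broad coverage with tight
    local clustering). The combination rule here is purely illustrative.
    """
    scores = []
    for start in range(0, len(activation_stream) - window + 1, stride):
        cov, spacing = window_fingerprint(activation_stream[start:start + window])
        scores.append(spacing / (1.0 + cov))
    return np.array(scores)
```

As extraction patterns accumulate in the stream, successive windows would push the score downward, giving the degrading trajectory the text describes.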
Third, the paper formalizes the coverage-clustering tradeoff as an economic mechanism. Effective distillation requires broad coverage of the victim model's capability manifold, but broad coverage produces clustered activation patterns that are geometrically distinguishable from the diffuse patterns of legitimate usage. Attackers who avoid clustering must sacrifice coverage, reducing the value of their extraction. This creates a structural deterrent that does not depend on catching any individual query.
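A toy simulation makes the detectability half of the tradeoff concrete. The statistic below, the coefficient of variation of nearest-neighbor spacing, is an assumption chosen for this sketch (it is not the paper's fingerprint): a systematic sweep of a capability space yields near-uniform spacing, while organic, interest-driven usage yields irregular spacing.

```python
import numpy as np

def nn_spacing_cv(points):
    """Coefficient of variation of nearest-neighbor spacing.

    Systematic sweeps produce near-uniform spacing (CV near 0); organic
    usage produces irregular spacing (CV well above 0).
    """
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    nn = d.min(axis=1)
    return float(nn.std() / nn.mean())

rng = np.random.default_rng(7)
# Broad extraction: a systematic grid sweep of a 2-D capability space.
g = np.linspace(0.0, 1.0, 16)
attacker = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
# Legitimate usage: irregular, interest-driven query placement.
legit = rng.random((256, 2))

print(nn_spacing_cv(attacker))   # near 0: regular, systematic pattern
print(nn_spacing_cv(legit))      # clearly above 0: organic pattern
```

An attacker who breaks up the regular pattern to mimic the organic statistic necessarily leaves gaps in coverage, which is the stealth-versus-utility sacrifice the tradeoff describes.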
Key Findings
- OMED bypass: Activation-side monitoring sidesteps the OMED impossibility theorem by detecting geometric patterns rather than classifying individual queries
- Perfect detection at scale: AUC 1.000 on systematic distillation attacks at model scales of 1.5B parameters and above
- Continuous scoring: Proof-of-Humanity scoring provides continuous rather than binary classification, with perfect separation between attack and legitimate distributions
- Coverage-clustering tradeoff: Attackers face a fundamental tradeoff between extraction breadth (detectable) and extraction stealth (limited utility), creating an economic deterrent
- Motivated by real attacks: Framework addresses the scale of real-world extraction documented in the February 2026 Anthropic disclosure (16M+ exchanges, ~24K fraudulent accounts)
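The "AUC 1.000" and "perfect separation" findings above are two statements of the same fact, which a small worked example makes precise. The detector scores below are hypothetical; the point is that whenever every attack-window score exceeds every legitimate-window score, the empirical AUC is exactly 1.0.

```python
import numpy as np

def auc(pos, neg):
    """Empirical AUC: probability that a random positive (attack) score
    exceeds a random negative (legitimate) score, counting ties as half."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

# Hypothetical detector outputs with perfect separation: every attack
# window scores above every legitimate window.
attack_scores = np.array([0.91, 0.88, 0.97, 0.93])
legit_scores = np.array([0.12, 0.31, 0.27, 0.08])
print(auc(attack_scores, legit_scores))  # 1.0
```

Any overlap between the two score distributions would pull the AUC below 1.0, so the reported value is equivalent to a threshold existing that classifies every window correctly.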
Key References
Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., and Ristenpart, T. (2016). Stealing Machine Learning Models via Prediction APIs. USENIX Security Symposium.
Orekondy, T., Schiele, B., and Fritz, M. (2019). Knockoff Nets: Stealing Functionality of Black-Box Models. IEEE Conference on Computer Vision and Pattern Recognition.
Juuti, M., Szyller, S., Marchal, S., and Asokan, N. (2019). PRADA: Protecting Against DNN Model Stealing Attacks. IEEE European Symposium on Security and Privacy.
Carlini, N., et al. (2024). Stealing Part of a Production Language Model. International Conference on Machine Learning.
Anthropic (2026). Disclosure: Systematic Model Extraction via Fraudulent API Accounts. Anthropic Security Report.