Research · Per Ardua

GenAI Is Socially Awkward: RLHF Instruction Tuning Damages Social Cognition at Small Scale by Suppressing Pragmatic Inference

RLHF turns small models into Sheldon Cooper


Executive Summary

We measure social cognition in language models by testing their ability to select socially appropriate responses in controlled multiple-choice scenarios. Across 50 position-randomized vignettes evaluated on nine model-condition pairs spanning 7B to 72B parameters, we find that RLHF instruction tuning hurts social cognition at 7B (-18 percentage points, from 72% to 54%) but helps at 72B (+6 points, from 84% to 90%).

The deficit is not one of knowledge but of processing mode: instruction-tuned models at small scale optimize for literal compliance at the expense of the fuzzy contextual pattern matching that social reasoning requires. Forcing models to explain their reasoning costs an additional ~8 points at both scales — a rationalization bias in which explicit deliberation overrides correct intuitive judgment.

The result reframes RLHF alignment not as a capability enhancement but as a noise-tolerance tradeoff: compliance training reduces the variance that social cognition depends on.

The Sheldon Cooper Effect

Consider two characters. One has encyclopedic knowledge, answers questions with mechanical precision, and consistently misreads the room. The other has less formal training but navigates social situations with intuitive ease. This paper shows that RLHF instruction tuning produces exactly this pattern in 7-billion-parameter language models.

The instruction-tuned Mistral 7B scores 54% on social cognition vignettes where its base counterpart scores 72%. The tuning that makes the model more helpful, harmless, and precise also makes it worse at reading social context. Not because it lacks the knowledge — a stochastic resonance probe confirms the signal is present — but because compliance training narrows the distribution over which the model reasons.

The effect reverses at scale. At 72B parameters, instruction tuning helps: 90% versus 84% for the base model. The larger model has enough capacity for both precise compliance and contextual social reasoning. The smaller model does not, and compliance wins.

Key Findings

  • RLHF damages social cognition at 7B: Mistral 7B instruct scores 54% vs. 72% base (-18 points). At 72B, instruction tuning helps: 90% vs. 84% (+6 points)
  • Rationalization bias: Prompting models to explain reasoning costs ~8 points at both scales. Explicit deliberation overwrites correct intuitive judgment. Letter-only prompts recover most of the loss (62% at 7B, 98% at 72B)
  • Stochastic resonance confirms suppressed signal: Adding noise to 7B instruct recovers +4 points (baseline 58%, peak 62% at T=1.0), while 7B base shows pure monotonic degradation from 86%. Noise rescues performance only where an intact but suppressed signal exists
  • C1-gated stochastic resonance: Pattern consistent with the sufficient conditions from the Communicative Variance paper — noise benefit requires an intact but suboptimally accessed signal
  • Scale resolves the tradeoff: At 72B, the model has sufficient capacity for both instruction-following and social inference. No rescue needed
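The stochastic resonance logic behind these findings can be illustrated with a toy probe. This is a sketch, not the paper's actual experiment: the option logits below are invented to stand in for a model whose correct answer is either suppressed (second-highest logit) or intact (highest logit). Sampling across a temperature sweep then shows the diagnostic signature: a mid-temperature accuracy peak only when the signal is present but suppressed.

```python
import math
import random

def sample_choice(logits, temperature, rng):
    """Sample an option index from softmax(logits / T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    cum = 0.0
    for i, e in enumerate(exps):
        cum += e / total
        if r < cum:
            return i
    return len(exps) - 1

def sweep_accuracy(item_logits, correct, temperatures, n_samples=2000, seed=0):
    """Estimate accuracy at each temperature by repeated sampling.

    item_logits: per-vignette option logits (illustrative stand-ins,
    not values measured in the paper).
    correct: index of the socially appropriate option for every item.
    """
    rng = random.Random(seed)
    results = {}
    for t in temperatures:
        hits = 0
        for logits in item_logits:
            for _ in range(n_samples):
                hits += sample_choice(logits, t, rng) == correct
        results[t] = hits / (n_samples * len(item_logits))
    return results

# Suppressed signal: correct option ranks second -> peak at intermediate T.
# Intact signal: correct option ranks first -> monotonic decay with T.
```

Under these assumed logits, the suppressed-signal item is most often answered correctly at an intermediate temperature, while the intact-signal item only degrades as noise increases — the same qualitative pattern the paper reports for 7B instruct versus 7B base.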

Methodology

Fifty social cognition vignettes with position-randomized multiple-choice responses were evaluated across nine model-condition pairs. Models: Mistral 7B (base and instruct), Qwen 72B (base and instruct), and Mixtral 8x7B. Three prompt modes: standard, chain-of-thought (explain your reasoning), and letter-only. A stochastic resonance temperature sweep from T=0.5 to T=2.0 probes whether the social signal is absent or merely suppressed.
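Position randomization matters because models can exhibit slot biases (e.g. favoring option "A") that would distort accuracy. A minimal sketch of how such an evaluation harness might shuffle options and score letter predictions — the function names and structure here are illustrative assumptions, not the paper's released code:

```python
import random
import string

def randomize_item(options, correct_idx, rng):
    """Shuffle answer options; return (letter -> option text, correct letter).

    Shuffling the answer slot per item prevents a fixed positional bias
    (e.g. always answering 'A') from inflating or deflating accuracy.
    """
    order = list(range(len(options)))
    rng.shuffle(order)
    lettered = {string.ascii_uppercase[i]: options[j] for i, j in enumerate(order)}
    correct_letter = string.ascii_uppercase[order.index(correct_idx)]
    return lettered, correct_letter

def score(predictions, answer_key):
    """Accuracy in percentage points over letter predictions."""
    hits = sum(p == a for p, a in zip(predictions, answer_key))
    return 100.0 * hits / len(answer_key)
```

In the letter-only prompt mode, the model's output is reduced to a single letter before scoring, which is the condition the paper reports as recovering most of the chain-of-thought loss.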

This is a pragmatic reasoning gap, not an emotional one. We measure whether models can identify what a socially competent human would do — reading implicature, detecting deflection, recognizing face-saving behavior, and inferring unstated social dynamics.

Key References

  • McEntire (2026) — Communicative Variance: sufficient conditions for stochastic resonance benefit (C1 gating)
  • McEntire (2026) — Structural Transfer: domain-invariant structural signatures in activation space
  • McEntire (2026) — Constellation Composition: stochastic resonance at sigma=0.020 in model composition
  • Ouyang et al. (2022) — Training language models to follow instructions with human feedback

Download Full Paper

Access the complete research paper with detailed methodology, empirical evidence, and formal proofs.

Download PDF