QUICK REVIEW

[Paper Review] Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging

Kalai, Adam Tauman, Nachum, Ofir|ArXiv.org|Sep 4, 2025
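Topic Modeling | Cited by 14
One-Sentence Summary

This paper argues that language-model hallucinations are inevitable under standard pretraining and evaluation regimes, links them to binary classification errors, and advocates socio-technical changes to benchmark scoring in order to reduce them.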

ABSTRACT

Layer-0 “suppressor” heads explain why LMs trade factuality for hedging. In GPT-2 Medium, ablating heads {0:2, 0:4, 0:7} increases logit-difference (ΔLD) by 0.40–0.85 across four single-token probes and improves calibration (ECE 0.122 → 0.091). Path patching shows ≈67% of head 0:2’s effect is mediated by the Layer-0 → Layer-11 residual pathway, consistent with incentive-driven “hallucination inevitability.” Mistral-7B exhibits an architecture-adapted variant. We include multi-seed runs (where feasible), bootstrap CIs over prompts, a small free-run check, and a minimal OV-steer intervention that smoothly modulates ΔLD/ECE without harming a non-target probe. Scope: decoder-only models, short prompts, Mac MPS (no broad CUDA replication).
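The abstract reports calibration as ECE (expected calibration error). As a minimal sketch of how such a number is typically computed, assuming equal-width confidence bins and a per-prompt correctness flag (the function name, bin count, and toy data below are illustrative assumptions, not the authors' evaluation code):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Each bin contributes in proportion to how many predictions it holds.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): four predictions at 0.9 confidence, three correct.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, True, False]
print(expected_calibration_error(toy_confidences, toy_correct))
```

A lower value means stated confidence tracks observed accuracy more closely. The toy data only exercises the function and says nothing about the reported 0.122 and 0.091 figures.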

Research Motivation and Goals

  • Explain why hallucinations arise from the training objective and evaluation setup in large language models.
  • Show how pretraining errors reduce to Is-It-Valid classification and derive lower bounds on hallucination rates.
  • Analyze post-training persistence of hallucinations under current benchmarks.
  • Propose a socio-technical mitigation: rethink benchmark scoring so that it stops penalizing expressions of uncertainty and rewarding guesses.

Proposed Method

  • Establish a reduction from the Is-It-Valid (IIV) binary classification problem to generation, deriving a bound that relates the generative error rate to the IIV misclassification rate.
  • Generalize the IIV reduction to include prompts and contexts (c, r) with joint distributions and threshold-based classifiers.
  • Characterize factors driving base-model errors (arbitrary facts, poor models, calibration) and relate them to hallucination inevitability.
  • Provide theoretical results (Theorems 1–4) that connect pretraining and post-training dynamics to hallucination rates under various settings.
  • Discuss calibration measures and how cross-entropy training implies a small miscalibration parameter (delta) under typical training.
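To make the two quantities in the IIV reduction concrete, here is a rough sketch that computes both on a toy distribution. It assumes an IIV test set that is half valid strings drawn from the training distribution and half errors drawn uniformly, and a classifier that labels a string valid when the model's probability exceeds 1/|E|. The mixture, the threshold, all names, and the toy numbers are assumptions of this sketch. It does not reproduce the paper's bound, only its two inputs.

```python
def iiv_error_rate(p_train, p_model, valid, errors):
    """Misclassification rate of the thresholded model on a 50/50 valid/error mix."""
    threshold = 1.0 / len(errors)  # assumed threshold: 1/|E|

    def predicts_valid(x):
        return p_model.get(x, 0.0) > threshold

    # Valid half: training-distribution mass on strings the classifier rejects.
    missed_valid = sum(p_train.get(x, 0.0) for x in valid if not predicts_valid(x))
    # Error half: fraction of uniformly drawn errors the classifier accepts.
    accepted_errors = sum(1 for x in errors if predicts_valid(x)) / len(errors)
    return 0.5 * missed_valid + 0.5 * accepted_errors


def generative_error_rate(p_model, errors):
    """Probability mass the model places on the error set when generating."""
    return sum(p_model.get(x, 0.0) for x in errors)


# Hypothetical toy setup: two valid strings, four possible errors.
valid = ["a", "b"]
errors = ["e1", "e2", "e3", "e4"]
p_train = {"a": 0.5, "b": 0.5}
p_model = {"a": 0.5, "b": 0.1, "e1": 0.3, "e2": 0.1, "e3": 0.0, "e4": 0.0}

print("IIV misclassification rate:", iiv_error_rate(p_train, p_model, valid, errors))
print("generative error rate:", generative_error_rate(p_model, errors))
```

In this toy case the model both under-weights a valid string ("b") and over-weights an error ("e1"), so it makes mistakes as a classifier and as a generator. That coupling is what the reduction formalizes.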

Experimental Results

Research Questions

  • RQ1: Why are hallucinations statistically inevitable given standard pretraining objectives and training data?
  • RQ2: How do prompts and evaluation regimes affect the relationship between true errors and generated hallucinations?
  • RQ3: What are the statistical factors that drive base-model errors, and how do they translate into hallucination rates?
  • RQ4: How can benchmark design and scoring be modified to reduce the reinforcement of uncertain or guessed outputs?
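The incentive behind the last question can be illustrated with a small expected-score calculation. The sketch below assumes a scoring rule of +1 for a correct answer, 0 for abstaining, and a configurable penalty for a wrong answer. The penalty t/(1 - t), which makes answering worthwhile only above confidence t, is one possible rule chosen for illustration, since this summary does not specify one. All function names are hypothetical.

```python
def expected_score(confidence, wrong_penalty):
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty


def should_answer(confidence, wrong_penalty):
    """Answer only if it beats abstaining, which always scores 0."""
    return expected_score(confidence, wrong_penalty) > 0.0


def penalty_for_threshold(t):
    """Wrong-answer penalty that makes answering pay off only above confidence t."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must be in [0, 1)")
    return t / (1.0 - t)


# Binary grading (no penalty): even a 30%-confident guess beats abstaining.
print(should_answer(0.3, wrong_penalty=0.0))

# Penalized grading with a confidence target of 0.75.
penalty = penalty_for_threshold(0.75)
print(should_answer(0.3, penalty), should_answer(0.9, penalty))
```

Under binary grading, guessing never scores worse than abstaining, so a model tuned to the benchmark learns to guess. With a penalty, abstaining becomes the better choice at low confidence.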

Key Findings

  • Hallucination rates after pretraining are lower-bounded by a function of the fraction of singleton facts in training data and the size of the error set, implying inevitability under realistic data.
  • Post-training benchmarks that penalize uncertainty help perpetuate hallucinations, as models optimize for test-taking rather than abstaining when uncertain.
  • A calibrated base model can still exhibit hallucinations under standard cross-entropy objectives, and models that minimize this loss generally have a small miscalibration parameter (delta).
  • Prompted reductions extend the IIV framework and yield analogous bounds, showing the ubiquity of the phenomenon across dialogue settings.
  • The work connects supervised binary-classification misclassification with generative errors, offering a novel reduction that does not rely on Transformer specifics.
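The "fraction of singleton facts" in the first finding can be estimated directly from data. A minimal sketch, assuming each training observation has been reduced to a hashable fact identifier and that the rate is taken over all observations (both the representation and the choice of denominator are assumptions of this sketch):

```python
from collections import Counter


def singleton_rate(observations):
    """Fraction of observations whose fact appears exactly once in the data."""
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Hypothetical toy corpus: fact "a" is seen twice, "b" and "c" once each.
toy_observations = ["a", "a", "b", "c"]
print(singleton_rate(toy_observations))
```

Intuitively, a fact seen only once gives the model little evidence to separate it from a plausible falsehood. This is why the finding ties the lower bound to how common such facts are.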

This review was generated by AI and reviewed by a human editor.