QUICK REVIEW

[Paper Walkthrough] Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging

Kalai, Adam Tauman; Nachum, Ofir | arXiv.org | Sep 4, 2025

Topic Modeling | Cited by 14

One-Sentence Summary

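This paper argues that, under standard pretraining and evaluation regimes, language-model hallucinations are inevitable. It links them to binary classification errors and advocates socio-technical changes that revise benchmark scoring in order to reduce them.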

ABSTRACT

Layer-0 “suppressor” heads explain why LMs trade factuality for hedging. In GPT-2 Medium, ablating heads {0:2, 0:4, 0:7} increases logit-difference (ΔLD) by 0.40–0.85 across four single-token probes and improves calibration (ECE 0.122 → 0.091). Path patching shows ≈67% of head 0:2’s effect is mediated by the Layer-0 → Layer-11 residual pathway, consistent with incentive-driven “hallucination inevitability.” Mistral-7B exhibits an architecture-adapted variant. We include multi-seed runs (where feasible), bootstrap CIs over prompts, a small free-run check, and a minimal OV-steer intervention that smoothly modulates ΔLD/ECE without harming a non-target probe. Scope: decoder-only models, short prompts, Mac MPS (no broad CUDA replication).
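
The abstract reports calibration as ECE (expected calibration error) without giving the binning details. As a rough illustration only, the sketch below computes ECE with equal-width confidence bins. The bin count, the function name, and the toy inputs are assumptions made for this example and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct:     booleans, True where the chosen answer was right.
    n_bins:      assumed bin count (the paper's setting is not stated here).
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy data: four answers given at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

With this definition, a lower ECE means stated confidence tracks accuracy more closely, which is the direction of the 0.122 → 0.091 change quoted above.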

Research Motivation and Goals

Explain why hallucinations arise from the training objective and evaluation setup in large language models.
Show how pretraining errors reduce to Is-It-Valid classification and derive lower bounds on hallucination rates.
Analyze post-training persistence of hallucinations under current benchmarks.
Propose a socio-technical mitigation: rethink benchmark scoring so that it stops rewarding guesses over honest expressions of uncertainty.

Proposed Method

Establish a reduction from the Is-It-Valid (IIV) binary classification problem to generation, deriving a bound that relates the generative error rate to the IIV misclassification rate.
Generalize the IIV reduction to include prompts and contexts (c, r) with joint distributions and threshold-based classifiers.
Characterize factors driving base-model errors (arbitrary facts, poor models, calibration) and relate them to hallucination inevitability.
Provide theoretical results (Theorems 1–4) that connect pretraining and post-training dynamics to hallucination rates under various settings.
Discuss calibration measures and how cross-entropy training typically yields a small miscalibration parameter delta.
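
To make the reduction concrete, the sketch below works through a toy, prompt-free case. Everything in it is an assumption made for illustration rather than a quotation from the paper: the universe of valid strings and error strings, the IIV test distribution (half valid samples drawn from the true distribution p, half errors drawn uniformly), the classifier that labels x as valid when the model probability p_hat(x) exceeds 1/|E|, and the particular inequality checked, namely generative error >= 2 * IIV error - |V|/|E| - delta, where delta is the gap between p_hat and p on the above-threshold set.

```python
from fractions import Fraction


def iiv_bound_check(p, p_hat, valid, errors):
    """Return (generative error, IIV error, delta, lower bound) on a toy universe.

    p:      true distribution over valid strings (dict: string -> Fraction).
    p_hat:  model distribution over all strings (dict: string -> Fraction).
    valid:  list of valid strings V; errors: list of error strings E.
    """
    threshold = Fraction(1, len(errors))
    # Thresholded classifier: predict "valid" when p_hat(x) > 1/|E|.
    above = {x for x in p_hat if p_hat[x] > threshold}

    # Generative error: probability mass the model places on errors.
    gen_err = sum(p_hat[x] for x in errors)

    # IIV error under the assumed 50/50 mixture of valid and uniform-error examples.
    missed_valid = sum(p[x] for x in valid if x not in above)
    accepted_errors = Fraction(len(above & set(errors)), len(errors))
    iiv_err = (missed_valid + accepted_errors) / 2

    # Miscalibration at the threshold: |p_hat(A) - p(A)| for the above-threshold set A.
    delta = abs(sum(p_hat[x] for x in above) - sum(p.get(x, 0) for x in above))

    lower_bound = 2 * iiv_err - Fraction(len(valid), len(errors)) - delta
    return gen_err, iiv_err, delta, lower_bound


# Hypothetical toy universe: 2 valid strings, 8 error strings.
valid = ["v1", "v2"]
errors = [f"e{i}" for i in range(1, 9)]
p = {"v1": Fraction(1, 2), "v2": Fraction(1, 2)}
p_hat = {"v1": Fraction(1, 2), "v2": Fraction(1, 10), "e1": Fraction(3, 10)}
p_hat.update({e: Fraction(1, 70) for e in errors[1:]})

gen_err, iiv_err, delta, lower_bound = iiv_bound_check(p, p_hat, valid, errors)
```

Exact fractions are used so the comparison between the generative error and the lower bound is not blurred by floating-point rounding. In this toy case the bound is loose; the point is only to show which quantities enter it.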

Experimental Results

Research Questions

RQ1: Why are hallucinations statistically inevitable given standard pretraining objectives and training data?
RQ2: How do prompts and evaluation regimes affect the relationship between true errors and generated hallucinations?
RQ3: What are the statistical factors that drive base-model errors, and how do they translate into hallucination rates?
RQ4: How can benchmark design and scoring be modified to reduce the reinforcement of uncertain or guessed outputs?
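
RQ4 can be made concrete with a small expected-score calculation. The sketch below is a hypothetical grading rule, not one taken from this summary: a correct answer earns 1 point, abstaining earns 0, and a wrong answer costs `wrong_penalty` points. Under plain binary grading (`wrong_penalty = 0`) guessing never scores below abstaining, while a penalty of t / (1 - t) makes answering worthwhile only when the model's confidence exceeds t.

```python
def expected_score(confidence, wrong_penalty):
    """Expected points for answering: +1 if right, -wrong_penalty if wrong."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty


def should_answer(confidence, wrong_penalty):
    """Answer only if the expected score beats abstaining, which scores 0."""
    return expected_score(confidence, wrong_penalty) > 0.0


def penalty_for_threshold(t):
    """Penalty that makes confidence t the break-even point (assumes 0 <= t < 1)."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must be in [0, 1)")
    return t / (1.0 - t)


# Binary grading: even a 30%-confident guess has positive expected score.
guess_under_binary = should_answer(0.3, wrong_penalty=0.0)

# Penalized grading with a hypothetical break-even confidence of 0.75.
penalty = penalty_for_threshold(0.75)
guess_under_penalty = should_answer(0.3, wrong_penalty=penalty)
confident_under_penalty = should_answer(0.9, wrong_penalty=penalty)
```

The design choice illustrated here is that the incentive to guess comes from the scoring rule itself, so changing the penalty changes the optimal behavior without changing the model.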

Key Findings

Hallucination rates after pretraining are lower-bounded by a function of the fraction of singleton facts in training data and the size of the error set, implying inevitability under realistic data.
Post-training benchmarks that penalize uncertainty help perpetuate hallucinations, as models optimize for test-taking rather than abstaining when uncertain.
A calibrated base model can still exhibit hallucinations under standard cross-entropy objectives, and a small miscalibration parameter (delta) generally accompanies optimal loss.
Prompted reductions extend the IIV framework and yield analogous bounds, showing the ubiquity of the phenomenon across dialogue settings.
The work connects supervised binary-classification misclassification with generative errors, offering a novel reduction that does not rely on Transformer specifics.
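
The first finding refers to the fraction of singleton facts in the training data. As a minimal sketch, assuming a "singleton" means a fact that appears exactly once and that the fraction is taken over the number of training examples, it could be estimated as follows. The function name and the toy corpus are hypothetical.

```python
from collections import Counter


def singleton_rate(training_facts):
    """Share of training examples whose fact appears exactly once in the data."""
    if not training_facts:
        raise ValueError("training_facts must be non-empty")
    counts = Counter(training_facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(training_facts)


# Toy corpus of 10 fact mentions; "b", "c", and "e" each occur once.
toy_facts = ["a", "a", "b", "c", "d", "d", "d", "e", "f", "f"]
toy_rate = singleton_rate(toy_facts)
```

The intuition behind using this quantity is that a fact seen only once gives the model little basis for telling it apart from a plausible but wrong alternative.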


This review was generated by AI and reviewed by a human editor.