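[Paper Explained] Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
This paper argues that, under standard pretraining and evaluation regimes, language-model hallucinations are inevitable. It ties them to binary-classification errors and advocates a socio-technical change: revising how benchmarks are scored so that hallucinations are reduced.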
Layer-0 “suppressor” heads explain why LMs trade factuality for hedging. In GPT-2 Medium, ablating heads {0:2, 0:4, 0:7} increases logit-difference (ΔLD) by 0.40–0.85 across four single-token probes and improves calibration (ECE 0.122 → 0.091). Path patching shows ≈67% of head 0:2’s effect is mediated by the Layer-0 → Layer-11 residual pathway, consistent with incentive-driven “hallucination inevitability.” Mistral-7B exhibits an architecture-adapted variant. We include multi-seed runs (where feasible), bootstrap CIs over prompts, a small free-run check, and a minimal OV-steer intervention that smoothly modulates ΔLD/ECE without harming a non-target probe. Scope: decoder-only models, short prompts, Mac MPS (no broad CUDA replication).
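The abstract reports a logit difference (ΔLD) and an expected calibration error (ECE) without defining either here. As a minimal sketch, assuming the common definitions (logit of the correct token minus logit of a competing token, and equal-width binned ECE), the two quantities could be computed as follows. The function names, the 10-bin default, and the inputs are illustrative assumptions, not the paper's code.

```python
def logit_difference(logits, correct_id, incorrect_id):
    """Logit of the correct token minus logit of a competing token (assumed definition)."""
    return logits[correct_id] - logits[incorrect_id]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin by its confidence; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Under these definitions, ablating a head and recomputing both quantities on the same prompts would give the before/after comparison the abstract describes; how the paper aggregates over prompts and seeds is not specified in this summary.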
Research Motivation and Goals
- Explain why hallucinations arise from the training objective and evaluation setup in large language models.
- Show how pretraining errors reduce to Is-It-Valid classification and derive lower bounds on hallucination rates.
- Analyze post-training persistence of hallucinations under current benchmarks.
- Propose a socio-technical mitigation: rethink benchmark scoring so that it stops rewarding guesses over honest expressions of uncertainty.
Proposed Method
- Establish a reduction between generation and the Is-It-Valid (IIV) binary classification problem, deriving a lower bound on the generative error rate in terms of the IIV misclassification rate.
- Generalize the IIV reduction to prompted settings, where each example is a pair (c, r) of a prompt (context) c and a response r drawn from a joint distribution, using threshold-based classifiers.
- Characterize factors driving base-model errors (arbitrary facts, poor models, calibration) and relate them to hallucination inevitability.
- Provide theoretical results (Theorems 1–4) that connect pretraining and post-training dynamics to hallucination rates under various settings.
- Discuss calibration measures and why cross-entropy training typically yields a small miscalibration parameter δ.
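The IIV reduction in the list above can be made concrete on a toy universe. The sketch below assumes the unprompted setting and reads the bound as err ≥ 2·err_iiv − |V|/|E| − δ, where V is the set of valid outputs, E the set of errors, and δ the miscalibration at the threshold 1/|E|. This form is our reading of the result and should be checked against the original paper; the function name `iiv_report` and all probabilities are hypothetical.

```python
from fractions import Fraction


def iiv_report(p_true, p_model, valid, errors):
    """Compare the generative error rate with the IIV-based lower bound (toy setting)."""
    # Threshold classifier: label x as valid ("+") iff p_model(x) > 1/|E|.
    threshold = Fraction(1, len(errors))
    above = {x for x, q in p_model.items() if q > threshold}

    # Generative error rate: probability mass the model places on errors.
    err = sum(p_model.get(x, 0) for x in errors)

    # IIV data: half valid examples drawn from p_true (label +),
    # half errors drawn uniformly from E (label -).
    missed_valid = sum(p_true.get(x, 0) for x in valid if x not in above)
    accepted_errors = Fraction(sum(1 for x in errors if x in above), len(errors))
    err_iiv = (missed_valid + accepted_errors) / 2

    # Miscalibration on the set A = {x : p_model(x) > 1/|E|}.
    delta = abs(sum(p_model[x] for x in above) - sum(p_true.get(x, 0) for x in above))

    bound = 2 * err_iiv - Fraction(len(valid), len(errors)) - delta
    return {"err": err, "err_iiv": err_iiv, "delta": delta, "bound": bound}


# Hypothetical toy universe: 2 valid outputs, 8 possible errors.
valid = ["v1", "v2"]
errors = [f"e{i}" for i in range(1, 9)]
p_true = {"v1": Fraction(1, 2), "v2": Fraction(1, 2)}
p_model = {"v1": Fraction(1, 2), "v2": Fraction(1, 10), "e1": Fraction(1, 5)}
p_model.update({e: Fraction(1, 35) for e in errors[1:]})

report = iiv_report(p_true, p_model, valid, errors)
print(report)
```

Exact fractions are used so the comparison is free of floating-point noise. In this toy case the model's error mass sits above the bound, which is what the inequality requires; the example illustrates the quantities involved and is not a proof.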
Experimental Results
Research Questions
- RQ1: Why are hallucinations statistically inevitable given standard pretraining objectives and training data?
- RQ2: How do prompts and evaluation regimes affect the relationship between true errors and generated hallucinations?
- RQ3: What are the statistical factors that drive base-model errors, and how do they translate into hallucination rates?
- RQ4: How can benchmark design and scoring be modified so that they stop reinforcing guessed outputs when the model is uncertain?
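The incentive behind RQ4 can be illustrated with a small expected-score calculation. The sketch below assumes a grader that awards 1 point for a correct answer, 0 for abstaining, and subtracts a penalty for a wrong answer. With no penalty, answering never scores below abstaining in expectation, so guessing is rewarded. With a penalty of t/(1−t), answering only pays off when confidence exceeds t. The function names and the threshold value are hypothetical choices for illustration.

```python
def expected_score(confidence, wrong_penalty):
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return confidence * 1.0 - (1.0 - confidence) * wrong_penalty


def best_action(confidence, wrong_penalty):
    """Answer only when the expected score beats the 0 points earned by abstaining."""
    return "answer" if expected_score(confidence, wrong_penalty) > 0.0 else "abstain"


def penalty_for_threshold(t):
    """Penalty that makes answering break even exactly at confidence t."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must be in [0, 1)")
    return t / (1.0 - t)


# Binary grading (no penalty): even a 30%-confident guess is worth making.
print(best_action(0.3, wrong_penalty=0.0))
# Penalized grading with a hypothetical confidence target of 0.75.
print(best_action(0.3, wrong_penalty=penalty_for_threshold(0.75)))
print(best_action(0.9, wrong_penalty=penalty_for_threshold(0.75)))
```

The break-even point follows from c − (1 − c)·t/(1 − t) > 0, which holds exactly when c > t.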
Key Findings
- Hallucination rates after pretraining are lower-bounded by a function of the fraction of singleton facts in training data and the size of the error set, implying inevitability under realistic data.
- Post-training benchmarks that penalize uncertainty help perpetuate hallucinations, as models optimize for test-taking rather than abstaining when uncertain.
- Even a well-calibrated base model can hallucinate under the standard cross-entropy objective, and a small miscalibration parameter δ is what one generally expects when the loss is at or near its optimum.
- Prompted reductions extend the IIV framework and yield analogous bounds, showing the ubiquity of the phenomenon across dialogue settings.
- The work connects supervised binary-classification misclassification with generative errors, offering a novel reduction that does not rely on Transformer specifics.
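The first finding above depends on the fraction of singleton facts in the training data. As a rough sketch, assuming a singleton is a fact that appears exactly once and the rate is the number of such facts divided by the total number of training observations, it could be estimated as below. The function name and the fact strings are hypothetical.

```python
from collections import Counter


def singleton_rate(observations):
    """Fraction of training observations whose fact appears exactly once."""
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one observation")
    counts = Counter(observations)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observations)


# Hypothetical training facts: "b" and "d" each appear once among 7 observations.
facts = ["a", "a", "b", "c", "c", "c", "d"]
print(singleton_rate(facts))
```

A higher singleton rate means more facts the model has seen only once, which is the regime where the summary's lower bound on hallucination is largest.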
This explainer was generated by AI and reviewed by a human editor.