[Paper Review] Layer-0 Suppressors Ground Hallucination Inevitability: A Mechanistic Account of How Transformers Trade Factuality for Hedging
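This paper argues that language-model hallucinations are inevitable under standard pretraining and evaluation regimes, ties them to binary classification errors, and calls for socio-technical changes to benchmark scoring in order to curb them.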
Layer-0 “suppressor” heads explain why LMs trade factuality for hedging. In GPT-2 Medium, ablating heads {0:2, 0:4, 0:7} increases logit-difference (ΔLD) by 0.40–0.85 across four single-token probes and improves calibration (ECE 0.122 → 0.091). Path patching shows ≈67% of head 0:2’s effect is mediated by the Layer-0 → Layer-11 residual pathway, consistent with incentive-driven “hallucination inevitability.” Mistral-7B exhibits an architecture-adapted variant. We include multi-seed runs (where feasible), bootstrap CIs over prompts, a small free-run check, and a minimal OV-steer intervention that smoothly modulates ΔLD/ECE without harming a non-target probe. Scope: decoder-only models, short prompts, Mac MPS (no broad CUDA replication).
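The abstract reports two metrics, logit difference and expected calibration error (ECE), without spelling them out. The following is a minimal sketch of how they are commonly computed, not the authors' code. It assumes equal-width confidence bins for ECE; the function names, the bin count, and the toy inputs are illustrative assumptions.

```python
def logit_difference(logits, correct_id, distractor_id):
    """Logit of the correct token minus logit of a distractor token.

    `logits` is any indexable sequence of per-token scores (assumed input).
    """
    return logits[correct_id] - logits[distractor_id]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    Weighted mean over bins of |accuracy - mean confidence|, where each
    bin's weight is the fraction of predictions that fall into it.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Toy inputs, not data from the paper.
    print(logit_difference([2.0, 0.5, -1.0], correct_id=0, distractor_id=1))
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

Under this definition a lower ECE means stated confidence tracks observed accuracy more closely, which is the direction of the 0.122 → 0.091 change quoted above.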
Motivation and Objectives
- Explain why hallucinations arise from the training objective and evaluation setup in large language models.
- Show how pretraining errors reduce to Is-It-Valid classification and derive lower bounds on hallucination rates.
- Analyze post-training persistence of hallucinations under current benchmarks.
- Propose a socio-technical mitigation: rethink benchmark scoring so that it stops rewarding guesses over honest expressions of uncertainty.
Proposed Method
- Establish a reduction from the Is-It-Valid (IIV) binary classification problem to generation, deriving a bound between generative error rate and IIV misclassification rate.
- Generalize the IIV reduction to include prompts and contexts (c, r) with joint distributions and threshold-based classifiers.
- Characterize factors driving base-model errors (arbitrary facts, poor models, calibration) and relate them to hallucination inevitability.
- Provide theoretical results (Theorems 1–4) that connect pretraining and post-training dynamics to hallucination rates under various settings.
- Discuss calibration measures and how cross-entropy training implies a small miscalibration parameter (delta) under typical training.
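The reduction in the list above turns a generative model into a binary Is-It-Valid classifier by thresholding the probability it assigns to a candidate output. The sketch below illustrates that idea under stated assumptions: the threshold is left as a free parameter, the examples are weighted equally, and the probability table and labels are hypothetical toy values rather than anything from the paper.

```python
def iiv_classify(prob, threshold):
    """Label a candidate output as valid when the model's probability
    for it exceeds the threshold (threshold choice is an assumption)."""
    return prob > threshold


def iiv_error_rate(model_probs, labeled_examples, threshold):
    """Fraction of labeled examples the thresholded model misclassifies.

    model_probs:      dict mapping candidate text -> model probability
    labeled_examples: list of (text, is_valid) pairs, equally weighted
    """
    if not labeled_examples:
        raise ValueError("need at least one labeled example")
    mistakes = 0
    for text, is_valid in labeled_examples:
        # Candidates the model never scored are treated as probability 0.
        predicted_valid = iiv_classify(model_probs.get(text, 0.0), threshold)
        if predicted_valid != is_valid:
            mistakes += 1
    return mistakes / len(labeled_examples)


if __name__ == "__main__":
    # Hypothetical probabilities over four candidate answers.
    probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
    examples = [("a", True), ("b", True), ("c", False), ("d", False)]
    print(iiv_error_rate(probs, examples, threshold=0.2))
    print(iiv_error_rate(probs, examples, threshold=0.1))
```

The point of the reduction is that a model which puts noticeable probability on invalid outputs cannot also be a good classifier of validity at that threshold, so classification error and generative error are tied together.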
Experimental Results
Research Questions
- RQ1 Why are hallucinations statistically inevitable given standard pretraining objectives and training data?
- RQ2 How do prompts and evaluation regimes affect the relationship between true errors and generated hallucinations?
- RQ3 What are the statistical factors that drive base-model errors, and how do they translate into hallucination rates?
- RQ4 How can benchmark design and scoring be modified to reduce the reinforcement of uncertain or guessed outputs?
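RQ4 concerns how scoring rules shape a model's incentive to guess. The sketch below illustrates one possible scheme, offered as an assumption rather than as the scoring rule described in this review: a correct answer earns 1 point, abstaining earns 0, and a wrong answer costs t/(1-t) points for a stated confidence target t. With that penalty, answering has positive expected score only when the model's chance of being right exceeds t. All function names are hypothetical.

```python
def score_response(outcome, t):
    """Score one response under a confidence target t in [0, 1).

    Assumed scheme: +1 for correct, 0 for abstaining, -t/(1-t) for wrong.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must be in [0, 1)")
    if outcome == "correct":
        return 1.0
    if outcome == "abstain":
        return 0.0
    if outcome == "wrong":
        return -t / (1.0 - t)
    raise ValueError(f"unknown outcome: {outcome!r}")


def expected_score_if_answering(p_correct, t):
    """Expected score of answering when the answer is right with
    probability p_correct."""
    return (p_correct * score_response("correct", t)
            + (1.0 - p_correct) * score_response("wrong", t))


def should_answer(p_correct, t):
    """Answer only if doing so beats the abstention score of 0."""
    return expected_score_if_answering(p_correct, t) > 0.0


if __name__ == "__main__":
    print(should_answer(0.9, t=0.75))   # confident enough to answer
    print(should_answer(0.5, t=0.75))   # better to abstain
    print(should_answer(0.1, t=0.0))    # plain binary grading: guessing pays
```

Setting t = 0 recovers ordinary right-or-wrong grading, where a wrong answer costs nothing and even a low-probability guess has positive expected score.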
Key Findings
- Hallucination rates after pretraining are lower-bounded by a function of the fraction of singleton facts in training data and the size of the error set, implying inevitability under realistic data.
- Post-training benchmarks that penalize uncertainty help perpetuate hallucinations, as models optimize for test-taking rather than abstaining when uncertain.
- A calibrated base model can still hallucinate under standard cross-entropy objectives, and near-optimal loss generally comes with only a small miscalibration parameter (delta).
- Prompted reductions extend the IIV framework and yield analogous bounds, showing the ubiquity of the phenomenon across dialogue settings.
- The work connects supervised binary-classification misclassification with generative errors, offering a novel reduction that does not rely on Transformer specifics.
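The first finding above depends on the fraction of singleton facts in the training data. The following minimal sketch shows one way such a fraction could be measured. It assumes facts are represented as hashable items and that the denominator is the total number of observations; both choices, along with the toy data, are illustrative assumptions.

```python
from collections import Counter


def singleton_rate(facts):
    """Fraction of observations whose fact appears exactly once.

    `facts` is a list of hashable items, one per training observation.
    The denominator is the total number of observations (an assumption;
    dividing by the number of distinct facts is another possible reading).
    """
    if not facts:
        raise ValueError("need at least one observation")
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(facts)


if __name__ == "__main__":
    # Toy corpus: "a" is seen twice, the other three facts once each.
    print(singleton_rate(["a", "a", "b", "c", "d"]))
```

A high singleton rate means many facts were seen only once, which is the regime in which the finding says a pretrained model cannot avoid some errors.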
This review was generated by AI and checked by a human editor.