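[Paper Review] Epistemic Observability in Language Models
This paper proves that text-only observation prevents reliable verification of epistemic honesty in LLMs. It proposes a tensor interface that exports inference signals (per-token entropy and log-probabilities), enabling effective verification and providing a practical cost table for budgeting verification resources.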
We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36 where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5–3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman ρ = 0.762). The core contribution is a cost surface where the empirical mapping from verification budget (fraction of queries receiving expensive checks) to detection accuracy for each judge strategy is a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.
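As a reading aid for the AUC figures in the abstract, the following is a minimal sketch of how an AUC below 0.5 arises. It computes AUC as the probability that a randomly chosen correct answer receives a higher confidence score than a randomly chosen fabricated one. The function name `auc` and the toy data are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative.

    Ties count as half a win. `labels` uses 1 for positive, 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not results from the paper).
# Label 1 = the answer was correct, 0 = the answer was fabricated.
confidence = [0.95, 0.90, 0.85, 0.60, 0.55, 0.40]
correct = [0, 0, 1, 0, 1, 1]

raw_auc = auc(confidence, correct)
# Negating the score reverses the ranking, so the AUC becomes 1 - raw_auc.
flipped_auc = auc([-c for c in confidence], correct)
```

An AUC below 0.5 means the score ranks fabricated answers above correct ones more often than chance would, which is the sense in which self-reported confidence is inversely related to accuracy.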
Research Motivation and Objectives
- Demonstrate that self-reported confidence inversely correlates with accuracy across multiple model families.
- Prove that text-only observation is insufficient to distinguish grounded outputs from fabrications under bounded supervision.
- Propose a tensor interface that exports internal inference signals to enable epistemic verification.
- Show empirically that per-token entropy improves detection performance and generalizes across architectures.
- Provide a practical verification cost surface to guide system designers in allocating verification resources.
Proposed Method
- Formally define an observation model and prove representational impossibility for text-only supervision under ambiguous world states.
- Introduce a tensor interface that exports per-token entropy, log-probabilities, and provenance markers alongside text.
- Evaluate four judge strategies (no judge, text-only, tensor-guided, composed) at 10%, 20%, and 30% verification budgets across four architectures.
- Measure detector performance using AUC metrics for entropy-based signals and compare to text-based baselines.
- Analyze the cost of verification via the composition graph and discuss resource implications.
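The signals exported by the tensor interface can be illustrated with a small sketch. Per-token entropy here is the Shannon entropy of the next-token distribution, $H_t = -\sum_v p_t(v) \log p_t(v)$, computed from raw logits. The `TokenSignal` record and the `export_signals` function are hypothetical names introduced for this example; the paper's actual interface, including its provenance markers, is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector.

    Assumes all logits are finite.
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)


@dataclass
class TokenSignal:
    """Hypothetical per-token record exported alongside the output text."""
    token_id: int
    logprob: float  # log-probability of the token that was emitted
    entropy: float  # entropy of the full next-token distribution


def export_signals(step_logits: Sequence[Sequence[float]],
                   chosen_ids: Sequence[int]) -> List[TokenSignal]:
    """Pair each emitted token with its log-probability and entropy."""
    signals = []
    for logits, tok in zip(step_logits, chosen_ids):
        logp = log_softmax(logits)
        entropy = -sum(math.exp(lp) * lp for lp in logp)
        signals.append(TokenSignal(tok, logp[tok], entropy))
    return signals


# Toy 4-token vocabulary: one uncertain step, one confident step.
signals = export_signals([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]], [2, 0])
```

In this toy case the uniform step has entropy ln 4 (the maximum for four tokens), while the peaked step has entropy close to zero. A judge could aggregate such per-token values, for example by averaging them over a response, though the aggregation used in the paper is not specified in this review.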

Experimental Results
Research Questions
- RQ1: Can a predictor-centric, text-only policy satisfy epistemic honesty when world states are ambiguous?
- RQ2: Does exporting tensor-level signals alongside text enable reliable epistemic verification under bounded budgets?
- RQ3: How do different verification strategies compare in detection performance across model families?
- RQ4: What is the practical verification cost and how should it inform system design?
Key Findings
- Self-reported confidence is highest when models fabricate: as a predictor of correctness it scores an AUC between 0.28 and 0.36 across four model families, below the 0.5 chance level.
- Under text-only observation, epistemic honesty cannot be guaranteed for ambiguous world states under bounded verification budgets (Theorem 1 and Theorem 2).
- Per-token entropy achieves pooled AUC of 0.757, outperforming text baselines by 2.5–3.9 percentage points at 10%, 20%, and 30% budgets.
- The entropy signal exported by the tensor interface generalizes across architectures (Spearman ρ = 0.762).
- A practical cost surface maps verification budget to detection accuracy for different judge strategies, guiding resource allocation.
- Entropy-based signals resist manipulation because they are tied to underlying computation, not solely to training-influenced text patterns.
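The idea of a cost surface, a lookup from verification budget to detection performance, can be sketched as follows. This is an illustration under stated assumptions: queries are ranked by a single uncertainty score (such as mean per-token entropy), the top fraction allowed by the budget receives the expensive check, and performance is measured as the share of fabrications that land in the checked set. The function names, the metric, and the toy data are all hypothetical and are not the paper's definitions or results.

```python
import math


def select_for_verification(scores, budget):
    """Indices of the highest-scoring queries, limited to a budget fraction."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction between 0 and 1")
    # Small epsilon guards against float error such as 0.3 * 10 -> 2.999...
    k = math.floor(budget * len(scores) + 1e-9)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def fabrication_recall(scores, fabricated, budget):
    """Share of fabricated answers that fall inside the verified set."""
    total = sum(fabricated)
    if total == 0:
        return 0.0
    flagged = select_for_verification(scores, budget)
    caught = sum(1 for i in flagged if fabricated[i])
    return caught / total


def cost_surface(scores, fabricated, budgets=(0.1, 0.2, 0.3)):
    """Map each verification budget to the recall achieved at that budget."""
    return {b: fabrication_recall(scores, fabricated, b) for b in budgets}


# Toy data, invented for illustration only (not results from the paper).
mean_entropy = [2.1, 0.3, 1.8, 0.2, 0.9, 1.5, 0.4, 0.1, 1.2, 0.6]
fabricated = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

surface = cost_surface(mean_entropy, fabricated)
```

With this toy data, raising the budget from 10% to 20% catches a second fabrication, while going to 30% adds nothing because the third-ranked query is not fabricated. A system builder would read a real table of this kind the same way: find the budget beyond which additional checks stop paying off for a given judge strategy.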

This review was generated by AI and checked by a human editor.