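[Paper Explained] Epistemic Observability in Language Models
This paper proves that text-only observation cannot reliably verify the epistemic honesty of large language models. It proposes a tensor interface that exports inference signals (per-token entropy, log-probabilities, and so on) to make verification effective, and provides a practical cost curve for allocating a verification budget.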
We find that models report highest confidence precisely when they are fabricating. Across four model families (OLMo-3, Llama-3.1, Qwen3, Mistral), self-reported confidence inversely correlates with accuracy, with AUC ranging from 0.28 to 0.36, where 0.5 is random guessing. We prove, under explicit formal assumptions, that this is not a capability gap but an observational one. Under text-only observation, where a supervisor sees only the model's output text, no monitoring system can reliably distinguish honest model outputs from plausible fabrications. We prove two results: first, that any policy conditioning only on the query cannot satisfy epistemic honesty across ambiguous world states; second, that no learning algorithm optimizing reward from a text-only supervisor can converge to honest behavior when the supervisor's observations are identical for both grounded and fabricated responses. Within our formal model, these impossibilities hold regardless of model scale or training procedure, including RLHF and instruction tuning. We construct a tensor interface that escapes the impossibility by exporting computational byproducts (per-token entropy and log-probability distributions) that are structurally coupled to correctness under standard training. Per-token entropy achieves pooled AUC 0.757, outperforming all text baselines by 2.5 to 3.9 percentage points at every budget level tested (10%, 20%, 30%). The entropy signal generalizes across architectures (Spearman ρ = 0.762). The core contribution is a cost surface: the empirical mapping from verification budget (the fraction of queries receiving expensive checks) to detection accuracy for each judge strategy, which serves as a practical lookup for system builders deciding how to allocate verification resources. The contribution is the map. The territory is the system you are building.
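To make the AUC figures concrete, here is a minimal sketch of how an AUC below 0.5 arises when confidence runs opposite to correctness. The data are invented toy values, not results from the paper, and the labeling convention (1 = correct, 0 = fabricated) is an assumption made for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This is the pairwise (Mann-Whitney) form of AUC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: label 1 = correct answer, 0 = fabricated answer.
labels = [1, 1, 1, 0, 0, 0]
# Fabricated answers mostly carry the higher self-reported confidence.
confidence = [0.60, 0.70, 0.95, 0.90, 0.92, 0.80]

toy_auc = auc(confidence, labels)                      # below 0.5
flipped_auc = auc([1.0 - c for c in confidence], labels)  # above 0.5
print(toy_auc, flipped_auc)
```

An AUC below 0.5 means the score ranks examples in the wrong direction: negating it would give a better-than-random detector, which is what "inversely correlates" amounts to here.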
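Research Motivation and Goals
- Show that self-reported confidence is inversely correlated with accuracy across multiple model families.
- Prove that, under bounded supervision, text-only observation is insufficient to distinguish grounded outputs from fabrications.
- Propose a tensor interface that exports internal inference signals to enable epistemic verification.
- Show empirically that per-token entropy improves detection performance and generalizes across architectures.
- Provide a practical verification cost curve that helps system designers decide how to allocate resources.
Proposed Method
- Formally define an observation model and prove the representational impossibility of text-only supervision under ambiguous world states.
- Introduce a tensor interface that exports per-token entropy, log-probabilities, and provenance tags in addition to text.
- Evaluate four judge strategies (no judge, text-only, tensor-guided, combined) on four architectures at verification budgets of 10%, 20%, and 30%.
- Measure detector performance using the AUC of the entropy signal and compare it against text-based baselines.
- Analyze verification cost through combined plots and discuss the resource implications.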
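The per-token entropy signal mentioned in the method can be sketched as follows. This is an illustration under stated assumptions, not the paper's interface: the helper names are hypothetical, the logits are toy values, and averaging entropy over tokens is one possible pooling choice that the summary does not specify.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_logits):
    """Average per-token entropy over a response (assumed pooling choice)."""
    if not per_token_logits:
        raise ValueError("response has no tokens")
    entropies = [token_entropy(softmax(l)) for l in per_token_logits]
    return sum(entropies) / len(entropies)


# Hypothetical logits over a 4-word vocabulary, one row per generated token.
peaked_response = [[10.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0]]
flat_response = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.2, 0.0]]

print(mean_entropy(peaked_response))  # near zero: the model is decisive
print(mean_entropy(flat_response))    # close to ln(4): the model is unsure
```

A tensor-guided judge in this sketch would treat the higher-entropy response as the better candidate for an expensive check.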

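Experimental Results
Research Questions
- RQ1: When the world state is ambiguous, can a predictor-facing text-only policy satisfy epistemic honesty?
- RQ2: With tensor-level signal export, can reliable epistemic verification be achieved under a bounded budget?
- RQ3: How do different verification strategies differ in detection performance across model families?
- RQ4: What is the actual cost of verification, and how should it influence system design?
Key Findings
- Self-reported confidence is highest when the model is fabricating; its AUC for distinguishing fabricated from genuine answers ranges from 0.28 to 0.36 across the four model families.
- Under text-only observation, epistemic honesty cannot be guaranteed for ambiguous world states under a bounded verification budget (Theorems 1 and 2).
- Per-token entropy reaches an AUC of 0.757 on the pooled data, improving on the text baselines by 2.5 to 3.9 percentage points at the 10%, 20%, and 30% budgets.
- The tensor interface generalizes across architectures (Spearman ρ = 0.762 for the entropy signal).
- A practical cost curve maps verification budget to detection accuracy for each judge strategy, supporting resource allocation.
- Entropy-based signals are harder to manipulate because they are tied to the underlying computation rather than only to text patterns shaped by training.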
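The budget-to-detection mapping in these findings can be illustrated with a small sketch. Everything here is an assumption for illustration: the function names are hypothetical, the entropy scores and labels are invented, and "fraction of fabrications caught" stands in for whatever detection metric the paper actually reports.

```python
import math


def select_for_verification(scores, budget):
    """Return indices of the highest-scoring `budget` fraction of responses."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be a fraction between 0 and 1")
    # Small epsilon guards against float error such as 100 * 0.29 = 28.999...
    k = math.floor(len(scores) * budget + 1e-9)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def fabrications_caught(scores, is_fabricated, budget):
    """Fraction of fabricated responses that land inside the checked set."""
    total = sum(is_fabricated)
    if total == 0:
        return 0.0
    chosen = select_for_verification(scores, budget)
    return sum(1 for i in chosen if is_fabricated[i]) / total


# Hypothetical mean-entropy scores for ten responses; 1 marks a fabrication.
entropy = [0.2, 1.5, 0.3, 1.15, 0.1, 1.2, 0.4, 0.25, 1.1, 0.35]
is_fabricated = [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]

# One row of a toy cost curve: budget -> fraction of fabrications caught.
cost_curve = {b: fabrications_caught(entropy, is_fabricated, b)
              for b in (0.1, 0.2, 0.3)}
print(cost_curve)
```

Repeating this for each judge strategy and each budget would give a table of the kind the findings describe, which a system builder could read as a lookup when deciding how much verification to pay for.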

This summary was generated by AI and reviewed by a human editor.