QUICK REVIEW

[Paper Review] The Alignment Bottleneck in Decomposition-Based Claim Verification

Mahmud Elahi Akhter, Federico Ruggeri|arXiv (Cornell University)|Feb 11, 2026
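Explainable Artificial Intelligence (XAI)|Cited by 0
One-Sentence Summary

This paper shows that decomposition-based claim verification helps only when sub-claims are aligned with granular evidence and the sub-claim signal is reliable; otherwise, especially when sub-claim labels are noisy, it can degrade performance. The paper also introduces a real-world dataset with temporally bounded evidence and analyzes evidence alignment and error propagation.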

ABSTRACT

Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.

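Motivation and Objectives

  • Study how sub-claim decomposition affects the verification of complex claims in real-world data.
  • Evaluate how evidence alignment (sub-claim aligned evidence vs. repeated claim-level evidence) affects verification performance.
  • Quantify how noisy sub-claim labels propagate errors to claim-level verification.
  • Provide a dataset with temporally bounded evidence and human-annotated sub-claim evidence spans for rigorous evaluation.
  • Offer guidance for designing decomposition frameworks that minimize error propagation and sub-claim verification bias.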

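Proposed Method

  • Decompose complex claims into sub-claims with associated evidence and veracity labels.
  • Evaluate claim verification under two evidence alignment setups: SAE (Sub-claim Aligned Evidence) and SRE (Repeated Claim-level Evidence).
  • Study error propagation by comparing Oracle (gold) sub-claim labels with noisy (predicted) sub-claim labels.
  • Use Qwen3-14B for claim-level verification; sub-claim verification baselines use CHEF/BERT encoders, with a graph neural network (GNN) for sub-claim veracity prediction.
  • Build and use a real-world dataset based on PHEME, with temporally bounded evidence and human-annotated sub-claim spans, plus MMM-Fact and COVID-Fact for generalization testing.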
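
The difference between the two evidence alignment setups can be illustrated with a minimal sketch. The class names, field names, and example claim below are hypothetical and are not taken from the paper's code or dataset; the sketch only assumes that SAE pairs each sub-claim with its own annotated evidence spans, while SRE repeats the full claim-level evidence for every sub-claim.

```python
from dataclasses import dataclass, field


@dataclass
class SubClaim:
    text: str
    # Evidence spans annotated for this specific sub-claim (used by SAE only).
    evidence_spans: list[str] = field(default_factory=list)


@dataclass
class Claim:
    text: str
    evidence: list[str]  # claim-level evidence passages
    sub_claims: list[SubClaim]


def build_sae_inputs(claim: Claim) -> list[tuple[str, list[str]]]:
    """SAE: each sub-claim is paired only with its own aligned spans."""
    return [(sc.text, list(sc.evidence_spans)) for sc in claim.sub_claims]


def build_sre_inputs(claim: Claim) -> list[tuple[str, list[str]]]:
    """SRE: every sub-claim is paired with the same claim-level evidence."""
    return [(sc.text, list(claim.evidence)) for sc in claim.sub_claims]


# Hypothetical example claim, invented purely for illustration.
example = Claim(
    text="The bridge closed on Monday and reopened on Friday.",
    evidence=["Officials closed the bridge Monday.", "It reopened Friday."],
    sub_claims=[
        SubClaim("The bridge closed on Monday.",
                 ["Officials closed the bridge Monday."]),
        SubClaim("The bridge reopened on Friday.",
                 ["It reopened Friday."]),
    ],
)

sae_inputs = build_sae_inputs(example)
sre_inputs = build_sre_inputs(example)
```

Under SAE each verifier call receives only the span relevant to its sub-claim, whereas under SRE both calls receive the same two passages, which is the coarse-grained setting the paper reports as unhelpful.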
Figure 1: shows our annotation and claim verification pipeline and different setups for the study. Oracle_(SAE/SRE) setups use gold sub-claim labels, ablation models do not use any sub-claim labels and noisy setup (not shown in figure) uses predicted sub-claim labels.

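Experimental Results

Research Questions

  • RQ1: Does claim decomposition improve verification when sub-claim evidence is aligned and labels are reliable?
  • RQ2: How does evidence alignment (SAE vs. SRE) affect performance across datasets and domains?
  • RQ3: How do noisy sub-claim labels affect downstream claim verification, and which error characteristics are most damaging?
  • RQ4: What are the error-propagation dynamics in sub-claim verification when evidence is fine-grained versus coarse-grained?
  • RQ5: How well do the gains from decomposition generalize across real-world and domain-specific fact-checking datasets?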

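Key Findings

  • Decomposition improves performance when evidence is aligned with sub-claims (SAE) and the sub-claim signal is reliable.
  • Repeated claim-level evidence (SRE) typically yields no improvement and may even degrade performance, particularly on MMM-Fact and COVID-Fact.
  • Noisy sub-claim labels degrade performance, and SRE is especially fragile; the stability of SAE depends on the bias of the sub-claim predictor.
  • Conservative abstention in sub-claim labeling reduces error propagation compared with aggressive but incorrect predictions.
  • Under noisy labels, GNN-based sub-claim veracity prediction lags behind zero-shot large language models, highlighting the advantage of LLMs on this task in data-limited settings.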
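
The distinction between conservative abstention and aggressive but incorrect predictions can be made concrete with a small sketch of two noise profiles. This is an illustration under stated assumptions, not the paper's protocol: the label set, the `"unverified"` abstention label, and the function names are all hypothetical, and the corruption is synthetic rather than produced by a real sub-claim predictor.

```python
import random

ABSTAIN = "unverified"  # hypothetical abstention label


def corrupt_labels(gold, rate, profile, seed=0):
    """Corrupt each gold label with probability `rate`.

    profile="abstain": a corrupted label becomes ABSTAIN (conservative).
    profile="flip":    a corrupted label becomes a confident wrong label.
    """
    if profile not in ("abstain", "flip"):
        raise ValueError(f"unknown profile: {profile!r}")
    rng = random.Random(seed)
    noisy = []
    for label in gold:
        if rng.random() >= rate:
            noisy.append(label)  # label left untouched
        elif profile == "abstain":
            noisy.append(ABSTAIN)
        else:
            noisy.append("false" if label == "true" else "true")
    return noisy


def error_profile(gold, noisy):
    """Count correct, abstained, and confidently wrong sub-claim labels."""
    counts = {"correct": 0, "abstained": 0, "wrong": 0}
    for g, n in zip(gold, noisy):
        if n == g:
            counts["correct"] += 1
        elif n == ABSTAIN:
            counts["abstained"] += 1
        else:
            counts["wrong"] += 1
    return counts


gold = ["true", "false", "true", "true"]
abstain_counts = error_profile(gold, corrupt_labels(gold, 1.0, "abstain"))
flip_counts = error_profile(gold, corrupt_labels(gold, 1.0, "flip"))
```

With the same corruption rate, the two profiles produce equally many non-gold labels, but only the `flip` profile feeds confidently wrong signals to the claim-level verifier. Separating the counts this way is one possible way to report the label bias of a sub-claim predictor alongside its accuracy.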
Table 10: Prompt templates used in our experiments. Oracle SRE uses claims with sub-claims, sub-claim veracity and claim level evidence.
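
As a rough illustration of what an Oracle SRE input combines (the claim, its sub-claims, gold sub-claim veracity, and claim-level evidence), the following sketch assembles those parts into one prompt string. The wording, layout, and function name are assumptions made for this example and do not reproduce the paper's actual templates.

```python
def format_oracle_sre_prompt(claim, sub_claims, evidence):
    """Assemble a hypothetical Oracle SRE prompt.

    claim:      the claim text
    sub_claims: list of (sub_claim_text, gold_veracity_label) pairs
    evidence:   list of claim-level evidence passages
    """
    lines = [f"Claim: {claim}", "Sub-claims:"]
    for i, (text, label) in enumerate(sub_claims, start=1):
        lines.append(f"  {i}. {text} [veracity: {label}]")
    lines.append("Evidence:")
    lines.extend(f"  - {passage}" for passage in evidence)
    lines.append("Answer with the veracity of the claim.")
    return "\n".join(lines)


prompt = format_oracle_sre_prompt(
    claim="The bridge closed on Monday and reopened on Friday.",
    sub_claims=[
        ("The bridge closed on Monday.", "true"),
        ("The bridge reopened on Friday.", "false"),
    ],
    evidence=["Officials closed the bridge Monday."],
)
```

An ablation variant in this sketch would simply omit the `[veracity: ...]` annotations, mirroring the setups that use no sub-claim labels.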


This review was generated by AI and reviewed by human editors.