QUICK REVIEW

[Paper Review] VeriTaS: The First Dynamic Benchmark for Multimodal Automated Fact-Checking

Mark Rothermel, Marcus Kornmann|arXiv (Cornell University)|Jan 13, 2026

Misinformation and Its Impacts | Cited by 0

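One-Sentence Summary

VeriTaS introduces a quarterly updated dynamic benchmark for multimodal automated fact-checking that resists data leakage, built on real-world multilingual claims with disentangled, uncertainty-aware scores and textual justifications.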

ABSTRACT

The growing scale of online misinformation urgently demands Automated Fact-Checking (AFC). Existing benchmarks for evaluating AFC systems, however, are largely limited in terms of task scope, modalities, domain, language diversity, realism, or coverage of misinformation types. Critically, they are static, thus subject to data leakage as their claims enter the pretraining corpora of LLMs. As a result, benchmark performance no longer reliably reflects the actual ability to verify claims. We introduce Verified Theses and Statements (VeriTaS), the first dynamic benchmark for multimodal AFC, designed to remain robust under ongoing large-scale pretraining of foundation models. VeriTaS currently comprises 24,000 real-world claims from 108 professional fact-checking organizations across 54 languages, covering textual and audiovisual content. Claims are added quarterly via a fully automated seven-stage pipeline that normalizes claim formulation, retrieves original media, and maps heterogeneous expert verdicts to a novel, standardized, and disentangled scoring scheme with textual justifications. Through human evaluation, we demonstrate that the automated annotations closely match human judgments. We commit to update VeriTaS in the future, establishing a leakage-resistant benchmark, supporting meaningful AFC evaluation in the era of rapidly evolving foundation models. We will make the code and data publicly available.

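Research Motivation and Goals

Address the data leakage problem of static AFC benchmarks by providing a dynamic, leakage-resistant evaluation platform.
Provide real-world multimodal claims (text, image, video) across many languages, each with an expert-level ground-truth verdict.
Provide a fine-grained, uncertainty-aware scoring scheme that disentangles media properties from veracity properties.
Automate data collection and annotation through a seven-stage pipeline to support quarterly updates through 2028.
Expose the baseline performance gap of state-of-the-art multimodal large language models on the current VeriTaS data.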

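Proposed Method

A seven-stage automated pipeline builds and annotates claims from ClaimReview: discovery, publisher verification, article scraping, media appearance retrieval, claim normalization, verdict standardization, and correction.
LLMs from the GPT-5 and Gemini families handle extraction, rewriting, and justification generation, using few-shot learning and chain-of-thought prompting.
Verdicts are disentangled into five attributes (Media Authenticity, Media Contextualization, Veracity, Context Coverage, Integrity), each scored on a scale from -1 to 1.
An ensemble of four LLMs aggregates predictions and provides justifications from multiple annotators.
Automated annotations are validated against human judgments using mean squared error (MSE) and mean absolute error (MAE).
Recent AFC systems (multimodal LLMs and AFC baselines) are benchmarked on the latest VeriTaS split, with an analysis of knowledge-cutoff effects.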
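
The five-attribute scoring, the four-annotator ensemble, and the MSE/MAE comparison against a human reference can be illustrated with a minimal Python sketch. The attribute names and the [-1, 1] range come from the summary above. Everything else is an assumption made for illustration: the class and function names are hypothetical, the ensemble is aggregated by a plain per-attribute mean (the summary does not state the aggregation rule), the errors are averaged over the five attributes of a single claim, and the numbers are toy values rather than VeriTaS data.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One annotator's scores for a claim; every attribute lies in [-1, 1]."""
    media_authenticity: float
    media_contextualization: float
    veracity: float
    context_coverage: float
    integrity: float

    def __post_init__(self):
        # Reject scores outside the scale described in the summary.
        for f in fields(self):
            value = getattr(self, f.name)
            if not -1.0 <= value <= 1.0:
                raise ValueError(f"{f.name}={value} is outside [-1, 1]")


def aggregate(verdicts):
    """Average each attribute over all annotators (assumed aggregation rule)."""
    return Verdict(**{
        f.name: mean(getattr(v, f.name) for v in verdicts)
        for f in fields(Verdict)
    })


def mse(pred, ref):
    """Mean squared error over the five attributes of one claim."""
    return mean((getattr(pred, f.name) - getattr(ref, f.name)) ** 2
                for f in fields(Verdict))


def mae(pred, ref):
    """Mean absolute error over the five attributes of one claim."""
    return mean(abs(getattr(pred, f.name) - getattr(ref, f.name))
                for f in fields(Verdict))


# Toy example: four hypothetical LLM annotators and one human reference.
annotators = [
    Verdict(1.0, -1.0, -0.5, 0.0, -1.0),
    Verdict(1.0, -0.5, -1.0, 0.0, -1.0),
    Verdict(0.5, -1.0, -1.0, 0.5, -0.5),
    Verdict(0.5, -0.5, -0.5, 0.5, -0.5),
]
human = Verdict(1.0, -1.0, -1.0, 0.0, -1.0)

ensemble = aggregate(annotators)
print(ensemble.veracity)     # -0.75
print(mse(ensemble, human))  # 0.0625
print(mae(ensemble, human))  # 0.25
```

Keeping the attributes as separate fields means disagreement on one attribute (for example, media contextualization) stays visible after aggregation instead of being folded into a single label.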

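Experimental Results

Research Questions

RQ1: Can a dynamic, quarterly updated benchmark remain robust as foundation models continue to be pretrained?
RQ2: Do real-world multilingual, multimodal claims, combined with disentangled scoring and justifications, make evaluation more realistic and reliable?
RQ3: How large is the performance gap between current multimodal LLMs and the VeriTaS verification task, especially for claims after the knowledge cutoff date?
RQ4: How well do the uncertainty-aware, graded verdict attributes correlate with human judgments of claim integrity?
RQ5: What are the practical computational and ethical considerations in maintaining a leakage-resistant AFC benchmark?

Key Findings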

Method	Retrieval	MSE (↓)	MAE (↓)	Acc. % (↑)
Gemini 2.0 Flash	-	0.74	0.71	32.1
Gemini 2.5 Flash	-	0.85	0.57	65.9
Gemini 3 Pro	-	0.55	0.37	81.9
GPT-4o	-	0.65	0.65	36.9
GPT-5.2	-	0.70	0.69	33.5
Llama 4 Maverick	-	0.97	0.74	41.8
Gemini 2.0 Flash	✓	0.73	0.57	58.0
Gemini 2.5 Flash	✓	0.68	0.48	71.2
Gemini 3 Pro	✓	0.39	0.35	74.6
GPT-4o	✓	0.65	0.50	64.2
GPT-5.2	✓	0.45	0.40	70.6
Llama 4 Maverick	✓	1.04	0.72	49.6
DEFAME (w/ GPT-5.2)	✓	0.55	0.49	60.4
Loki (w/ GPT-5.2)	✓	0.86	0.59	61.8
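
To make the table easier to read, the following Python sketch computes the change in MSE for each model when its "✓" row is compared with its "-" row. It assumes that this column marks whether retrieval is enabled and that the first numeric column is MSE; the values are copied from the table, and the function name is hypothetical.

```python
# (MSE in the "-" row, MSE in the "✓" row), copied from the table above.
mse_by_model = {
    "Gemini 2.0 Flash": (0.74, 0.73),
    "Gemini 2.5 Flash": (0.85, 0.68),
    "Gemini 3 Pro": (0.55, 0.39),
    "GPT-4o": (0.65, 0.65),
    "GPT-5.2": (0.70, 0.45),
    "Llama 4 Maverick": (0.97, 1.04),
}


def retrieval_delta(scores):
    """Return MSE(with) - MSE(without) per model; negative means improvement."""
    return {
        model: round(with_r - without_r, 2)
        for model, (without_r, with_r) in scores.items()
    }


deltas = retrieval_delta(mse_by_model)
# Print models from largest improvement to largest regression.
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{model:18s} {delta:+.2f}")
```

Under that reading, GPT-5.2 shows the largest MSE reduction (-0.25), GPT-4o is unchanged, and Llama 4 Maverick gets slightly worse (+0.07).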

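VeriTaS contains 24,000 real-world claims in 54 languages, including images and video, and is updated quarterly.
Human evaluation shows that the automated annotations closely match human judgments (MSE ≤ 0.04).
Baseline multimodal LLMs leave substantial room for improvement on the current VeriTaS data; no model comes close to perfect performance.
On the longitudinal split, the knowledge-cutoff effect significantly lowers models' MSE, which points to leakage in static benchmarks.
Among the evaluated models, Gemini 3 Pro with retrieval achieves the best MSE (0.39) but is still far from ideal.
The longitudinal, dynamic design reduces data leakage and provides a realistic benchmark framework for continued evaluation through 2028.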
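
A knowledge-cutoff analysis of this kind can be sketched by splitting scored claims at a model's cutoff date and comparing the error on each side. The Python example below is only an illustration: the record type, the function name, the cutoff date, and all scores are made-up toy values, not VeriTaS results, and the benchmark's actual splitting procedure may differ.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class ScoredClaim:
    """Hypothetical record: publication date plus predicted and reference score."""
    published: date
    predicted: float
    reference: float


def mse_by_cutoff(claims, cutoff):
    """Group squared errors by whether the claim predates the cutoff."""
    groups = {"before_cutoff": [], "after_cutoff": []}
    for claim in claims:
        key = "before_cutoff" if claim.published <= cutoff else "after_cutoff"
        groups[key].append((claim.predicted - claim.reference) ** 2)
    # An empty group has no defined MSE, so report None for it.
    return {key: (mean(errs) if errs else None) for key, errs in groups.items()}


# Toy data, invented for illustration only.
claims = [
    ScoredClaim(date(2025, 3, 10), 0.5, 1.0),
    ScoredClaim(date(2025, 6, 1), -1.0, -1.0),
    ScoredClaim(date(2025, 11, 20), 0.0, 1.0),
    ScoredClaim(date(2026, 1, 5), 0.5, -0.5),
]
result = mse_by_cutoff(claims, cutoff=date(2025, 8, 31))
print(result)  # {'before_cutoff': 0.125, 'after_cutoff': 1.0}
```

A large gap between the two groups, as in the toy output, is the pattern one would expect if pre-cutoff claims had leaked into a model's training data.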

This review was generated by AI and reviewed by human editors.