QUICK REVIEW

[Paper Digest] Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

Chong Tian, Yu Wang|arXiv (Cornell University)|Mar 16, 2026

Misinformation and Its Impacts | Cited by 0

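One-Sentence Summary

MAGIC3 brings a cross-modal consistency perspective to fake news detection in short-form videos, combining text, visual, and audio signals with uncertainty-based VLM routing to reach high accuracy at high throughput.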

ABSTRACT

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.

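Research Motivation and Goals

Motivate the detection of multimodal misinformation in short videos where each modality looks plausible on its own but the modalities are inconsistent with one another.
Characterize cross-modal consistency patterns (text-visual, text-audio, visual-audio) and identify an interpretable global consistency axis.
Develop a lightweight, interpretable detector that exposes multi-granularity consistency signals (pairwise, global, token/frame-level) together with uncertainty, to guide efficient detection.
Build a two-stage routing system that uses consistency and uncertainty to decide when to invoke heavyweight vision-language models (VLMs).
Use multi-style LLM rewrites to make text representations robust to style perturbations.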

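Proposed Method

Compute explicit cross-modal consistency with the Cross-Modal Consistency Gate (CMCG), which yields pairwise consistency scores and a global consistency score.
Derive token-level and frame-level consistency fields (CF) from cross-modal attention using the Consistency Field Estimator (CFE).
Add Temporal Cross-Modal Inconsistency (TCMI) to capture audio-video misalignment over time.
Use adversarial-aware rewrite fusion (AARF) to fuse the original text with multi-style LLM rewrites into a style-robust representation.
Use a Hierarchical Multimodal Transformer (HMT) with consistency-weighted cross-attention to produce a global video representation.
Train with contrastive-adversarial joint learning (CAJL), which combines a supervised loss, intra-modal and cross-modal contrastive losses, adversarial consistency regularization, and a consistency regularizer.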
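The idea of pairwise and global consistency scores can be illustrated with a minimal sketch. This is not the paper's CMCG, which is a learned module. As a labeled assumption, the sketch uses cosine similarity between pooled modality embeddings, rescaled to [0, 1], as the pairwise score, and an unweighted mean as the global score. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(embeddings):
    """Return (pairwise, global_score) for a dict mapping modality name -> pooled vector.

    ASSUMPTION: rescaled cosine similarity stands in for the learned gate, and the
    global score is the plain mean of the pairwise scores.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two modalities")
    pairwise = {}
    for a, b in combinations(sorted(embeddings), 2):
        # Map cosine from [-1, 1] to [0, 1] so scores read like consistency levels.
        pairwise[(a, b)] = (cosine(embeddings[a], embeddings[b]) + 1.0) / 2.0
    global_score = sum(pairwise.values()) / len(pairwise)
    return pairwise, global_score


# Toy input: text and visual agree, audio is orthogonal to both.
example = {"text": [1.0, 0.0], "visual": [1.0, 0.0], "audio": [0.0, 1.0]}
pairwise, global_score = consistency_scores(example)
```

In the toy input the text-visual pair scores highest, which mirrors the pattern the digest reports for real videos. A learned gate would replace both the similarity function and the averaging step.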

Figure 1: Illustration of cross-modal consistency patterns. In real news short videos, text, visuals, and audio are contextually aligned (Consistent). In fake news, a “semantic gap” often exists between the sensational claims (text/audio) and the actual visual content. MAGIC3 acts as a consistency

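Experimental Results

Research Questions

RQ1: Which cross-modal consistency patterns distinguish real from fake short videos?
RQ2: Can a lightweight detector expose multi-granularity consistency signals that correlate with fake probability and prediction difficulty?
RQ3: Do multi-style LLM rewrites improve robustness to style perturbations in fake news detection?
RQ4: Can uncertainty-guided routing to a heavyweight VLM reach VLM-level accuracy at higher throughput?
RQ5: How do token/frame-level consistency fields and temporal inconsistency help localize misalignment signals?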

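Key Findings

Real videos show high text-visual consistency and moderate text-audio consistency, while fake videos show the opposite pattern (high text-audio, low text-visual).
A single global consistency score correlates with prediction difficulty, and prediction errors cluster at intermediate score values.
Two-stage routing based on uncertainty and global consistency sends about 25% of samples to the VLM while maintaining competitive accuracy and substantially higher throughput.
With frozen features, MAGIC3 reaches state-of-the-art supervised performance on FakeSV and FakeTT; combined with a heavyweight VLM, it surpasses VLM-only detectors at 18-27x higher throughput.
Multi-style LLM rewriting via AARF improves robustness; removing AARF degrades performance, especially on FakeTT.
Ablation studies show that the core consistency modules (CMCG, CFE, TCMI) are critical to performance.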
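The two-stage routing described in the findings can be sketched as follows. The digest does not give the paper's routing rule, so everything concrete here is an assumption: the uncertainty measure (binary entropy of the lightweight detector's fake probability), the "route if uncertain or consistency is in a middle band" rule, the threshold values, and all function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction: 0 at p in {0, 1}, 1 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def should_route_to_vlm(p_fake, global_consistency,
                        entropy_threshold=0.9, band=(0.4, 0.6)):
    """Decide whether a sample should be escalated to the heavyweight VLM.

    ASSUMPTION: escalate when the lightweight prediction is high-entropy, or when
    the global consistency score falls in an intermediate band (where the digest
    says errors cluster). Both thresholds are illustrative, not from the paper.
    """
    uncertain = binary_entropy(p_fake) >= entropy_threshold
    ambiguous = band[0] <= global_consistency <= band[1]
    return uncertain or ambiguous


def two_stage_predict(p_fake, global_consistency, vlm_predict):
    """Return (is_fake, stage). vlm_predict is a zero-argument callable for stage two."""
    if should_route_to_vlm(p_fake, global_consistency):
        return vlm_predict(), "vlm"
    # Confident and unambiguous: accept the lightweight detector's decision.
    return p_fake >= 0.5, "light"
```

In practice the thresholds would be tuned on a validation set so that the routed fraction matches a compute budget, such as the roughly 25% of samples the digest reports.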

Figure 2: MAGIC3 Overview. Frozen encoders provide text, visual, audio, and rewrite features. The Cross-Modal Consistency Gate outputs pairwise and global consistency scores; Consistency Field Estimator converts cross-modal attention into token- and frame-level consistency fields; Temporal Cross-Mo

This digest was generated by AI and reviewed by a human editor.