QUICK REVIEW

[Paper Review] Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

Chong Tian, Yu Wang|arXiv (Cornell University)|Mar 16, 2026

Misinformation and Its Impacts | Citations: 0

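One-Sentence Summary

MAGIC3 introduces a cross-modal consistency lens for fake news in short-form videos, combining text-visual-audio consistency signals with uncertainty-based VLM routing to achieve high throughput and strong accuracy.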

ABSTRACT

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.

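Motivation and Objectives

Motivate multimodal misinformation detection in short-form videos, where each modality looks plausible on its own yet the modalities may be inconsistent with one another.
Characterize cross-modal consistency patterns across text-visual, text-audio, and visual-audio pairs, and identify an interpretable global consistency axis.
Develop a lightweight, interpretable detector that exposes consistency signals (pairwise, global, and token/frame-level) together with uncertainty to guide efficient detection.
Build a two-stage routing system that uses consistency and uncertainty to decide when to invoke a heavyweight vision-language model (VLM).
Improve text representations and robustness to stylistic perturbation through multi-style LLM rewrites.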

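Proposed Method

Compute explicit cross-modal consistency with the Cross-Modal Consistency Gate (CMCG), yielding pairwise and global consistency scores.
Derive token- and frame-level consistency fields from cross-modal attention with the Consistency Field Estimator (CFE).
Incorporate Temporal Cross-Modal Inconsistency (TCMI) to capture audio-visual misalignment over time.
Fuse the original text with multi-style LLM rewrites via Adversarial-Aware Rewrite Fusion (AARF) to obtain style-robust representations.
Produce a global video representation with a Hierarchical Multimodal Transformer (HMT) equipped with consistency-weighted cross-attention.
Train with CAJL (Contrastive-Adversarial Joint Learning), which combines a supervised loss, contrastive losses (intra- and cross-modal), adversarial consistency regularization, and consistency regularization.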
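
To make the idea of pairwise and global consistency scores concrete, here is a minimal sketch. It is not the paper's CMCG, which is a learned module whose details are not given in this review. As a stand-in, it assumes each modality has already been pooled into a single embedding vector, uses plain cosine similarity for each pair, and takes the mean of the three pairwise scores as a global score. The names `cosine` and `consistency_scores` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(text, visual, audio):
    """Return pairwise consistency scores and their mean as a global score.

    Assumption: inputs are pooled per-modality embeddings of equal length.
    Cosine similarity and mean aggregation are illustrative choices only,
    not the learned gate described in the paper.
    """
    pairwise = {
        "text_visual": cosine(text, visual),
        "text_audio": cosine(text, audio),
        "visual_audio": cosine(visual, audio),
    }
    global_score = sum(pairwise.values()) / len(pairwise)
    return {**pairwise, "global": global_score}


# Toy embeddings: text agrees with visuals but is orthogonal to audio.
scores = consistency_scores(text=[1.0, 0.0], visual=[1.0, 0.0], audio=[0.0, 1.0])
print(scores)
```

In this toy case the text-visual score is high while the two audio-related scores are zero, so the global score sits in between. A learned gate would replace the fixed cosine and mean with trainable functions.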

Figure 1: Illustration of cross-modal consistency patterns. In real news short videos, text, visuals, and audio are contextually aligned (Consistent). In fake news, a “semantic gap” often exists between the sensational claims (text/audio) and the actual visual content. MAGIC3 acts as a consistency

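Experimental Results

Research Questions

RQ1: Which cross-modal consistency patterns distinguish real from fake videos?
RQ2: Can a lightweight detector expose multi-granularity consistency signals that correlate with fake probability and prediction difficulty?
RQ3: Does incorporating multi-style LLM rewrites improve robustness to stylistic perturbation in fake news detection?
RQ4: Can uncertainty-aware routing to a heavyweight VLM reach VLM-level accuracy while substantially increasing throughput?
RQ5: How do token/frame-level consistency fields and temporal inconsistency contribute to localizing misalignment signals?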

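Key Findings

Real videos show high text-visual consistency and moderate text-audio consistency, whereas fake videos show the opposite pattern (high text-audio, low text-visual).
A single global consistency score correlates with prediction difficulty, with prediction errors clustering at intermediate values.
Two-stage routing based on uncertainty and global consistency sends about 25% of samples to the VLM while retaining competitive accuracy and far higher throughput.
Using frozen (pre-extracted) features, MAGIC3 outperforms the strongest non-VLM baselines on FakeSV and FakeTT; paired with a heavyweight VLM, the two-stage system matches VLM-level accuracy at 18-27x higher throughput and 93% VRAM savings.
Multi-style LLM rewrites fused via AARF improve robustness; removing AARF causes a marked performance drop, especially on FakeTT.
Ablation studies show that the core consistency modules (CMCG, CFE, TCMI) are important for performance.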
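
The two-stage routing idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' routing rule: the review does not give the uncertainty measure or the thresholds. The sketch assumes uncertainty is the binary entropy of the lightweight detector's fake probability, and it routes a sample to the VLM when that entropy is high or when the global consistency score falls in an intermediate band (where errors are reported to cluster). The function names, the `entropy_threshold` and `band` defaults, and the `vlm_predict` callable are all hypothetical.

```python
import math


def binary_entropy(p_fake):
    """Entropy in bits of a Bernoulli prediction: 1.0 at p=0.5, 0.0 at p=0 or p=1."""
    if p_fake <= 0.0 or p_fake >= 1.0:
        return 0.0
    return -(p_fake * math.log2(p_fake) + (1.0 - p_fake) * math.log2(1.0 - p_fake))


def should_route_to_vlm(p_fake, global_consistency,
                        entropy_threshold=0.9, band=(0.4, 0.6)):
    """Decide whether to escalate a sample to the heavyweight VLM.

    Assumption: escalate when the lightweight prediction is uncertain OR the
    global consistency is in an ambiguous middle band. Both thresholds are
    placeholder values chosen for illustration.
    """
    uncertain = binary_entropy(p_fake) >= entropy_threshold
    ambiguous = band[0] <= global_consistency <= band[1]
    return uncertain or ambiguous


def two_stage_predict(p_fake, global_consistency, vlm_predict):
    """Return (is_fake, stage). `vlm_predict` is a zero-argument callable
    standing in for an expensive VLM call; it runs only for routed samples."""
    if should_route_to_vlm(p_fake, global_consistency):
        return vlm_predict(), "vlm"
    return p_fake >= 0.5, "lightweight"


# A confident, clearly consistent sample stays in the lightweight stage.
print(two_stage_predict(0.02, 0.9, vlm_predict=lambda: True))
# A borderline sample is escalated to the (stubbed) VLM.
print(two_stage_predict(0.5, 0.9, vlm_predict=lambda: True))
```

In practice the thresholds would be tuned on validation data so that the routed fraction and accuracy hit the desired trade-off.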

Figure 2: MAGIC3 Overview. Frozen encoders provide text, visual, audio, and rewrite features. The Cross-Modal Consistency Gate outputs pairwise and global consistency scores; Consistency Field Estimator converts cross-modal attention into token- and frame-level consistency fields; Temporal Cross-Mo


This review was generated by AI and checked by a human editor.