QUICK REVIEW

[Paper Review] Exposing Cross-Modal Consistency for Fake News Detection in Short-Form Videos

Chong Tian, Yu Wang | arXiv (Cornell University) | 2026-03-16

Misinformation and Its Impacts | Citations: 0

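One-Line Summary

MAGIC3 introduces a cross-modal consistency lens for fake news in short-form videos, using text, visual, and audio signals together with uncertainty-based VLM routing, and achieves strong accuracy at high throughput.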

ABSTRACT

Short-form video platforms are major channels for news but also fertile ground for multimodal misinformation where each modality appears plausible alone yet cross-modal relationships are subtly inconsistent, like mismatched visuals and captions. On two benchmark datasets, FakeSV (Chinese) and FakeTT (English), we observe a clear asymmetry: real videos exhibit high text-visual but moderate text-audio consistency, while fake videos show the opposite pattern. Moreover, a single global consistency score forms an interpretable axis along which fake probability and prediction errors vary smoothly. Motivated by these observations, we present MAGIC3 (Modal-Adversarial Gated Interaction and Consistency-Centric Classifier), a detector that explicitly models and exposes cross-tri-modal consistency signals at multiple granularities. MAGIC3 combines explicit pairwise and global consistency modeling with token- and frame-level consistency signals derived from cross-modal attention, incorporates multi-style LLM rewrites to obtain style-robust text representations, and employs an uncertainty-aware classifier for selective VLM routing. Using pre-extracted features, MAGIC3 consistently outperforms the strongest non-VLM baselines on FakeSV and FakeTT. While matching VLM-level accuracy, the two-stage system achieves 18-27x higher throughput and 93% VRAM savings, offering a strong cost-performance tradeoff.

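Motivation and Objectives

Motivate the need for detecting multimodal misinformation in short-form videos, where each modality looks plausible on its own but the modalities are subtly inconsistent with one another.
Characterize cross-modal consistency patterns across the text-visual, text-audio, and visual-audio pairs, and identify an interpretable global consistency axis.
Develop a lightweight, interpretable detector that exposes multi-granularity consistency signals (pairwise, global, and token-/frame-level) together with uncertainty, and uses them to guide efficient detection.
Enable a two-stage routing system that uses consistency and uncertainty to decide when to invoke a heavyweight vision-language model (VLM).
Improve robustness to stylistic variation by strengthening text representations with multi-style LLM rewrites.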

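Proposed Method

Compute explicit cross-modal consistency with the Cross-Modal Consistency Gate (CMCG), which yields pairwise and global consistency scores.
Derive token- and frame-level consistency fields from cross-modal attention with the Consistency Field Estimator (CFE).
Introduce Temporal Cross-Modal Inconsistency (TCMI) to capture audio-visual mismatches over time.
Use Adversarial-Aware Rewrite Fusion (AARF) to fuse the original text with multi-style LLM rewrites, producing style-robust representations.
Build a global video representation with a Hierarchical Multimodal Transformer (HMT) equipped with consistency-weighted cross-attention.
Train with Contrastive-Adversarial Joint Learning (CAJL), which combines a supervised loss, intra-/cross-modal contrastive losses, an adversarial consistency regularizer, and a consistency regularizer.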
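
To make the idea of pairwise and global consistency scores concrete, here is a minimal sketch. It is not the paper's CMCG, which is a learned module; as a stand-in it assumes each modality has already been pooled into a single embedding, uses rescaled cosine similarity as the pairwise score, and takes the plain mean as the global score. The function names `cosine` and `consistency_scores` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_scores(features):
    """Return (pairwise, global) consistency for a dict of modality -> pooled embedding.

    Assumption: cosine similarity rescaled to [0, 1] stands in for a learned gate,
    and the global score is the unweighted mean of the pairwise scores.
    """
    pairwise = {}
    for a, b in combinations(sorted(features), 2):
        pairwise[(a, b)] = 0.5 * (cosine(features[a], features[b]) + 1.0)
    global_score = sum(pairwise.values()) / len(pairwise)
    return pairwise, global_score


# Toy embeddings: text and visual agree, audio is orthogonal to both.
example = {
    "text": [1.0, 0.0, 0.0],
    "visual": [1.0, 0.0, 0.0],
    "audio": [0.0, 1.0, 0.0],
}
pairwise, global_score = consistency_scores(example)
for pair, score in pairwise.items():
    print(pair, round(score, 3))
print("global", round(global_score, 3))
```

In this toy case the text-visual pair scores highest and the two audio pairs sit in the middle of the range, which is the kind of asymmetry the review describes for real videos.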

Figure 1: Illustration of cross-modal consistency patterns. In real news short videos, text, visuals, and audio are contextually aligned (Consistent). In fake news, a “semantic gap” often exists between the sensational claims (text/audio) and the actual visual content. MAGIC3 acts as a consistency

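Experimental Results

Research Questions

RQ1: What cross-modal consistency patterns distinguish real from fake short-form videos?
RQ2: Can a lightweight detector expose multi-granularity consistency signals that correlate with fake probability and prediction difficulty?
RQ3: Does introducing multi-style LLM rewrites improve robustness to stylistic variation in fake news detection?
RQ4: Can uncertainty-aware routing to a heavyweight VLM achieve VLM-level accuracy at much higher throughput?
RQ5: How do token-/frame-level consistency fields and temporal inconsistency help localize inconsistent signals?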

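Key Findings

Real videos show high text-visual consistency and moderate text-audio consistency, while fake videos show the opposite pattern (high text-audio, low text-visual).
A single global consistency score correlates with prediction difficulty, and prediction errors cluster at intermediate values of the score.
Two-stage routing based on uncertainty and global consistency sends roughly 25% of samples to the VLM, achieving competitive accuracy while substantially increasing throughput.
Using pre-extracted features, MAGIC3 outperforms the strongest non-VLM baselines on FakeSV and FakeTT; combined with a heavyweight VLM in the two-stage system, it matches VLM-level accuracy at 18-27x higher throughput.
Multi-style LLM rewrites fused through AARF improve robustness; removing AARF degrades performance, most notably on FakeTT.
Ablation studies show that the core consistency modules (CMCG, CFE, TCMI) are critical to performance.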
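
The two-stage routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's routing rule: uncertainty is approximated by the binary entropy of the lightweight detector's fake probability, a sample is also routed when its global consistency falls in a middle band (where errors are reported to cluster), and the thresholds `entropy_max` and `band` as well as the `vlm_predict` callback are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0.0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def should_route(p_fake, global_consistency, entropy_max=0.8, band=(0.4, 0.6)):
    """Route to the VLM if the prediction is uncertain or the consistency is ambiguous.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    uncertain = binary_entropy(p_fake) > entropy_max
    ambiguous = band[0] <= global_consistency <= band[1]
    return uncertain or ambiguous


def two_stage_predict(samples, vlm_predict):
    """Return (labels, routed_fraction); True means the sample is predicted fake."""
    labels = []
    routed = 0
    for sample in samples:
        if should_route(sample["p_fake"], sample["consistency"]):
            routed += 1
            labels.append(vlm_predict(sample))  # expensive second stage
        else:
            labels.append(sample["p_fake"] >= 0.5)  # cheap first-stage decision
    return labels, routed / len(samples)


# Toy inputs; the lambda stands in for a real VLM call.
samples = [
    {"p_fake": 0.95, "consistency": 0.9},  # confident, clear consistency
    {"p_fake": 0.05, "consistency": 0.1},  # confident, clear consistency
    {"p_fake": 0.50, "consistency": 0.9},  # maximally uncertain
    {"p_fake": 0.10, "consistency": 0.5},  # ambiguous consistency band
]
labels, routed_fraction = two_stage_predict(samples, vlm_predict=lambda s: True)
print(labels, routed_fraction)
```

Only the routed samples pay the cost of the second stage, which is where the throughput gain of such a design would come from; in practice the thresholds would be tuned on validation data to hit a target routing rate.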

Figure 2: MAGIC3 Overview. Frozen encoders provide text, visual, audio, and rewrite features. The Cross-Modal Consistency Gate outputs pairwise and global consistency scores; Consistency Field Estimator converts cross-modal attention into token- and frame-level consistency fields; Temporal Cross-Mo


This review was generated by AI and reviewed by a human editor.