QUICK REVIEW

[Paper Review] Agentic Mixed-Source Multi-Modal Misinformation Detection with Adaptive Test-Time Scaling

Wei Jiang, Tong Chen|arXiv (Cornell University)|Mar 3, 2026

Misinformation and Its Impacts|0 citations

TL;DR

AgentM3D introduces a multi-agent framework with adaptive test-time scaling and critique-aware Best-of-N reasoning to detect mixed-source multi-modal misinformation in zero-shot settings, achieving state-of-the-art results with efficient inference.

ABSTRACT

Vision-language models (VLMs) have been proven effective for detecting multi-modal misinformation on social platforms, especially in zero-shot settings with unavailable or delayed annotations. However, a single VLM's capacity falls short in the more complex mixed-source multi-modal misinformation detection (M3D) task. Taking captioned images as an example, in M3D, false information can originate from untruthful texts, forged images, or mismatches between the two modalities. Although recent agentic systems can handle zero-shot M3D by connecting modality-specific VLM agents, their effectiveness is still bottlenecked by their architecture. In existing agentic M3D solutions, for any input sample, each agent performs only one forward reasoning pass, making decisions prone to model randomness and reasoning errors in challenging cases. Moreover, the lack of exploration over alternative reasoning paths prevents modern VLMs from fully utilizing their reasoning capacity. In this work, we present AgentM3D, a multi-agent framework for zero-shot M3D. To amplify the reasoning capability of VLMs, we introduce an adaptive test-time scaling paradigm in which each modality-specific VLM agent applies a Best-of-N mechanism, coupled with a critic agent for task-aligned scoring. The agents are organized in a cascading, modality-specific decision chain to reduce unnecessary computation and limit error propagation. To ensure scalability, a planning agent dynamically determines the maximum number of reasoning paths based on sample difficulty, and an adaptive stopping mechanism prevents excessive reasoning within each agent. Extensive experiments on two M3D benchmarks demonstrate that AgentM3D achieves state-of-the-art zero-shot detection performance compared with various VLM-based and agentic baselines.

Motivation & Objective

Motivate robust detection of mixed-source multi-modal misinformation (M3D) where text, image, and cross-modal signals can be independently distorted.
Propose a hierarchical cascade of modality-specific detection agents to reduce error propagation.
Introduce adaptive test-time scaling (Best-of-N with critique-aware ranking) and a planning module to balance accuracy and efficiency.
Provide task-aligned scoring through reward models and modality-specific critique signals.
Demonstrate state-of-the-art zero-shot performance on M3D benchmarks with improved efficiency.

Proposed method

Three modality-specific detection agents (textual veracity, visual veracity, cross-modal consistency) are organized in a hierarchical cascade.
Best-of-N reasoning with critique-aware ranking is used to explore multiple reasoning trajectories for each agent, with a fused score guiding selection.
A planning agent dynamically decides when to activate enhanced reasoning, enabling adaptive test-time scaling.
Critique signals from modality-specific tools (logic consistency, image forgery detectors) accompany reward signals to inform candidate ranking.
Adaptive top-m early-stopping reduces computation by stopping once the top candidates are sufficiently distinguished.
Formal probabilistic interpretation ties agent inference to a posterior-like distribution with a scoring function combining rewards and critiques.
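
The components listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the convex fusion weight `alpha`, the planner thresholds in `plan_budget`, the margin-based stopping rule, and the exit-on-first-detection cascade are all assumptions made for the example, since the review does not give the exact formulas.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    label: str       # verdict of one reasoning path: "fake" or "real"
    reward: float    # reward-model score, assumed to lie in [0, 1]
    critique: float  # critique-tool score, assumed to lie in [0, 1]


def fused_score(c: Candidate, alpha: float = 0.5) -> float:
    # Assumed fusion rule: convex combination of reward and critique.
    return alpha * c.reward + (1.0 - alpha) * c.critique


def plan_budget(difficulty: float, n_max: int = 8) -> int:
    # Hypothetical planner: easy samples get a single pass (standard
    # reasoning); harder samples get a larger Best-of-N budget.
    return 1 if difficulty < 0.3 else max(2, round(difficulty * n_max))


def best_of_n(sample_path: Callable[[], Candidate], max_n: int,
              top_m: int = 2, margin: float = 0.2) -> Candidate:
    # Draw up to max_n reasoning paths and keep the best fused score.
    candidates: List[Candidate] = []
    for _ in range(max_n):
        candidates.append(sample_path())
        ranked = sorted(candidates, key=fused_score, reverse=True)
        # Assumed stopping rule: stop once the leader is ahead of the
        # m-th best candidate by at least `margin`.
        if (len(ranked) >= top_m
                and fused_score(ranked[0]) - fused_score(ranked[top_m - 1]) >= margin):
            break
    return max(candidates, key=fused_score)


def cascade(agents: List[Tuple[str, Callable[[], Candidate]]], max_n: int) -> str:
    # Run agents in order (text, image, cross-modal). Returning at the
    # first "fake" verdict means later agents are never invoked.
    for name, sample_path in agents:
        if best_of_n(sample_path, max_n).label == "fake":
            return name
    return "real"
```

In this sketch, `cascade(agents, max_n=plan_budget(d))` would return the name of the first agent that flags the sample, or `"real"` if none does. A budget of 1 reduces `best_of_n` to a single forward pass, which corresponds to the standard reasoning route.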

Figure 1. Comparison between single-source and mixed-source multi-modal misinformation detection.

Experimental results

Research questions

RQ1: How does AgentM3D perform compared with strong VLM-based baselines and agentic methods for zero-shot M3D?
RQ2: Can adaptive test-time scaling balance accuracy and inference efficiency better than existing approaches?
RQ3: What is the contribution of adaptive BoN reasoning and critique signals to detection performance?
RQ4: How do planner and early-stopping mechanisms affect cost and reliability?
RQ5: What is the impact of hyperparameters on performance and efficiency?

Key findings

Backbone	Method	MMFakeBench Acc	MMFakeBench F1	MMFakeBench Rec	MMFakeBench Pre	Combined Acc	Combined F1	Combined Rec	Combined Pre
Qwen3-VL-4B	Standard	42.9	29.2	35.8	35.8	30.3	23.6	31.0	42.3
Qwen3-VL-4B	BoN	43.7	31.1	36.4	47.9	28.2	21.9	29.5	40.5
Qwen3-VL-4B	T2 Agent	50.1	50.3	49.4	54.6	35.4	35.6	38.2	45.7
Qwen3-VL-4B	MMD-Agent	55.2	55.4	55.8	57.1	41.9	40.9	44.1	48.5
Qwen3-VL-4B	MMD-Agent+BoN	57.4	57.8	58.6	58.5	40.6	39.7	42.7	48.6
Qwen3-VL-4B	AgentM3D (Ours)	58.1	58.0	60.0	57.1	45.4	45.6	47.3	49.0
Qwen3-VL-8B	Standard	46.9	37.0	39.4	59.9	33.6	28.9	36.0	40.6
Qwen3-VL-8B	BoN	45.7	35.6	38.4	62.5	33.6	28.4	36.3	42.9
Qwen3-VL-8B	T2 Agent	54.3	54.0	52.0	61.3	36.2	36.1	38.8	45.5
Qwen3-VL-8B	MMD-Agent	59.4	60.2	60.3	62.5	43.3	43.5	45.2	50.5
Qwen3-VL-8B	MMD-Agent+BoN	60.1	60.7	60.4	62.9	42.3	42.6	44.3	48.7
Qwen3-VL-8B	AgentM3D (Ours)	62.0	62.6	64.2	62.1	48.1	48.3	50.5	52.4

AgentM3D achieves the highest accuracy, F1, and recall on both MMFakeBench and Combined with both backbones; its precision is the highest on Combined but slightly below MMD-Agent+BoN on MMFakeBench.
Adaptive planning triggers BoN reasoning for about 69.1% of MMFakeBench samples and 77.2% of Combined samples, so the remaining samples take the cheaper standard reasoning path.
Critique-aware BoN improves stability and accuracy in cases where naive BoN or single-pass reasoning fails; for example, adding naive BoN to MMD-Agent lowers Combined accuracy on both backbones (41.9 to 40.6 and 43.3 to 42.3).
AgentM3D attains higher accuracy at a moderate latency increase, offering a favorable accuracy–latency trade-off.
On Qwen3-VL-4B-Instruct, AgentM3D reaches accuracy 58.1 (MMFakeBench) and 45.4 (Combined); on Qwen3-VL-8B-Instruct, it reaches 62.0 (MMFakeBench) and 48.1 (Combined).
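
To make the size of the improvement concrete, the short script below computes AgentM3D's accuracy gain over the strongest baseline in each setting. The accuracy values are copied from the results table above; the helper name `gain_over_best_baseline` is introduced here only for illustration.

```python
# Accuracy values copied from the results table (Acc columns only).
ACC = {
    ("Qwen3-VL-4B", "MMFakeBench"): {
        "Standard": 42.9, "BoN": 43.7, "T2 Agent": 50.1,
        "MMD-Agent": 55.2, "MMD-Agent+BoN": 57.4, "AgentM3D": 58.1,
    },
    ("Qwen3-VL-4B", "Combined"): {
        "Standard": 30.3, "BoN": 28.2, "T2 Agent": 35.4,
        "MMD-Agent": 41.9, "MMD-Agent+BoN": 40.6, "AgentM3D": 45.4,
    },
    ("Qwen3-VL-8B", "MMFakeBench"): {
        "Standard": 46.9, "BoN": 45.7, "T2 Agent": 54.3,
        "MMD-Agent": 59.4, "MMD-Agent+BoN": 60.1, "AgentM3D": 62.0,
    },
    ("Qwen3-VL-8B", "Combined"): {
        "Standard": 33.6, "BoN": 33.6, "T2 Agent": 36.2,
        "MMD-Agent": 43.3, "MMD-Agent+BoN": 42.3, "AgentM3D": 48.1,
    },
}


def gain_over_best_baseline(scores, ours="AgentM3D"):
    # Return the strongest baseline and the absolute accuracy gap to it.
    baselines = {name: acc for name, acc in scores.items() if name != ours}
    best = max(baselines, key=baselines.get)
    return best, round(scores[ours] - baselines[best], 1)


for (backbone, benchmark), scores in ACC.items():
    best, gain = gain_over_best_baseline(scores)
    print(f"{backbone} / {benchmark}: +{gain} over {best}")
```

The gains are larger on Combined (3.5 and 4.8) than on MMFakeBench (0.7 and 1.9), and the strongest baseline on Combined is MMD-Agent without BoN.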

Figure 2. The overall structure of AgentM3D. A planning agent routes each input to either standard reasoning or critique-aware Best-of-$N$ reasoning. The latter explores multiple reasoning trajectories, integrates reward and critique signals for candidate selection, and applies adaptive early-st


This review was created by AI and reviewed by human editors.