QUICK REVIEW

[Paper Review] Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

Jianghao Yin, Qingbin Li|arXiv (Cornell University)|Jan 12, 2026
Multimodal Machine Learning Applications | Cited by 0
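One-Sentence Summary

CINEMA introduces a cognition-inspired meta-action framework trained with Retrieval-Based Tree Sampling and a two-stage reinforcement learning process; it performs well on multi-image, multi-frame, and single-image reasoning and reaches state-of-the-art results on several benchmarks.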

ABSTRACT

While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with a Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.

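Motivation and Goals

  • Improve multimodal reasoning in multi-image settings by modeling human-like cognitive steps.
  • Propose a framework of five meta-actions that structures reasoning across image sets.
  • Develop data generation and training strategies to bootstrap and refine reasoning trajectories.
  • Demonstrate generalization across multi-image, multi-frame, and single-image tasks, with strong benchmark results.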

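Proposed Method

  • Define five meta-actions (Global, Focus, Hint, Think, Answer) to guide sequential reasoning.
  • Introduce Retrieval-Based Tree Sampling, which uses student-teacher refinement and retrieval to generate diverse, high-quality reasoning trajectories.
  • Construct a training dataset of 57k cold-start samples and 58k reinforcement learning samples covering multi-image, multi-frame, and single-image tasks.
  • Adopt a two-stage reinforcement learning paradigm: an exploration phase with a Diversity-Preserving Strategy, followed by an annealed exploitation phase with DAPO.
  • Train on a Qwen2.5VL 7B backbone with the RL and prompt settings specified in the paper; score math tasks with math_verify/mathruler and all other tasks with exact string matching.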
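To make the meta-action idea and the exact-match scoring concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Step` type, the rule that a trajectory must end with exactly one Answer step, and the whitespace trimming before comparison are all assumptions made for illustration. The review does not specify the output format or any ordering constraints.

```python
from dataclasses import dataclass

# The five meta-actions named in the review.
META_ACTIONS = ("Global", "Focus", "Hint", "Think", "Answer")


@dataclass(frozen=True)
class Step:
    """One step of a reasoning trajectory (hypothetical representation)."""
    action: str   # expected to be one of META_ACTIONS
    content: str  # free-form text produced for this step


def is_valid_trajectory(steps):
    """Assumed well-formedness rule: only known actions, and exactly
    one Answer step, which must come last."""
    if not steps:
        return False
    if any(s.action not in META_ACTIONS for s in steps):
        return False
    answer_positions = [i for i, s in enumerate(steps) if s.action == "Answer"]
    return answer_positions == [len(steps) - 1]


def exact_match_reward(steps, reference):
    """Return 1.0 if the final Answer equals the reference string
    (after trimming whitespace, an assumption), else 0.0."""
    if not is_valid_trajectory(steps):
        return 0.0
    return 1.0 if steps[-1].content.strip() == reference.strip() else 0.0


# Usage with an invented trajectory.
trajectory = [
    Step("Global", "Three photos of the same street at different times."),
    Step("Focus", "Image 2 shows the clock tower most clearly."),
    Step("Think", "The clock reads 3:15, which matches option B."),
    Step("Answer", "B"),
]
reward = exact_match_reward(trajectory, "B")  # 1.0
```

Math tasks would need a symbolic or numeric equivalence check instead of string equality, which is the role the review attributes to math_verify/mathruler.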

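Experimental Results

Research Questions

  • RQ1: Do diverse reasoning trajectories improve multi-image reasoning performance?
  • RQ2: How does the model handle variation in the number of input images in multi-image tasks?
  • RQ3: How does CINEMA perform across task categories (multi-image, video, single-image)?
  • RQ4: What does each meta-action contribute to overall performance?
  • RQ5: How does two-stage reinforcement learning affect entropy, exploration, and performance?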

Key Findings

Model                          MUIR  MMIU  MVMath  EMMA  MIRB  Mantis  MVBench  VideoMME  VideoMMMU  Overall
Ours                           71.6  53.3  36.9    29.3  55.2  67.7    66.5     59.4      49.0       54.3
Ours [with DPS]                67.9  52.2  35.1    28.4  54.4  71.0    67.1     60.2      51.6       54.2
Ours [with DPS and annealing]  71.0  52.2  35.0    28.6  55.7  68.4    66.8     61.0      50.1       54.3
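The Overall column appears to be the unweighted mean of the nine benchmark scores in each row. The review does not state this, so treat it as an inference; the short Python check below reproduces the reported values to one decimal place.

```python
# Per-benchmark scores in table order: MUIR, MMIU, MVMath, EMMA, MIRB,
# Mantis, MVBench, VideoMME, VideoMMMU.
rows = {
    "Ours": [71.6, 53.3, 36.9, 29.3, 55.2, 67.7, 66.5, 59.4, 49.0],
    "Ours [with DPS]": [67.9, 52.2, 35.1, 28.4, 54.4, 71.0, 67.1, 60.2, 51.6],
    "Ours [with DPS and annealing]": [71.0, 52.2, 35.0, 28.6, 55.7, 68.4, 66.8, 61.0, 50.1],
}

# Overall values as reported in the table.
reported = {
    "Ours": 54.3,
    "Ours [with DPS]": 54.2,
    "Ours [with DPS and annealing]": 54.3,
}


def overall(scores):
    """Unweighted mean rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for name, scores in rows.items():
    print(f"{name}: computed {overall(scores)}, reported {reported[name]}")
```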
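  • Reaches state-of-the-art results on several multi-image and video benchmarks, including MUIR, MVMath, EMMA, VideoMME, and VideoMMMU.
  • Surpasses GPT-4o on the MUIR and MVMath benchmarks in the multi-image setting.
  • Outperforms several specialized video reasoning models on video understanding benchmarks.
  • Performs strongly on single-image tasks, matching or exceeding some specialized single-image models.
  • Two-stage RL with diversity preservation reaches competitive accuracy while maintaining higher entropy and more diverse trajectories.
  • Retrieval-Based Tree Sampling with two trajectories per sample improves average performance over single-trajectory training.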
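For readers unfamiliar with the entropy the findings refer to: a policy whose output distribution becomes sharply peaked has low entropy and explores little, which is what "entropy collapse" describes. The Python sketch below computes Shannon entropy for two made-up distributions. The numbers are illustrative only and do not come from the paper, and the sketch is not a description of the Diversity-Preserving Strategy itself.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution.
    Zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Invented next-token distributions over four candidates.
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic ("collapsed")

print(f"uniform: {entropy(uniform):.3f} nats")
print(f"peaked:  {entropy(peaked):.3f} nats")
```

The uniform case attains the maximum for four outcomes, ln 4, while the peaked case is far lower. Tracking this quantity averaged over generated tokens during RL is one common way to monitor whether exploration is being preserved.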


This review was generated by AI and reviewed by human editors.