QUICK REVIEW

[Paper Review] Mimic Human Cognition, Master Multi-Image Reasoning: A Meta-Action Framework for Enhanced Visual Understanding

Jianghao Yin, Qingbin Li | arXiv (Cornell University) | 2026-01-12

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

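CINEMA introduces a cognition-inspired meta-action framework, equipped with Retrieval-Based Tree Sampling and a two-stage reinforcement learning process, to excel at multi-image, multi-frame, and single-image reasoning, and achieves state-of-the-art performance on several benchmarks.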

ABSTRACT

While Multimodal Large Language Models (MLLMs) excel at single-image understanding, they exhibit significantly degraded performance in multi-image reasoning scenarios. Multi-image reasoning presents fundamental challenges including complex inter-relationships between images and scattered critical information across image sets. Inspired by human cognitive processes, we propose the Cognition-Inspired Meta-Action Framework (CINEMA), a novel approach that decomposes multi-image reasoning into five structured meta-actions: Global, Focus, Hint, Think, and Answer, which explicitly model the sequential cognitive steps humans naturally employ. For cold-start training, we introduce a Retrieval-Based Tree Sampling strategy that generates high-quality meta-action trajectories to bootstrap the model with reasoning patterns. During reinforcement learning, we adopt a two-stage paradigm: an exploration phase with Diversity-Preserving Strategy to avoid entropy collapse, followed by an annealed exploitation phase with DAPO to gradually strengthen exploitation. To train our model, we construct a dataset of 57k cold-start and 58k reinforcement learning instances spanning multi-image, multi-frame, and single-image tasks. We conduct extensive evaluations on multi-image reasoning benchmarks, video understanding benchmarks, and single-image benchmarks, achieving competitive state-of-the-art performance on several key benchmarks. Our model surpasses GPT-4o on the MUIR and MVMath benchmarks and notably outperforms specialized video reasoning models on video understanding benchmarks, demonstrating the effectiveness and generalizability of our human cognition-inspired reasoning framework.

Research Motivation and Objectives

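Improve multimodal reasoning in multi-image settings by modeling human-like cognitive steps.
Propose a framework of five meta-actions that structures reasoning across image sets.
Develop data generation and training strategies to initialize and refine reasoning trajectories.
Demonstrate generalization across multi-image, multi-frame, and single-image tasks, backed by strong benchmark results.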

Proposed Method

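Define five meta-actions (Global, Focus, Hint, Think, and Answer) to guide sequential reasoning.
Introduce Retrieval-Based Tree Sampling, which uses student-teacher refinement and retrieval to generate diverse, high-quality reasoning trajectories.
Construct a training dataset of 57k cold-start instances and 58k reinforcement learning instances covering multi-image, multi-frame, and single-image tasks.
Adopt a two-stage reinforcement learning paradigm: an exploration phase with a Diversity-Preserving Strategy (DPS), followed by an exploitation phase with annealed DAPO.
Train on a Qwen2.5VL 7B backbone with the specified RL and prompt settings, using math_verify/mathruler for math tasks and exact string matching for other tasks.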
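
To make the method description more concrete, the sketch below shows one way a meta-action trajectory could be parsed and its final answer checked by exact string matching. The review does not specify how meta-actions are serialized, so the `<global>...</global>` tag format, the well-formedness rule, and the function names `parse_trajectory`, `is_well_formed`, and `exact_match_score` are all assumptions made for illustration, not the authors' implementation. The review also does not say whether the check serves as an RL reward, an evaluation metric, or both.

```python
import re

# The five meta-actions named in the review, in the order they are listed.
META_ACTIONS = ["global", "focus", "hint", "think", "answer"]

# Hypothetical tag format: <global>...</global>. The source does not specify
# the serialization, so this pattern is an assumption for illustration only.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def parse_trajectory(text):
    """Return a list of (action, content) pairs found in the model output."""
    return [
        (name.lower(), body.strip())
        for name, body in TAG_PATTERN.findall(text)
        if name.lower() in META_ACTIONS
    ]


def is_well_formed(steps):
    """Assumed rule: start with Global and end with exactly one Answer."""
    names = [name for name, _ in steps]
    return (
        len(names) >= 2
        and names[0] == "global"
        and names[-1] == "answer"
        and names.count("answer") == 1
    )


def exact_match_score(steps, reference):
    """1.0 if the Answer content equals the reference exactly, else 0.0."""
    if not is_well_formed(steps):
        return 0.0
    return 1.0 if steps[-1][1] == reference.strip() else 0.0


sample = (
    "<global>Three charts of monthly sales.</global>"
    "<focus>Image 2, the bar for March.</focus>"
    "<think>March is the tallest bar.</think>"
    "<answer>B</answer>"
)
steps = parse_trajectory(sample)
print([name for name, _ in steps])    # ['global', 'focus', 'think', 'answer']
print(exact_match_score(steps, "B"))  # 1.0
```

For math tasks the review names math_verify/mathruler instead of string equality; those checkers are not shown here.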

Experimental Results

Research Questions

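RQ1: Can diverse reasoning trajectories improve multi-image reasoning performance?
RQ2: How does the model respond to changes in the number of input images in multi-image tasks?
RQ3: How does CINEMA perform across task categories such as multi-image, video, and single-image?
RQ4: What does each meta-action contribute to overall performance?
RQ5: How does two-stage reinforcement learning affect entropy, exploration, and performance?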

Key Results

Model	MUIR	MMIU	MVMATH	EMMA	MIRB	Mantis	MVBench	VideoMME	VideoMMMU	Overall
Ours	71.6	53.3	36.9	29.3	55.2	67.7	66.5	59.4	49.0	54.3
Ours [with DPS]	67.9	52.2	35.1	28.4	54.4	71.0	67.1	60.2	51.6	54.2
Ours [with DPS and annealing]	71.0	52.2	35.0	28.6	55.7	68.4	66.8	61.0	50.1	54.3
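
The last column of the table is consistent with an unweighted mean of the nine benchmark scores in each row. The review does not state how that column is computed, so the averaging rule below is an assumption, but it reproduces all three reported values when rounded to one decimal place. The scores are copied from the table above.

```python
# Nine benchmark scores per row, followed by the value reported in the last column.
rows = {
    "Ours": ([71.6, 53.3, 36.9, 29.3, 55.2, 67.7, 66.5, 59.4, 49.0], 54.3),
    "Ours [with DPS]": ([67.9, 52.2, 35.1, 28.4, 54.4, 71.0, 67.1, 60.2, 51.6], 54.2),
    "Ours [with DPS and annealing]": ([71.0, 52.2, 35.0, 28.6, 55.7, 68.4, 66.8, 61.0, 50.1], 54.3),
}


def overall(scores):
    """Unweighted mean rounded to one decimal (assumed aggregation rule)."""
    return round(sum(scores) / len(scores), 1)


for name, (scores, reported) in rows.items():
    print(f"{name}: computed {overall(scores)}, reported {reported}")
```

Because the three means differ by at most 0.1 point, the per-benchmark columns are more informative than the aggregate when comparing these variants.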

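Achieves state-of-the-art performance on several multi-image and video benchmarks, including MUIR, MVMath, EMMA, VideoMME, and VideoMMMU.
Surpasses GPT-4o on the MUIR and MVMath benchmarks in the multi-image setting.
Outperforms a number of specialized video reasoning models on video understanding benchmarks.
Shows strong single-image performance, matching or exceeding some dedicated single-image models.
Two-stage RL with diversity preservation maintains higher entropy and more diverse trajectories while still reaching competitive accuracy.
Retrieval-Based Tree Sampling with two trajectories per instance improves average performance over single-trajectory training.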
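
The entropy that the two-stage RL finding refers to can be illustrated with a toy calculation. The sketch below computes Shannon entropy for two made-up next-action distributions: one spread evenly and one concentrated on a single option, which is the situation usually meant by "entropy collapse". The distributions and the helper name `shannon_entropy` are illustrative assumptions, not values or code from the paper.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical policy distributions over four candidate continuations.
diverse = [0.25, 0.25, 0.25, 0.25]    # exploration still possible
collapsed = [0.97, 0.01, 0.01, 0.01]  # almost always the same choice

print(f"diverse:   {shannon_entropy(diverse):.3f} nats")
print(f"collapsed: {shannon_entropy(collapsed):.3f} nats")
```

A policy near the collapsed case samples nearly identical trajectories, so further RL updates see little variation to learn from. That is the motivation the abstract gives for preserving diversity in the first stage.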

This review was generated by AI and reviewed by a human editor.