QUICK REVIEW

[Paper Review] Efficient Multimodal Planning Agent for Visual Question-Answering

Zhuo Job Chen, Xinyu Geng|arXiv (Cornell University)|Jan 28, 2026

Multimodal Machine Learning Applications | Citations: 0

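One-Sentence Summary

This paper introduces a multimodal planning agent that dynamically selects which mRAG steps are needed to answer a VQA query, improving efficiency while maintaining or improving accuracy across six datasets.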

ABSTRACT

Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate the inefficiency limitations while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average over six various datasets. Code will be released.

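Motivation and Objectives

Reduce the inefficiency of rigid multimodal RAG pipelines for VQA without sacrificing task performance.
Develop an agent that decides, per query, when to use image search or text search and when to skip a step.
Learn these workflow decisions through automatic data annotation and fine-tuning of a multimodal LLM agent.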

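Proposed Method

Define a VQA query as a pair (image i, text t) with a ground-truth answer a, and define a gold query g, used together with i, for the decomposition.
Annotate the data with a strong MLLM that decomposes each query into an image query qi, an image answer ai, and a gold query qg.
Define four categories, c1–c4, that indicate which mRAG steps to skip or include, and train the agent to predict these categories.
Fine-tune an MLLM agent (preferably with LoRA at rank 32) to act as a planning agent that selects among no mRAG, image mRAG, text mRAG, or both mRAG paths.
At inference time, the workflow adapts to the predicted category: the agent answers directly, rewrites the query, or retrieves image and/or text context before answering.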
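The category-based routing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the review does not say which of c1–c4 maps to which path, so the `Plan` labels, the `run_plan` function, the `Trace` record, and the stub tools are all hypothetical names introduced here, and the choice to rewrite the query just before text search is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Tuple


class Plan(Enum):
    # Hypothetical labels: the source only names four categories c1-c4
    # and does not state which category corresponds to which path.
    NO_MRAG = "no_mrag"
    IMAGE_ONLY = "image_only"
    TEXT_ONLY = "text_only"
    BOTH = "both"


@dataclass
class Trace:
    tool_calls: List[str] = field(default_factory=list)  # tools actually invoked
    context: List[str] = field(default_factory=list)     # retrieved evidence


def run_plan(
    plan: Plan,
    image: str,
    question: str,
    image_search: Callable[[str], str],
    text_search: Callable[[str], str],
    rewrite: Callable[[str, List[str]], str],
) -> Tuple[str, Trace]:
    """Run only the retrieval steps the predicted plan asks for."""
    trace = Trace()
    query = question
    if plan in (Plan.IMAGE_ONLY, Plan.BOTH):
        trace.tool_calls.append("image_search")
        trace.context.append(image_search(image))
    if plan in (Plan.TEXT_ONLY, Plan.BOTH):
        # Assumption: the question is rewritten into a standalone query,
        # using any image evidence gathered so far, before text search.
        query = rewrite(question, trace.context)
        trace.tool_calls.append("text_search")
        trace.context.append(text_search(query))
    # With Plan.NO_MRAG both branches are skipped and the model answers directly.
    return query, trace


# Stub tools so the sketch runs without any real search backend.
def stub_image_search(image: str) -> str:
    return f"entity found for {image}"


def stub_text_search(query: str) -> str:
    return f"passages for: {query}"


def stub_rewrite(question: str, context: List[str]) -> str:
    return f"{question} [{context[0]}]" if context else question


if __name__ == "__main__":
    for plan in Plan:
        _, trace = run_plan(plan, "photo.jpg", "Who designed this building?",
                            stub_image_search, stub_text_search, stub_rewrite)
        print(plan.value, trace.tool_calls)
```

In a real system the `plan` argument would come from the fine-tuned planning agent's prediction, and the returned query and context would be passed to the answering MLLM. Recording `tool_calls` per query is one simple way to compare tool usage against a pipeline that always runs every step.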

Figure 1: Workflow of our agent on solving VQA with dynamic mRAG strategies. The agent selects a sub-path based on different VQA inputs, which may require image search, query search, neither, or both.

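Experimental Results

Research Questions

RQ1: Can a multimodal planning agent dynamically determine, for each VQA query, whether each mRAG step is necessary?
RQ2: Does dynamic planning reduce unnecessary tool use and computation while maintaining or improving VQA accuracy across different datasets?
RQ3: How does the proposed method perform when transferred to an MLLM other than the one used for training?
RQ4: What is the trade-off between full fine-tuning and LoRA in terms of performance and efficiency?
RQ5: How does the agent compare with prompt-based baselines and existing Deep Research agents in terms of latency and accuracy?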

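Key Findings

The agent cuts search time by more than 60% relative to OmniSearch, averaged over six datasets.
The agent matches or exceeds baseline task performance while reducing costly tool calls.
Compared with WebWatcher, the agent's tool-call latency is on average 3 to 4.5 times lower.
Across the six datasets, the agent achieves better average performance than the default full-mRAG setting and the other baselines.
LoRA at rank 32 is a strong, parameter-efficient alternative to full fine-tuning that does not sacrifice accuracy.
The agent transfers across multiple MLLMs, consistently improving over the no-mRAG and prompt-based baselines.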

Figure 2: Proposed data annotation method.


This review was generated by AI and checked by a human editor.