QUICK REVIEW

[Paper Review] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao, Shijie Wang|arXiv (Cornell University)|Mar 3, 2026

Multimodal Machine Learning Applications | Citations: 0

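One-Line Summary

TRACE is a unified retrieval framework that first generates a structured reasoning trace and then compresses it into a retrieval embedding. This gives universal multimodal retrieval task-adaptive reasoning, strengthening zero-shot generalization while keeping efficiency competitive.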

ABSTRACT

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

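Motivation and Objectives

Shift the paradigm from encoder-centric retrieval to reason-then-encode, so that complex, compositional user intents in multimodal retrieval can be handled.
Design TRACE to generate an explicit reasoning trace and compress it into a retrieval embedding for discriminative tasks.
Build a large-scale, quality-filtered CoT dataset (M-BEIR-CoT) for training and evaluating reasoning-aware retrievers.
Route adaptively: simple queries bypass reasoning while complex queries invoke CoT, balancing accuracy against throughput.
Present evidence of zero-shot transferability and analyze the asymmetry between query-side and candidate-side reasoning.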

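Proposed Method

TRACE performs generate-then-compress retrieval using a vision encoder, a projector, and an LLM backbone.
It is trained with a mixed objective that combines a generative reasoning loss with a discriminative InfoNCE loss.
M-BEIR-CoT is built by routing queries as simple or complex, generating CoT for the complex ones, and applying coarse-to-fine filtering.
The final retrieval embedding is extracted from the hidden state immediately preceding the dedicated <|emb|> token, which allows end-to-end, single-stage training.
The model learns to emit <|emb|> directly for simple queries and to generate CoT for complex ones, demonstrating adaptive routing.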
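The mixed objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the review does not say how the two losses are weighted or which temperature is used, so `gen_weight` and `temperature` are assumed parameters, and the function names are hypothetical.

```python
import numpy as np

def info_nce(q, c, temperature=0.05):
    """In-batch InfoNCE: row i of q is the positive for row i of c.

    temperature=0.05 is an assumed value, not taken from the paper.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    logits = q @ c.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def token_cross_entropy(logits, targets):
    """Mean next-token cross-entropy; logits is (T, V), targets is (T,)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(log_prob[np.arange(len(targets)), targets]))

def mixed_loss(q, c, lm_logits, lm_targets, gen_weight=1.0):
    """Discriminative InfoNCE plus a weighted generative loss.

    gen_weight is a hypothetical knob; the review does not state the weighting.
    """
    return info_nce(q, c) + gen_weight * token_cross_entropy(lm_logits, lm_targets)
```

In such a setup the generative term would be computed over the reasoning-trace tokens and the InfoNCE term over the embeddings read out at <|emb|>, so one forward pass can feed both losses.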

Figure 1 : The TRACE Framework. TRACE learns a query-dependent inference strategy. (a) For simple queries, it implicitly bypasses the reasoning stage and directly extracts features to maintain high efficiency. (b) For complex queries, it automatically activates the task-adaptive reasoning process. T

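Experimental Results

Research Questions

RQ1: Do retrieval models benefit from integrating explicit reasoning traces into the embedding process?
RQ2: Can task-adaptive reasoning improve retrieval performance on complex, compositional multimodal queries without sacrificing efficiency?
RQ3: Can a single-stage model that unifies reasoning and embedding provide strong zero-shot generalization to unseen domains?
RQ4: How does the asymmetry between query-side and candidate-side reasoning affect retrieval performance?
RQ5: What is the effect of extracting the embedding from different positions in the autoregressive generation?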

Key Findings

q^t -> c^i	q^t -> c^t	q^t -> (c^i,c^t)	q^i -> c^t	q^i -> c^i	(q^i,q^t) -> c^t	(q^i,q^t) -> c^i	(q^i,q^t) -> (c^i,c^t)	Avg.
42.1	82.3	30.5	87.8	64.1	82.5	41.2	91.3	58.8

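TRACE establishes state-of-the-art performance on the M-BEIR benchmark, particularly on reasoning-intensive tasks.
The model exhibits adaptive routing: simple queries bypass reasoning and achieve high throughput, while complex queries gain accuracy from reasoning.
On MSCOCO (simple), TRACE achieves high throughput with competitive accuracy; on CIRR (complex), it trades speed for higher accuracy.
Zero-shot experiments on 13 unseen datasets show strong generalization to reasoning-heavy tasks and competitive results on standard matching tasks.
Ablation studies show that pre-token embedding extraction yields the best retrieval performance, and that the full CoT trace outperforms partial or isolated reasoning components.
A two-stage pipeline of external CoT plus encoder underperforms TRACE, underscoring the benefit of internalizing reasoning end to end.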
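The routing and pre-token extraction behaviour reported above can be illustrated with a toy decoding loop. This is a sketch under stated assumptions, not TRACE's code: `step_fn` is a hypothetical single-step decoder, `scripted_decoder` is a stand-in that replays a fixed token script with random hidden states, and "pre-token" is read here as the hidden state at the position that predicts <|emb|>.

```python
import numpy as np

EMB_TOKEN = "<|emb|>"

def generate_and_embed(step_fn, prompt_state, max_new_tokens=64):
    """Decode until EMB_TOKEN appears; return (trace_tokens, embedding).

    step_fn(state) -> (next_token, hidden, new_state) is a hypothetical
    interface. `hidden` is the hidden state at the position that predicted
    `next_token`, i.e. the position just before that token.
    """
    state, trace = prompt_state, []
    for _ in range(max_new_tokens):
        token, hidden, state = step_fn(state)
        if token == EMB_TOKEN:
            # Pre-token extraction: use the state that produced <|emb|>.
            return trace, hidden / np.linalg.norm(hidden)
        trace.append(token)
    raise RuntimeError("decoder never emitted the embedding token")

def scripted_decoder(tokens, dim=4, seed=0):
    """Toy stand-in for a model: replays `tokens` with random hidden states."""
    rng = np.random.default_rng(seed)
    def step_fn(i):
        return tokens[i], rng.normal(size=dim), i + 1
    return step_fn

# Simple query: the model emits <|emb|> immediately, so no reasoning is decoded.
simple_trace, simple_emb = generate_and_embed(scripted_decoder([EMB_TOKEN]), 0)

# Complex query: a reasoning trace is decoded first, then <|emb|>.
complex_trace, complex_emb = generate_and_embed(
    scripted_decoder(["reason", "about", "query", EMB_TOKEN]), 0
)
```

The throughput difference between the two paths follows from the loop: the simple query costs one decoding step, while the complex query costs one step per trace token plus one.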

Figure 2 : The construction pipeline of the M-BEIR-CoT dataset. The process operates in three phases: (1) Query Complexity Assessment: An advanced MLLM assesses query difficulty, routing simple queries to a direct path (generating only <|emb|> ) and complex queries to a reasoning path (generating Co


This review was generated by AI and checked by a human editor.