QUICK REVIEW

[Paper Review] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao, Shijie Wang|arXiv (Cornell University)|Mar 3, 2026

Multimodal Machine Learning Applications | Cited by 0

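One-Sentence Summary

TRACE proposes a unified retrieval framework that first generates a structured reasoning trace and then compresses it into a retrieval embedding, bringing task-adaptive reasoning to universal multimodal retrieval with strong zero-shot generalization and competitive efficiency.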

ABSTRACT

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

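Motivation and Objectives

Drive a paradigm shift from encode-only retrieval to reason-then-encode retrieval, in order to handle complex, compositional user intents in multimodal retrieval.
Design TRACE to generate an explicit reasoning trace and compress it into a retrieval embedding for discriminative tasks.
Build a large-scale, quality-filtered Chain-of-Thought (CoT) dataset (M-BEIR-CoT) for training and evaluating reasoning-aware retrievers.
Demonstrate adaptive routing: simple queries skip reasoning to raise throughput, while complex queries invoke CoT, balancing accuracy against throughput.
Provide evidence of zero-shot transfer and analyze the asymmetry between query-side and candidate-side reasoning.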

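Proposed Method

Propose TRACE, which uses a vision encoder, a projector, and an LLM backbone to perform generate-then-compress retrieval.
Train with a hybrid objective that combines a generative reasoning loss and a discriminative InfoNCE loss.
Build M-BEIR-CoT by routing queries into simple and complex paths, generating CoT for the complex ones, and applying coarse-to-fine filtering.
Extract the final retrieval embedding from the hidden state preceding the dedicated <|emb|> token, which enables end-to-end, single-stage training.
Demonstrate adaptive routing: the model learns to emit <|emb|> directly for simple queries and to generate CoT for complex ones.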
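
The hybrid objective in the list above can be illustrated with a small sketch. This is not the authors' implementation: the additive form, the `weight` factor, the `temperature` value, and the use of in-batch negatives are all assumptions made for illustration, and the generative loss is passed in as a precomputed number rather than derived from a model.

```python
import math


def info_nce(query_embs, cand_embs, temperature=0.05):
    """Mean InfoNCE loss over a batch, using in-batch negatives.

    Assumption: cand_embs[i] is the positive candidate for query_embs[i],
    and every other candidate in the batch acts as a negative.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    queries = [normalize(v) for v in query_embs]
    cands = [normalize(v) for v in cand_embs]

    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, c)) / temperature for c in cands]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of the positive candidate.
        total += log_denom - logits[i]
    return total / len(queries)


def hybrid_loss(gen_loss, query_embs, cand_embs, weight=1.0, temperature=0.05):
    """Hypothetical combination: generative loss plus weighted InfoNCE."""
    return gen_loss + weight * info_nce(query_embs, cand_embs, temperature)
```

With well-aligned query and candidate embeddings the contrastive term approaches zero, so the total is dominated by the generative loss; when every candidate looks the same, the contrastive term approaches the log of the batch size.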

Figure 1 : The TRACE Framework. TRACE learns a query-dependent inference strategy. (a) For simple queries, it implicitly bypasses the reasoning stage and directly extracts features to maintain high efficiency. (b) For complex queries, it automatically activates the task-adaptive reasoning process. T

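Experimental Results

Research Questions

RQ1: Can a retrieval model benefit from integrating explicit reasoning traces into its embedding process?
RQ2: Can task-adaptive reasoning improve retrieval on complex, compositional multimodal queries without sacrificing efficiency?
RQ3: Can a unified, single-stage model that reasons before encoding deliver strong zero-shot generalization to unseen domains?
RQ4: How does the asymmetry between query-side and candidate-side reasoning affect retrieval performance?
RQ5: What is the effect of extracting the embedding from different positions in the autoregressive generation?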

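Key Findings

TRACE achieves state-of-the-art performance on the M-BEIR benchmark, especially on reasoning-intensive tasks.
The model exhibits adaptive routing: simple queries usually bypass reasoning for high throughput, while complex queries benefit from reasoning and gain accuracy.
On MSCOCO (simple), TRACE delivers high throughput with competitive accuracy; on CIRR (complex), it trades some speed for higher accuracy.
Zero-shot experiments on 13 unseen datasets show strong generalization on reasoning-intensive tasks and competitive performance on standard matching tasks.
Ablation studies show that pre-token embedding extraction yields the best retrieval performance, and that the full CoT trace outperforms partial or isolated reasoning components.
A two-stage pipeline of external CoT plus an encoder underperforms TRACE, highlighting the benefit of internalizing reasoning end to end.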
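
To make the routing and pre-token extraction findings concrete, the sketch below shows one possible reading of them. It assumes the model's output is available as a token list with one hidden state per position, and that "pre-token" means the hidden state at the position immediately before <|emb|>. The function name, the `prompt_len` parameter, and the data layout are hypothetical and do not come from the paper.

```python
EMB_TOKEN = "<|emb|>"


def extract_pre_token_embedding(tokens, hidden_states, prompt_len):
    """Return (embedding, used_reasoning) for one decoded sequence.

    tokens:        prompt tokens followed by generated tokens.
    hidden_states: one vector per position in `tokens` (assumed layout).
    prompt_len:    number of prompt tokens at the start of `tokens`.
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")

    # Look for <|emb|> only in the generated part of the sequence.
    emb_pos = tokens.index(EMB_TOKEN, prompt_len)  # ValueError if never emitted
    if emb_pos == 0:
        raise ValueError("no position precedes <|emb|>")

    # Reasoning was bypassed when <|emb|> is the first generated token.
    used_reasoning = emb_pos > prompt_len

    # Pre-token extraction: the state just before <|emb|>, i.e. the state
    # from which the model predicted the <|emb|> token.
    return hidden_states[emb_pos - 1], used_reasoning
```

Under this reading, a simple query costs a single decoding step after the prompt, while a complex query pays for every CoT token generated before <|emb|>, which is where the accuracy versus throughput trade-off arises.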

Figure 2 : The construction pipeline of the M-BEIR-CoT dataset. The process operates in three phases: (1) Query Complexity Assessment: An advanced MLLM assesses query difficulty, routing simple queries to a direct path (generating only <|emb|> ) and complex queries to a reasoning path (generating Co

This review was generated by AI and reviewed by a human editor.