QUICK REVIEW

[Paper Review] TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Xiangzhao Hao, Shijie Wang | arXiv (Cornell University) | 2026-03-03

Multimodal Machine Learning Applications | Citations: 0

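One-line summary

TRACE proposes a unified retrieval framework that first generates a structured reasoning trace and then compresses it into a retrieval embedding, enabling task-adaptive reasoning for universal multimodal retrieval with strong zero-shot generalization and competitive efficiency.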

ABSTRACT

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong reasoning capabilities, prevailing adaptations confine them to static encoders, underutilizing their generative potential. This encoder-only paradigm struggles with complex intents that demand logical deduction rather than superficial pattern matching. To address this, we introduce TRACE (Task-adaptive Reasoning And Compressing Embeddings). TRACE unifies generative reasoning with discriminative representation learning. It first generates a structured Chain-of-Thought (CoT) to explicitly reason about the query, and subsequently compresses this reasoning trace into a compact embedding via a dedicated token. To train this framework, we construct M-BEIR-CoT, a large-scale dataset featuring a difficulty-aware routing strategy. Experiments on the M-BEIR benchmark establish TRACE as the new state-of-the-art. Crucially, TRACE demonstrates a learned implicit routing behavior. It autonomously activates reasoning for complex queries while bypassing it for simpler ones, achieving an optimal balance between retrieval accuracy and inference throughput. Furthermore, by internalizing the deductive process, TRACE exhibits remarkable zero-shot transferability to unseen domains and novel constraints.

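Motivation and Objectives

Call for a paradigm shift from encoder-only retrieval to reason-then-encode retrieval in order to handle complex, compositional user intents in multimodal retrieval.
Design TRACE to generate an explicit reasoning trace and compress it into a retrieval embedding for use in discriminative tasks.
Build a large-scale, high-quality CoT-based dataset (M-BEIR-CoT) for training and evaluating reasoning-aware retrievers.
Demonstrate adaptive routing, in which simple queries bypass reasoning and complex queries invoke CoT, to balance accuracy and throughput.
Present evidence of zero-shot transferability and analyze the asymmetry between query-side and candidate-side reasoning.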

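Proposed Method

Propose TRACE, which uses a vision encoder, a projector, and an LLM backbone to perform generate-then-compress retrieval.
Train with a hybrid objective that combines a generative reasoning loss with a discriminative InfoNCE loss.
Construct M-BEIR-CoT by routing queries into simple and complex, generating CoT for the complex ones, and applying coarse-to-fine filtering.
Extract the final retrieval embedding from the hidden state immediately preceding the dedicated <|emb|> token, which enables end-to-end single-stage training.
Demonstrate adaptive routing, in which the model learns to emit <|emb|> directly for simple queries and to generate a CoT for complex queries.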
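
The hybrid objective above can be illustrated with a small sketch. This is not the authors' implementation: it assumes in-batch negatives for InfoNCE, cosine similarity with a temperature, and a simple weighted sum of the two terms. The names `info_nce`, `generation_nll`, `hybrid_loss`, and the weight `lam` are hypothetical, since the review does not state how the losses are balanced.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def info_nce(queries, candidates, temperature=0.05):
    """In-batch InfoNCE (assumed form): candidates[i] is the positive for
    queries[i]; every other candidate in the batch acts as a negative."""
    q = [normalize(x) for x in queries]
    c = [normalize(x) for x in candidates]
    losses = []
    for i, qi in enumerate(q):
        sims = [dot(qi, cj) / temperature for cj in c]
        losses.append(-log_softmax(sims)[i])
    return sum(losses) / len(losses)

def generation_nll(token_log_probs):
    """Mean negative log-likelihood of the target tokens
    (the reasoning trace followed by <|emb|>)."""
    return -sum(token_log_probs) / len(token_log_probs)

def hybrid_loss(queries, candidates, token_log_probs, lam=1.0, temperature=0.05):
    # lam is a hypothetical weight; the source does not specify the balance.
    contrastive = info_nce(queries, candidates, temperature)
    generative = generation_nll(token_log_probs)
    return contrastive + lam * generative
```

With toy inputs, a batch whose queries align with their positives yields a much smaller contrastive term than a batch whose positives are swapped, which is the behavior the discriminative term is meant to reward.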

Figure 1 : The TRACE Framework. TRACE learns a query-dependent inference strategy. (a) For simple queries, it implicitly bypasses the reasoning stage and directly extracts features to maintain high efficiency. (b) For complex queries, it automatically activates the task-adaptive reasoning process. T

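Experimental Results

Research Questions

RQ1: Can retrieval models benefit from integrating explicit reasoning traces into the embedding process?
RQ2: Does task-adaptive reasoning improve retrieval performance on complex, compositional multimodal queries without degrading efficiency?
RQ3: Can a single-stage reason-then-embed model provide strong zero-shot generalization to unseen domains?
RQ4: How does the asymmetry between query-side and candidate-side reasoning affect retrieval performance?
RQ5: What is the effect of extracting the embedding at different positions in the autoregressive generation?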

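Key Results

TRACE achieves state-of-the-art performance on the M-BEIR benchmark, particularly on reasoning-intensive tasks.
The model exhibits adaptive routing: simple queries often bypass reasoning and achieve high throughput, while complex queries gain accuracy from reasoning.
On MSCOCO (simple), it delivers high throughput with competitive accuracy; on CIRR (complex), it trades some speed for accuracy.
Zero-shot experiments on 13 unseen datasets show strong generalization on reasoning-heavy tasks and competitive performance on standard matching tasks.
Ablation studies show that pre-token embedding extraction yields the best retrieval performance, and that the full CoT trace outperforms partial or isolated reasoning components.
A two-stage pipeline of external CoT plus an encoder underperforms TRACE, underscoring the benefit of internalizing reasoning end to end.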
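
The routing behavior and pre-token extraction reported above can be pictured with a toy sketch. It is only an illustration under stated assumptions: "pre-token" is read here as the position immediately preceding <|emb|> in the generated sequence, hidden states are plain lists standing in for model activations, and the helper names `extract_embedding` and `used_reasoning` as well as the placeholder reasoning tokens `r1`, `r2` are hypothetical.

```python
EMB_TOKEN = "<|emb|>"

def extract_embedding(tokens, hidden_states):
    """Return the hidden state at the position just before <|emb|>.

    tokens: prompt tokens followed by generated tokens, ending in <|emb|>.
    hidden_states: one vector per token, aligned with `tokens`.
    """
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must be aligned")
    idx = tokens.index(EMB_TOKEN)  # raises ValueError if <|emb|> is absent
    if idx == 0:
        raise ValueError("no position precedes <|emb|>")
    return hidden_states[idx - 1]

def used_reasoning(prompt_len, tokens):
    """True if any tokens were generated between the prompt and <|emb|>,
    i.e. the model chose the reasoning path rather than the direct path."""
    return tokens.index(EMB_TOKEN) > prompt_len

# Simple query: the model emits <|emb|> right after the prompt.
simple_tokens = ["a", "cat", EMB_TOKEN]
simple_hidden = [[0.1], [0.2], [0.3]]

# Complex query: placeholder reasoning tokens precede <|emb|>.
complex_tokens = ["swap", "the", "dog", "r1", "r2", EMB_TOKEN]
complex_hidden = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

simple_embedding = extract_embedding(simple_tokens, simple_hidden)
complex_embedding = extract_embedding(complex_tokens, complex_hidden)
```

In this reading the routing decision is not a separate classifier: it is simply whether the first generated token is <|emb|>, which is why simple queries cost almost no extra decoding steps.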

Figure 2 : The construction pipeline of the M-BEIR-CoT dataset. The process operates in three phases: (1) Query Complexity Assessment: An advanced MLLM assesses query difficulty, routing simple queries to a direct path (generating only <|emb|> ) and complex queries to a reasoning path (generating Co


This review was generated by AI and reviewed by a human editor.