QUICK REVIEW

[Paper Review] MCoT-MVS: Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning for Composed Image Retrieval

Xuri Ge, Chunhao Wang|arXiv (Cornell University)|Mar 18, 2026

Multimodal Machine Learning Applications|Citations: 0

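One-line summary

MCoT-MVS uses multi-modal chain-of-thought reasoning to decompose user intent, selects patch-level and instance-level visual cues from the reference image, and fuses them with the modified text and the inferred target text to achieve state-of-the-art CIR results on CIRR and FashionIQ.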

ABSTRACT

Composed Image Retrieval (CIR) aims to retrieve target images based on a reference image and modified texts. However, existing methods often struggle to extract the correct semantic cues from the reference image that best reflect the user's intent under textual modification prompts, resulting in interference from irrelevant visual noise. In this paper, we propose a novel Multi-level Vision Selection by Multi-modal Chain-of-Thought Reasoning (MCoT-MVS) for CIR, integrating attention-aware multi-level vision features guided by reasoning cues from a multi-modal large language model (MLLM). Specifically, we leverage an MLLM to perform chain-of-thought reasoning on the multimodal composed input, generating the retained, removed, and target-inferred texts. These textual cues subsequently guide two reference visual attention selection modules to selectively extract discriminative patch-level and instance-level semantics from the reference image. Finally, to effectively fuse these multi-granular visual cues with the modified text and the imagined target description, we design a weighted hierarchical combination module to align the composed query with target images in a unified embedding space. Extensive experiments on two CIR benchmarks, namely CIRR and FashionIQ, demonstrate that our approach consistently outperforms existing methods and achieves new state-of-the-art performance. Code and trained models are publicly released.

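Motivation and objectives

Use MLLM-guided multi-modal chain-of-thought reasoning to decompose the query into retained, removed, and target content, making the user's modification intent explicit.
Select discriminative reference visuals at the patch level and the instance level, guided by the reasoning cues.
Fuse the multi-level visual information with the modified text and the inferred target description in a unified embedding space.
Achieve state-of-the-art CIR performance on the CIRR and FashionIQ benchmarks.
Provide a reproducible pipeline with publicly released code and models.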

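Proposed method

Run multi-modal CoT reasoning with a pretrained MLLM to obtain the retained text (RT), the removed text (DT), and the target-inferred text (TT).
Encode RT, DT, and TT with CLIP-Text to obtain reasoning representations that guide visual selection.
Apply patch-level visual reference selection (PVRS), which uses patch features together with the RT/DT cues to emphasize retained content and suppress noise.
Apply instance-level visual reference selection (IVRS), which uses Grounded SAM segmentation and instance features to emphasize relevant objects.
Combine the multi-level visual features (PVRS, IVRS) with the modified text (T) and TT through a weighted hierarchical combination (WHC) with learnable weights.
Train with a joint embedding loss that aligns the composed query with target images in a shared space.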
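
The text-guided selection step can be illustrated with a small sketch. This is not the authors' PVRS implementation, whose exact formulation the review does not give. It assumes that each patch is scored by its cosine similarity to the retained-text embedding minus its similarity to the removed-text embedding, and that a softmax over those scores weights the patches. The function names, the scoring rule, and the `temperature` parameter are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_patches(patches, retained_vec, removed_vec, temperature=0.1):
    """Weight patches toward retained content and away from removed content.

    Assumed scoring rule (not taken from the paper):
        score = cos(patch, retained) - cos(patch, removed)
    Returns the attention weights and the attention-pooled patch feature.
    """
    scores = [cosine(p, retained_vec) - cosine(p, removed_vec) for p in patches]
    weights = softmax(scores, temperature)
    dim = len(patches[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return weights, pooled


# Toy example: patch 0 matches the retained cue, patch 1 matches the removed cue.
patches = [[1.0, 0.0], [0.0, 1.0]]
weights, pooled = select_patches(patches, retained_vec=[1.0, 0.0], removed_vec=[0.0, 1.0])
print(weights, pooled)
```

In this toy case almost all of the weight falls on the first patch, so the pooled feature is dominated by the content the reasoning cues say should be kept. In a real pipeline the vectors would come from CLIP image and text encoders rather than hand-written lists.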

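Experimental results

Research questions

RQ1: To what extent does multi-modal chain-of-thought reasoning improve the explicit decomposition of retained and removed content for CIR?
RQ2: Does multi-level visual reference selection at the patch and instance levels reduce visual noise and improve retrieval accuracy?
RQ3: Does weighted hierarchical fusion of multi-level visual information and textual signals outperform existing CIR fusion strategies?
RQ4: Do the proposed components yield state-of-the-art results on the CIRR and FashionIQ datasets?

Key findings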

CIRR results (Recall@K, %); Avg is the mean of the four R@K columns.
Method	R@1	R@5	R@10	R@50	Avg
TIRG (Vo et al., 2019)	14.61	48.37	64.08	90.03	54.27
CIRPLANT (Liu et al., 2021)	19.55	52.55	68.39	92.38	58.22
ARTEMIS (Delmas et al., 2022)	16.96	46.10	61.31	87.73	53.03
CLIP4CIR (Baldrati et al., 2023)	38.53	69.98	81.86	95.93	71.58
TGCIR (Wen et al., 2023)	45.25	78.29	87.16	97.30	77.00
SADN (Wang et al., 2024)	44.27	78.10	87.71	97.89	76.99
DQU-CIR (Wen et al., 2024)	46.22	78.17	87.64	97.81	77.46
CaLa (Jiang et al., 2024)	49.11	81.21	89.59	98.00	79.48
SSN (Yang et al., 2024)	43.91	77.25	86.48	97.45	76.27
CASE (Levy et al., 2024)	49.35	80.02	88.75	97.47	78.90
CoVRBLIP (Ventura et al., 2024)	49.69	78.60	86.77	94.31	77.34
SPRC (Bai et al., 2024)	51.96	82.12	89.74	97.18	80.25
ENCODER (Li et al., 2025)	46.10	77.98	87.16	97.64	77.22
CIRLVLM (Sun et al., 2025)	53.64	83.76	90.60	97.93	81.48
CCIN (Tian et al., 2025)	53.41	84.05	91.17	98.00	81.66
MCoT-MVS (Ours)	55.33	84.75	91.45	98.55	82.52
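
The metrics in the table can be reproduced in a few lines. The sketch below shows how Recall@K is conventionally computed from ranked retrieval lists, and checks that the Avg column for the MCoT-MVS row equals the mean of its four R@K values. The function name and the toy rankings are illustrative assumptions, not taken from the paper's released code.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top-k retrieved items.

    rankings: one ranked list of candidate ids per query (best first)
    targets:  the ground-truth target id for each query
    """
    hits = sum(1 for ranked, target in zip(rankings, targets) if target in ranked[:k])
    return 100.0 * hits / len(targets)


# Toy data (hypothetical): two queries, both with target "b".
rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "b"]
for k in (1, 2, 3):
    print(k, recall_at_k(rankings, targets, k))  # 0.0, 50.0, 100.0

# Avg for the MCoT-MVS row, using the values from the table.
mcot_mvs = [55.33, 84.75, 91.45, 98.55]
print(round(sum(mcot_mvs) / len(mcot_mvs), 2))  # 82.52
```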

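On CIRR, MCoT-MVS achieves state-of-the-art R@1/5/10/50 and Avg, exceeding CCIN by 1.92 points at R@1 (55.33 vs. 53.41) and improving the average as well.
On FashionIQ, it outperforms the baselines on all three categories (Dresses, Shirts, Tops&Tees) and on average Recall@10/50.
Ablations show that PVRS and IVRS each contribute meaningfully to performance, and that combining them gives the best results.
Multi-step CoT reasoning improves retrieval performance over single-step reasoning.
The target-inferred text from the MLLM improves performance when combined with the modified text.
WHC learns adaptive fusion weights and outperforms a simple summation baseline.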
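
The contrast between learned fusion weights and plain summation can be sketched as follows. This is a hypothetical two-level layout, not the paper's WHC module: the two visual features are fused first, then the two text features, then the two results, each time with softmax-normalized weights. The grouping, the parameter names, and the use of softmax are assumptions. In training the logits would be learnable parameters; here they are fixed numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(vectors, logits):
    """Combine equal-length vectors with softmax-normalized weights."""
    weights = softmax(logits)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def fuse_query(patch_vec, instance_vec, mod_text_vec, target_text_vec,
               visual_logits, text_logits, top_logits):
    """Hypothetical hierarchical fusion: visual pair, text pair, then both."""
    visual = weighted_sum([patch_vec, instance_vec], visual_logits)
    text = weighted_sum([mod_text_vec, target_text_vec], text_logits)
    return weighted_sum([visual, text], top_logits)


features = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0])

# Equal logits reduce to plain averaging, i.e. a rescaled simple sum.
uniform = fuse_query(*features, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(uniform)  # [0.5, 0.5]

# Unequal logits let one branch dominate, here the patch-level visual feature.
skewed = fuse_query(*features, [10.0, -10.0], [0.0, 0.0], [10.0, -10.0])
print(skewed)
```

With equal logits every component contributes equally, which matches the summation baseline up to scale. Letting the logits move during training is what allows a fusion module of this kind to favor whichever cue is most informative.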

This review was generated by AI and checked by a human editor.