QUICK REVIEW

[Paper Review] DQE-CIR: Distinctive Query Embeddings through Learnable Attribute Weights and Target Relative Negative Sampling in Composed Image Retrieval

Geon Park, Ji-Hoon Park | arXiv (Cornell University) | 2026-03-04

Image Retrieval and Classification Techniques | Citations: 0

One-line summary

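DQE-CIR learns distinctive, attribute-aware query embeddings through learnable attribute weights and target relative negative sampling, which improves fine-grained CIR performance and reduces relevance suppression and semantic confusion.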

ABSTRACT

Composed image retrieval (CIR) addresses the task of retrieving a target image by jointly interpreting a reference image and a modification text that specifies the intended change. Most existing methods are still built upon contrastive learning frameworks that treat the ground truth image as the only positive instance and all remaining images as negatives. This strategy inevitably introduces relevance suppression, where semantically related yet valid images are incorrectly pushed away, and semantic confusion, where different modification intents collapse into overlapping regions of the embedding space. As a result, the learned query representations often lack discriminativeness, particularly at fine-grained attribute modifications. To overcome these limitations, we propose distinctive query embeddings through learnable attribute weights and target relative negative sampling (DQE-CIR), a method designed to learn distinctive query embeddings by explicitly modeling target relative relevance during training. DQE-CIR incorporates learnable attribute weighting to emphasize distinctive visual features conditioned on the modification text, enabling more precise feature alignment between language and vision. Furthermore, we introduce target relative negative sampling, which constructs a target relative similarity distribution and selects informative negatives from a mid-zone region that excludes both easy negatives and ambiguous false negatives. This strategy enables more reliable retrieval for fine-grained attribute changes by improving query discriminativeness and reducing confusion caused by semantically similar but irrelevant candidates.

Research motivation and goals

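Improve the discriminativeness of CIR query embeddings beyond what standard contrastive learning provides.
Emphasize the key attributes conditioned on the modification text to enable fine-grained, attribute-centric retrieval.
Mitigate the relevance suppression and semantic confusion caused by images that are semantically related to the target but are not the target.
Propose a training scheme that strengthens ranking by selecting informative negatives from a target relative mid-zone.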
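
For context, the relevance suppression described above can be read directly off a common form of the contrastive (InfoNCE-style) loss. The notation here is an assumption made for illustration and is not taken from the paper: q is the composed query embedding, t^+ the ground-truth target, N the set of all other images in the batch, s a similarity function, and τ a temperature.

```latex
\mathcal{L}_{\mathrm{NCE}}
  = -\log
    \frac{\exp\bigl(s(q, t^{+})/\tau\bigr)}
         {\exp\bigl(s(q, t^{+})/\tau\bigr)
          + \sum_{j \in \mathcal{N}} \exp\bigl(s(q, t_{j})/\tau\bigr)}
```

Every t_j in N appears in the denominator, so minimizing the loss lowers s(q, t_j) for all of them, including images that would also be valid answers to the query.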

Proposed method

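Use BLIP-2 as the backbone to encode the reference image, the modification text, and the candidate images.
Introduce learnable attribute weights to build attribute-aware sub-queries (color and shape) and combine them into the final query embedding.
Define target relative negative sampling, which constructs a mid-zone of negatives from the Δ-score distribution, and train with a single negative drawn from that zone.
Apply a pairwise training objective that includes a KL divergence term, so that the composed query aligns with the target image and is separated from the mid-zone negative.
Integrate auxiliary attribute-oriented sub-queries with dedicated margin losses to enforce color-specific and shape-specific discriminativeness.
Train on an interval-based schedule that refreshes the target relative negative set so that it stays informative as the embedding space evolves.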
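
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the quantile bounds `lo_q` and `hi_q`, the `margin` value, and the definition of the Δ-score as `1 - cos(target, candidate)` are all hypothetical, and random vectors would stand in for BLIP-2 features. The KL term and the per-attribute margin losses are left out.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def fuse_query(base_q, sub_queries, attr_logits):
    """Combine attribute sub-queries (e.g. color, shape) into one query.

    attr_logits stands in for the learnable attribute weights (assumption);
    in training they would be parameters updated by gradient descent.
    """
    w = softmax(attr_logits)          # weights are positive and sum to 1
    fused = base_q + w @ sub_queries  # (d,) + (k,) @ (k, d) -> (d,)
    return l2_normalize(fused), w

def mid_zone_indices(target, candidates, lo_q=0.2, hi_q=0.8):
    """Indices of candidates whose delta score falls in the mid-zone.

    delta_i = 1 - cos(target, candidate_i) is an assumed stand-in for the
    paper's delta score. Small delta: likely false negative. Large: easy negative.
    """
    delta = 1.0 - l2_normalize(candidates) @ l2_normalize(target)
    lo, hi = np.quantile(delta, [lo_q, hi_q])
    return np.flatnonzero((delta >= lo) & (delta <= hi)), delta

def pairwise_margin_loss(query, target, negative, margin=0.2):
    """Hinge loss on a single pair: zero once s(q, t) >= s(q, n) + margin."""
    s_pos = float(query @ target)
    s_neg = float(query @ negative)
    return max(0.0, margin - (s_pos - s_neg))
```

In a training loop one would recompute `mid_zone_indices` every few epochs (the interval-based refresh), draw one index from the result per query, and pass the corresponding candidate to `pairwise_margin_loss` as the single negative.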

Experimental results

Research questions

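RQ1: Can learnable attribute weights yield more discriminative query embeddings for CIR by emphasizing the key attributes in the modification text?
RQ2: Does target relative negative sampling improve fine-grained discrimination and reduce relevance suppression in CIR training?
RQ3: Does a single-negative pairwise ranking objective, supplemented with attribute-specific margins and KL guidance, outperform standard contrastive learning objectives in CIR?
RQ4: How does DQE-CIR perform on FashionIQ and CIRR, in supervised and zero-shot settings, in terms of global retrieval and fine-grained attribute alignment?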

Key results

Method	Dress R@10	Dress R@50	Shirt R@10	Shirt R@50	Toptee R@10	Toptee R@50	Average R@10	Average R@50
CoSMo	23.60	49.18	18.11	43.18	24.63	54.31	22.11	48.89
MGUR	23.15	48.74	18.99	43.47	25.55	52.83	22.56	48.35
CLIP4Cir	38.32	63.90	44.31	65.41	47.27	70.98	43.30	66.76
Bi-BLIP4CIR	39.12	62.92	39.21	62.81	44.37	67.06	40.90	64.26
CoVR	44.55	69.03	48.43	67.42	52.60	74.31	48.53	70.25
SPRC	45.71	70.00	51.37	72.77	55.48	77.46	50.86	73.41
QuRe	46.80	69.81	53.53	72.87	57.47	77.77	52.60	73.48
DQE-CIR	48.47	71.09	55.94	74.62	59.38	79.12	54.60	75.94
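
For readers unfamiliar with the metric, the sketch below shows one conventional way to compute Recall@K from a query-by-candidate similarity matrix, and how an Average column can be derived as the mean over the three categories. It assumes one ground-truth target per query and that the Average columns are plain means of the Dress, Shirt, and Toptee values; the function names are hypothetical.

```python
import numpy as np

def recall_at_k(sim, target_idx, k):
    """Percentage of queries whose target is among the k top-scoring candidates.

    sim: (num_queries, num_candidates) similarity matrix.
    target_idx: index of the ground-truth candidate for each query.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]  # candidate indices, best first
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return 100.0 * hits.mean()

def category_average(values):
    """Mean over categories, rounded to two decimals like the table."""
    return round(float(np.mean(values)), 2)

# QuRe R@10 over Dress, Shirt, Toptee, taken from the table above
print(category_average([46.80, 53.53, 57.47]))  # 52.6
```

Applying the same mean to the DQE-CIR row gives 74.94 for R@50 (from 71.09, 74.62, and 79.12), not the listed 75.94, so one of those four entries may have been transcribed incorrectly.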

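DQE-CIR consistently outperforms prior CIR methods on the Dress, Shirt, and Toptee categories of FashionIQ, with the best Recall@10 and Recall@50 in each.
On FashionIQ, DQE-CIR also has higher average Recall@10 and Recall@50 than prior methods, which suggests stronger overall retrieval and attribute alignment.
On CIRR, DQE-CIR achieves the top Recall@K at every evaluated rank and the best Recall_subset@K, demonstrating strong target discrimination even within visually similar subsets.
The reported ablation studies confirm the importance of target relative negative sampling and attribute-aware pairwise training for reliable CIR.
Qualitative results show that DQE-CIR retrieves images satisfying multiple attribute-modification conditions more accurately than the baselines.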
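
The subset metric mentioned for CIRR differs from ordinary Recall@K in that the target only has to outrank the other members of a small group of visually similar images, not the whole gallery. The sketch below illustrates that idea for a single query; the function name and the way the subset is passed in are assumptions, and the official CIRR evaluation code may differ in its details.

```python
import numpy as np

def recall_subset_at_k(sim_row, target_idx, subset_idx, k):
    """Return 1.0 if the target ranks in the top k among subset members only.

    sim_row: similarities between one query and every gallery image.
    subset_idx: gallery indices of the visually similar group (incl. target).
    """
    subset_idx = np.asarray(subset_idx)
    # Sort only the subset members by similarity, best first
    order = subset_idx[np.argsort(-sim_row[subset_idx])]
    return float(target_idx in order[:k])

sim_row = np.array([0.9, 0.2, 0.8, 0.5, 0.7])
# Target 3 is 4th in the full gallery but 2nd within the subset {1, 3, 4}
print(recall_subset_at_k(sim_row, 3, [1, 3, 4], 1))  # 0.0
print(recall_subset_at_k(sim_row, 3, [1, 3, 4], 2))  # 1.0
```

Averaging this value over all queries and multiplying by 100 gives a percentage comparable to the Recall@K figures.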

This review was generated by AI and reviewed by a human editor.