QUICK REVIEW

[Paper Review] Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

Aditya Sarkar, Yi (Joy) Li | arXiv (Cornell University) | 2026-01-30

Multimodal Machine Learning Applications | Citations: 0

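One-Line Summary

The paper proposes MA-PaPSP, a memory-augmented, training-free selective prediction method that uses retrieval-based proxy embeddings and contrastive normalization to improve open-set selective prediction and can be attached to any vision-language model.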

ABSTRACT

Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at https://github.com/kingston-aditya/MA-PaPSP.

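Motivation and Objectives

Extend selective prediction to open-set vision-language tasks (e.g., captioning, image-text matching (ITM), fine-grained classification).
Provide a lightweight, training-free module that can be attached to any VLM to produce calibrated confidence scores.
Reduce the embedding instability and score miscalibration of the external VLRMs used in PaPSP.
Demonstrate these benefits across multiple datasets and foundation models, covering captioning, ITM, and classification.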

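Proposed Method

Attach a Plug-and-Play Selective Prediction (PaPSP) module to the predictor VLM (P-VLM), using an external SP-VLM that shares the retrieval space.
Introduce memory augmentation: retrieve image-text pairs from a retrieval dataset and form proxy embeddings for the query by averaging its nearest neighbors.
Compute a contrastive score that compares the similarity between the query and the predicted caption against hard negatives, which improves calibration (contrastive normalization).
Use the query, proxy, and score-type variants (image-text, text-text, unimodal or cross-modal) summarized in Table 1, with equations given for the proxy embedding (Eq. 6) and the contrastive score (Eq. 8).
To address instability and miscalibration, optionally replace the plain CLIP similarity score with a proxy-based or contrastive score (Figures 2 and 3).
Evaluate MA-PaPSP on selective captioning, image-text matching, and classification across multiple P-VLMs and datasets.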
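
The retrieval and scoring steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: this review does not reproduce Eq. 6 or Eq. 8, so the proxy is assumed to be the normalized mean of the values paired with the k nearest memory keys, and the contrastive score is assumed to be a softmax of the candidate's similarity against hard negatives. The names `proxy_embedding`, `contrastive_score`, `accept`, `k`, `temperature`, and `threshold` are hypothetical, and the data in the demo is synthetic.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def proxy_embedding(query, memory_keys, memory_values, k=3):
    """Average the values of the k memory entries whose keys are closest to the query.

    Assumption: keys are image embeddings and values are the paired text
    embeddings of a retrieval dataset; the paper's Eq. 6 may differ.
    """
    sims = l2_normalize(memory_keys) @ l2_normalize(query)
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar keys
    return l2_normalize(memory_values[top_k].mean(axis=0))


def contrastive_score(anchor, candidate, negatives, temperature=0.1):
    """Softmax probability of the candidate against hard negatives.

    Assumption: a softmax-style normalization; the paper's Eq. 8 may differ.
    """
    anchor = l2_normalize(anchor)
    texts = l2_normalize(np.vstack([candidate[None, :], negatives]))
    logits = texts @ anchor / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[0])


def accept(score, threshold=0.5):
    """Reject option: keep the prediction only if the score clears the threshold."""
    return score >= threshold


# Synthetic demo: random vectors stand in for SP-VLM embeddings.
rng = np.random.default_rng(0)
dim = 8
memory_keys = rng.normal(size=(20, dim))
memory_values = memory_keys + 0.1 * rng.normal(size=(20, dim))
query = memory_keys[0] + 0.05 * rng.normal(size=dim)

proxy = proxy_embedding(query, memory_keys, memory_values, k=3)
score = contrastive_score(proxy, memory_values[0], rng.normal(size=(4, dim)))
print(score, accept(score))
```

In a real setup the random vectors would be replaced by embeddings from an external model such as CLIP, and the threshold would be tuned on held-out data to reach a target coverage.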

Figure 1: PaPSP uses an external representation model and the CLIP score to enable selective prediction for VLM tasks like captioning without training. MA-PaPSP augments this model with an external dataset, which is leveraged to estimate proxy embeddings of greater stability and better calibrated co

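Experimental Results

Research Questions

RQ1: How can PaPSP be extended, without training, to open-set tasks with unbounded vocabularies?
RQ2: Can memory augmentation through retrieval and contrastive normalization stabilize SP-VLM embeddings and calibrate their scores?
RQ3: Do the benefits of MA-PaPSP hold across task types (captioning, ITM, classification) and across P-VLM/SP-VLM configurations?
RQ4: How does the type of retrieval dataset (in-domain, out-of-domain, mixed) affect selective prediction performance?
RQ5: Is MA-PaPSP more efficient than LLM-based judges for selective prediction?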

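Key Results

MA-PaPSP consistently outperforms PaPSP on captioning, ITM, and classification across multiple datasets.
Larger SP-VLMs generally perform better, and the gains from MA-PaPSP grow with model size.
Out-of-domain and mixed retrieval datasets generally yield stronger improvements than in-domain data, especially on open-set tasks such as captioning and ITM.
MA-PaPSP with a small SP-VLM can outperform PaPSP with a much larger SP-VLM, which highlights its efficiency advantage.
The proposed contrastive scores produce more stable and better calibrated confidence than non-contrastive scores across SP-VLM embedding spaces.
Retrieval-augmented proxy embeddings reduce representation instability and improve the reliability of selective prediction.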
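
As a toy illustration of why neighbor averaging can stabilize embeddings, the sketch below uses synthetic data only and does not reproduce any experiment from the paper. It assumes each retrieved embedding is the same concept vector plus independent Gaussian noise, in which case averaging k of them shrinks the per-dimension variance by roughly a factor of k. Real nearest neighbors are not independent copies of one concept, so the actual reduction would be smaller. The function name and all parameters are hypothetical.

```python
import numpy as np


def embedding_variance(k, noise_std=0.5, dim=16, trials=2000, seed=0):
    """Mean per-dimension variance of a proxy built by averaging k noisy copies."""
    rng = np.random.default_rng(seed)
    concept = rng.normal(size=dim)  # synthetic "true" concept embedding
    noisy = concept + noise_std * rng.normal(size=(trials, k, dim))
    proxies = noisy.mean(axis=1)  # average over the k retrieved neighbors
    return float(proxies.var(axis=0).mean())


single = embedding_variance(k=1)
averaged = embedding_variance(k=8)
print(f"k=1: {single:.4f}  k=8: {averaged:.4f}")
```

Under the independence assumption the two values should sit close to `noise_std**2` and `noise_std**2 / 8` respectively.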

Figure 2: Left: VLM problems. a) instability of representations: the representations of images (orange) and texts (blue) of the same concept can vary significantly, leading to unreliable similarity scores. b) poor calibration: distances between concepts of identical similarity (red ellipses) vary ac

This review was generated by AI and checked by a human editor.