QUICK REVIEW

[Paper Review] Linearly Mapping from Image to Text Space

Jack Merullo, Louis Castricato | arXiv (Cornell University) | September 30, 2022

Multimodal Machine Learning Applications | Citations: 25

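One-Line Summary

This paper shows that a single linear projection can map image representations into the input space of a frozen language model, where they serve as prompts for generating captions and answering questions. The approach achieves competitive vision-language performance without tuning either the LM or the image encoder. Performance depends on how much linguistic supervision the image encoder received during pretraining.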

ABSTRACT

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images. Code is available here: https://github.com/jmerullo/limber

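Motivation and Goals

Test whether frozen text-only LMs can describe images when given linearly mapped image representations (soft prompts) as input.
Investigate how image encoders pretrained with different amounts of linguistic supervision transfer visual concepts to the LM.
Compare LiMBeR against jointly tuned multimodal baselines to assess whether end-to-end tuning is necessary.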

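Proposed Method

Learn a single linear projection P from the image encoder's representation space into the LM's input space to produce image prompts.
Train P on an image captioning objective while keeping the image encoder E and the language model LM frozen.
Evaluate transfer on VL tasks by prompting the LM to generate captions and answer questions, without fine-tuning the LM or the encoder.
Test several encoders that differ in linguistic supervision during pretraining, including CLIP RN50x16, NF-ResNet50, and BEIT-Large (and variants).
Train on Conceptual Captions 3M and compare against baselines including MAGMA and NFRN50 variants.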
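
The forward pass described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `linear_project` and `build_lm_input`, the bias term, the toy dimensions, and the choice to prepend the image prompt to the caption embeddings are all hypothetical. Training P with the captioning loss is not shown; in that step only `W` and `b` would receive gradient updates while E and the LM stay frozen.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear_project(features: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Apply one shared linear map P to each image feature vector.

    features: k vectors of size h_img (output of the frozen image encoder E)
    weight:   e_lm rows of size h_img (assumed to be the only trainable weights)
    bias:     e_lm values (the bias term is an assumption of this sketch)
    returns:  k vectors of size e_lm, i.e. the soft image prompt
    """
    prompt = []
    for feat in features:
        if len(feat) != len(weight[0]):
            raise ValueError("feature size must match the projection input size")
        prompt.append([
            sum(w * x for w, x in zip(row, feat)) + b
            for row, b in zip(weight, bias)
        ])
    return prompt


def build_lm_input(image_prompt: Matrix, token_embeddings: Matrix) -> Matrix:
    """Prepend the projected image vectors to the text token embeddings."""
    return image_prompt + token_embeddings


# Toy sizes (hypothetical): k=2 image vectors, h_img=3, e_lm=2.
image_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.5, -0.5]
caption_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

soft_prompt = linear_project(image_features, W, b)
lm_input = build_lm_input(soft_prompt, caption_embeddings)
```

Here `lm_input` holds the two projected image vectors followed by the three caption embeddings. A frozen LM would consume this sequence in place of ordinary token embeddings.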

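Experimental Results

Research Questions

RQ1: Can a linear mapping between an image encoder and a frozen LM support accurate image captioning and VQA without updating the LM or encoder weights?
RQ2: How does the amount of linguistic supervision in image encoder pretraining affect transfer to the LM?
RQ3: Is the representational similarity between vision and language spaces sufficient for effective zero-shot transfer across different encoders?
RQ4: What errors and limitations arise when visual information is transferred through linear prompts from different encoders?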

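Key Findings

A linear projection can convey visual information to a frozen LM and achieve competitive performance on caption generation and question answering.
Performance correlates with the image encoder's linguistic supervision: CLIP and NF-ResNet50 outperform BEIT on many VL tasks, although BEIT still transfers coarse perceptual information.
Fine-tuning the image encoder or the LM is not consistently better than the LiMBeR default of training only the linear projection.
Linguistically supervised encoders enable the transfer of lexical category concepts, whereas encoders trained on visual information alone mainly transfer coarse perceptual information.
BEIT prompts yield vaguer captions and struggle with precise lexical categorization, but still convey perceptual similarity to the LM.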


This review was generated by AI and checked by a human editor.