QUICK REVIEW

[Paper Review] GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Xin Li, Dongze Lian|arXiv (Cornell University)|2023. 09. 24.

Multimodal Machine Learning Applications | Citations: 21

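One-Line Summary

GraphAdapter introduces a dual knowledge graph, made up of a textual and a visual sub-graph, to guide the textual adapter when tuning vision-language models. It uses a GCN to fuse intra-modality and cross-modality structure knowledge, which improves few-shot performance on 11 benchmarks.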

ABSTRACT

Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs) under the low-data regime, where only a few additional parameters are introduced to excavate the task-specific knowledge based on the general and powerful representation of VLMs. However, most adapter-style works face two limitations: (i) modeling task-specific knowledge with a single modality only; and (ii) overlooking the exploitation of the inter-class relationships in downstream tasks, thereby leading to sub-optimal solutions. To mitigate that, we propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge (i.e., the correlation of different semantics/classes in textual and visual modalities) with a dual knowledge graph. In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively. This enables the textual feature of each prompt to leverage the task-specific structure knowledge from both textual and visual modalities, yielding a more effective classifier for downstream tasks. Extensive experimental results on 11 benchmark datasets reveal that our GraphAdapter significantly outperforms previous adapter-based methods. The code will be released at https://github.com/lixinustc/GraphAdapter

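Research Motivation and Goals

Advance efficient transfer learning for VLMs in low-data regimes without tuning all parameters.
Model task-specific knowledge using both textual and visual structure knowledge.
Use a dual-modality graph to inform the textual adapter through a graph convolutional network (GCN).
Demonstrate better performance than existing adapter-based and prompt-based ETL methods across diverse datasets.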

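Proposed Method

Define a dual knowledge graph, consisting of a textual sub-graph and a visual sub-graph, that stores semantics and inter-class relationships.
Build the textual nodes from the per-class mean prompt features, with edges given by the cosine similarity between textual features.
Build the visual nodes from the per-class mean visual features, with edges given by the cosine similarity between visual features.
Transform the textual feature z_t with a GCN over both the textual graph and the visual graph to obtain an enriched representation.
Fuse the intra-modality and cross-modality structure knowledge with a learnable fusion weight beta, then apply a residual adapter with weight alpha.
Train only the GCN, optimizing classification with a cross-entropy loss.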
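
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the clipping of negative similarities, the symmetric degree normalization, the tanh activation, the single GCN layer, and the exact placement of alpha and beta in the two weighted sums are all choices made here for clarity. Both weights are plain floats in this sketch, and nothing is trained.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def cosine_adjacency(nodes):
    """Edges of a sub-graph: pairwise cosine similarity between node features.

    Assumption: negative similarities are clipped to zero so that every
    node degree is positive (the diagonal is 1, so each degree is >= 1).
    """
    unit = l2_normalize(nodes)
    return np.maximum(unit @ unit.T, 0.0)


def gcn_step(query, nodes, weight):
    """One graph-convolution step for a query feature attached to a sub-graph.

    The query is stacked on top of the sub-graph nodes, messages are
    aggregated with a symmetrically normalized adjacency matrix, and the
    updated query row is returned.
    """
    x = np.vstack([query[None, :], nodes])        # (K + 1, d)
    adj = cosine_adjacency(x)                     # (K + 1, K + 1)
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.tanh(adj_norm @ x @ weight)[0]      # activation is an assumption


def graph_adapter(z_t, text_nodes, visual_nodes, w_text, w_visual, alpha, beta):
    """Adapt one textual feature z_t with both sub-graphs.

    beta mixes the intra-modality (textual graph) and cross-modality
    (visual graph) outputs; alpha weights the residual connection back
    to the original feature.
    """
    z_tt = gcn_step(z_t, text_nodes, w_text)      # intra-modality knowledge
    z_vt = gcn_step(z_t, visual_nodes, w_visual)  # cross-modality knowledge
    fused = beta * z_tt + (1.0 - beta) * z_vt
    return alpha * z_t + (1.0 - alpha) * fused


def classify(image_feature, adapted_text_features, scale=100.0):
    """Cosine-similarity logits between one image and the adapted class features."""
    return scale * l2_normalize(adapted_text_features) @ l2_normalize(image_feature)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_classes, dim = 5, 8
    text_nodes = rng.normal(size=(num_classes, dim))    # stand-in for mean prompt features
    visual_nodes = rng.normal(size=(num_classes, dim))  # stand-in for mean visual features
    w_text = rng.normal(scale=0.1, size=(dim, dim))
    w_visual = rng.normal(scale=0.1, size=(dim, dim))
    adapted = np.stack([
        graph_adapter(z, text_nodes, visual_nodes, w_text, w_visual, alpha=0.7, beta=0.6)
        for z in text_nodes
    ])
    logits = classify(rng.normal(size=dim), adapted)
    print(logits.shape)
```

In a real setup the node features would come from a frozen CLIP text and image encoder, and only the two GCN weight matrices would receive gradients from the cross-entropy loss over the logits.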

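Experimental Results

Research Questions

RQ1: Can an explicit dual-modality structure graph improve the extraction of task-specific knowledge from VLMs in few-shot settings?
RQ2: How do integrating the textual and visual graphs, and the interaction between them, affect the quality of the textual adapter?
RQ3: What is the relative importance of textual versus visual structure knowledge for downstream classification?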

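Key Results

GraphAdapter consistently outperforms prior ETL methods (prompt-style and adapter-style) on 11 few-shot benchmarks.
In the 16-shot evaluation, GraphAdapter reaches an average accuracy of 76.22% (versus 75.65–76.87% for some baselines), with notable gains on fine-grained datasets such as FGVCAircraft.
The ablation study shows that the textual knowledge sub-graph matters more than the visual sub-graph, but modeling both graphs together gives the best results.
GraphAdapter generalizes across several CLIP backbones (ResNet-50/101, ViT-B/32, ViT-B/16) and keeps its gains in cross-domain tests (ImageNet-V2, -Sketch, -A, -R).
Exploiting dual-modality structure knowledge through the GCN and residual fusion is the key to its improvement over previous adapters.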


This review was generated by AI and checked by a human editor.