QUICK REVIEW

[Paper Review] Side Adapter Network for Open-Vocabulary Semantic Segmentation

Mengde Xu, Zheng Zhang|arXiv (Cornell University)|2023. 02. 23.

Multimodal Machine Learning Applications | Citations: 14

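One-Line Summary

SAN attaches a lightweight side network to a frozen CLIP model to jointly generate mask proposals and the attention bias used for CLIP-based recognition, enabling end-to-end open-vocabulary semantic segmentation with substantial gains in efficiency and accuracy.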

ABSTRACT

This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN.

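Motivation and Objectives

Motivate open-vocabulary semantic segmentation built on vision-language pretraining (CLIP).
Introduce a lightweight side network that is trainable end-to-end while the CLIP backbone stays frozen.
Decouple mask proposal generation from CLIP-based recognition, linking the two through attention bias.
Achieve CLIP-aware mask prediction with minimal additional parameters and computation.
Demonstrate state-of-the-art performance with an efficiency advantage across multiple benchmarks.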

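Proposed Method

Attach a Side Adapter Network (SAN) with two branches to a frozen CLIP model: one generates mask proposals, the other predicts the attention bias used for mask recognition.
Use asymmetric input resolutions: low-resolution CLIP features for CLIP-based recognition and a high-resolution SAN input for mask proposals.
Fuse CLIP's visual tokens into SAN and apply decoupled heads to produce mask proposals and recognition bias.
Compute the segmentation as S = M P^T, where M denotes the mask proposals and P the class scores obtained with the attention bias.
Train end-to-end with mask prediction losses (Dice and BCE) and a mask classification loss (cross-entropy).
Optionally fine-tune the CLIP positional embeddings and use prompt engineering to improve zero-shot recognition.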
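
The composition S = M P^T from the list above can be illustrated with a toy example. This is a minimal sketch, assuming M is stored as a (pixels x N) matrix of mask-proposal probabilities and P as a (C x N) matrix of per-proposal class scores. These layouts, the helper names `compose_segmentation` and `argmax_labels`, and all numbers are illustrative assumptions, not taken from the paper's code.

```python
def compose_segmentation(M, P):
    """Return S = M P^T as nested lists.

    M: pixels x N mask-proposal probabilities (assumed layout).
    P: C x N class scores for each proposal (assumed layout).
    S: pixels x C per-pixel class scores.
    """
    return [
        [sum(m * p for m, p in zip(pixel_row, class_row)) for class_row in P]
        for pixel_row in M
    ]


def argmax_labels(S):
    """Pick the highest-scoring class index for every pixel."""
    return [max(range(len(row)), key=row.__getitem__) for row in S]


# Toy example: 4 pixels, 2 mask proposals, 3 classes (all values made up).
M = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
]
P = [
    [0.7, 0.1],  # class 0 scores for proposals 0 and 1
    [0.2, 0.1],  # class 1
    [0.1, 0.8],  # class 2
]

S = compose_segmentation(M, P)
labels = argmax_labels(S)
print(labels)
```

In this toy setup the first two pixels are dominated by proposal 0 and the last two by proposal 1, so each pixel inherits the class that its dominant proposal scores highest.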

Figure 2: Overview of our SAN. The red dotted lines indicate the gradient flow during training. In our framework, the frozen CLIP model still serves as a classifier, and the side adapter network generates mask proposals and attention bias to guide the deeper layers of the CLIP model to predict pro
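
One common way to apply an attention bias is to add it to the attention logits before the softmax. The sketch below illustrates only that general idea for a single query. The function names, shapes, and values are hypothetical, and the exact form used inside SAN and CLIP is not specified in this review.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(query, keys, bias):
    """Attention weights of one query over visual tokens with an additive bias.

    query: list of d floats (a hypothetical per-proposal token).
    keys:  list of token vectors, each with d floats.
    bias:  one float per token; large negative values suppress a token.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    return softmax(logits)


# Toy example: 3 visual tokens with identical keys, so only the bias matters.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
no_bias = biased_attention_weights(query, keys, [0.0, 0.0, 0.0])
with_bias = biased_attention_weights(query, keys, [0.0, 0.0, -1e9])
print(no_bias)
print(with_bias)
```

Without a bias the three tokens receive equal weight. With a large negative bias on the third token, its weight collapses to zero and the remaining weight is split between the other two, which is how a per-proposal bias can steer attention toward one region.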

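Experimental Results

Research Questions

RQ1: Can open-vocabulary semantic segmentation be achieved without fine-tuning a large CLIP model on segmentation data?
RQ2: Can a lightweight side network leverage frozen CLIP features to generate CLIP-aware mask proposals and recognition bias in an end-to-end manner?
RQ3: How do feature fusion depth, input resolution, and decoupled heads affect performance and efficiency?
RQ4: How does SAN compare with two-stage or fully fine-tuned CLIP-based approaches in accuracy and efficiency across benchmarks?
RQ5: How does prompt engineering affect open-vocabulary segmentation performance?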

Key Results

Method	VL model	Training dataset	Ensemble	ADE-847	PC-459	ADE-150	PC-59	VOC
SimSeg	CLIP ViT-B/16	COCO	no.	7.0	8.7	20.5	47.7	88.4
MaskCLIP	CLIP ViT-L/14	COCO	no.	8.2	10.0	23.7	45.9	-
OvSeg*	CLIP ViT-B/16	COCO	yes.	7.1	11.0	24.8	53.3	92.6
SAN(ours)	CLIP ViT-B/16	COCO	no.	10.1 ±0.23	12.6 ±0.44	27.5 ±0.34	53.8 ±0.57	94.0 ±0.21
SAN ensemble	CLIP ViT-B/16	COCO	yes.	10.7 ±0.22	13.7 ±0.34	28.9 ±0.42	55.4 ±0.11	94.6 ±0.11
SAN(ours)	CLIP ViT-L/14	COCO	no.	12.4 ±0.27	15.7 ±0.26	32.1 ±0.42	57.7 ±0.34	94.6 ±0.42
SAN ensemble	CLIP ViT-L/14	COCO	yes.	13.7 ±0.12	17.1 ±0.18	33.3 ±0.29	60.2 ±0.31	95.5 ±0.16

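With CLIP ViT-L/14, SAN reaches 12.4 mIoU on ADE-847, 15.7 on PC-459, 32.1 on ADE-150, 57.7 on PC-59, and 94.6 on VOC, achieving state-of-the-art performance and surpassing prior methods.
With ViT-B/16, SAN reaches 10.1 mIoU on ADE-847, 12.6 on PC-459, 27.5 on ADE-150, 53.8 on PC-59, and 94.0 on VOC without fine-tuning CLIP.
Ensembling SAN with a COCO-tuned model improves the ViT-L/14 results further, to 13.7 on ADE-847, 17.1 on PC-459, 33.3 on ADE-150, 60.2 on PC-59, and 95.5 on VOC.
SAN has only 8.4M trainable parameters and requires 64.3 GFLOPs, considerably less than competing methods.
Ablations show that deep CLIP feature fusion and decoupled heads each contribute to the performance gain, and that end-to-end CLIP-aware mask prediction is key to the strong results.
Prompt engineering provides a measurable gain of about 1.2 mIoU on ADE-150 and ADE-847.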
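
The margins implied by the table can be checked with simple arithmetic. The snippet below copies the mean mIoU values of the CLIP ViT-B/16 rows for OvSeg* and single-model SAN from the table above and computes the per-benchmark difference. The variable names are illustrative.

```python
BENCHMARKS = ["ADE-847", "PC-459", "ADE-150", "PC-59", "VOC"]

# Mean mIoU values copied from the results table (CLIP ViT-B/16, trained on COCO).
ovseg = dict(zip(BENCHMARKS, [7.1, 11.0, 24.8, 53.3, 92.6]))
san = dict(zip(BENCHMARKS, [10.1, 12.6, 27.5, 53.8, 94.0]))

# Absolute mIoU gain of SAN over OvSeg* on each benchmark, rounded to one decimal.
gains = {name: round(san[name] - ovseg[name], 1) for name in BENCHMARKS}

for name in BENCHMARKS:
    print(f"{name}: +{gains[name]} mIoU")
```

Note that the OvSeg* row is marked as an ensemble in the table while this SAN row is not, so the comparison is between a single SAN model and an ensembled baseline.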

Figure 3: The architecture of the side adapter network. The side adapter network projects the input image to visual tokens and appends query tokens to them at the beginning. Further, it fuses the immediate features of the CLIP model in the middle of transformer layers. The query and visual features

This review was generated by AI and reviewed by a human editor.