QUICK REVIEW

[Paper Review] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin | arXiv (Cornell University) | 2021-04-28

Multimodal Machine Learning Applications | References: 39 | Citations: 280

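One-Line Summary

ViLD distills knowledge from an open-vocabulary image classifier into a two-stage detector, enabling open-vocabulary object detection with high average precision on novel categories and direct transfer across datasets.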

ABSTRACT

We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. It is costly to further scale up the number of classes contained in existing object detection datasets. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories that are not seen during training. ViLD obtains 16.1 mask AP$_r$ with a ResNet-50 backbone, even outperforming the supervised counterpart by 3.8. When trained with a stronger teacher model ALIGN, ViLD achieves 26.3 AP$_r$. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$ on PASCAL VOC, 36.6 AP on COCO and 11.8 AP on Objects365. On COCO, ViLD outperforms the previous state-of-the-art by 4.8 on novel AP and 11.4 on overall AP. Code and demo are open-sourced at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/vild.

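Motivation and Objectives

Detect objects described by arbitrary text inputs without requiring large amounts of detection annotations for novel categories.
Use a pretrained open-vocabulary image classifier as the teacher to supervise a two-stage detector.
Develop the ViLD components (ViLD-text and ViLD-image), which align region embeddings with text and image embeddings.
Demonstrate open-vocabulary detection performance on LVIS and show transfer to other detection datasets.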

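Proposed Method

ViLD-text: replace the standard classifier of a two-stage detector with text embeddings from a pretrained open-vocabulary model.
ViLD-image: distill image embeddings from the pretrained image encoder into the region embeddings of Mask R-CNN using an L1 loss.
Combine ViLD-text and ViLD-image into the joint training objective L_ViLD = L_ViLD-text + w * L_ViLD-image, where w weights the image-distillation term.
At inference, use text embeddings in the same way for both base and novel categories, enabling open-vocabulary detection over C_B ∪ C_N (base plus novel categories).
Optionally apply model ensembling (ViLD-ensemble or ViLD-text+CLIP) to improve base- and novel-category performance.
Show that distillation works with different teacher models (CLIP, ALIGN) and that the detector transfers without fine-tuning.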
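
The joint objective above can be illustrated with a small, dependency-free Python sketch. It is not the authors' TensorFlow implementation: the function names, the background embedding, the temperature, and the value w=0.5 are all assumptions made for illustration, and embeddings are plain Python lists rather than tensors. It shows one region's ViLD-text term (softmax cross-entropy over cosine similarities to category text embeddings), its ViLD-image term (L1 distance to the teacher's image embedding), and their weighted sum.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def vild_text_loss(region_emb, text_embs, background_emb, label, temperature=0.01):
    """Softmax cross-entropy over cosine similarities (ViLD-text style).

    Candidates are [background] + base-category text embeddings, so
    label 0 means background and label k >= 1 means the k-th base category.
    The background embedding and temperature are assumptions of this sketch.
    """
    r = l2_normalize(region_emb)
    candidates = [background_emb] + list(text_embs)
    logits = [
        sum(a * b for a, b in zip(r, l2_normalize(c))) / temperature
        for c in candidates
    ]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def vild_image_loss(region_emb, teacher_image_emb):
    """L1 distance between a student region embedding and the teacher's image embedding."""
    return sum(abs(a - b) for a, b in zip(region_emb, teacher_image_emb))

def vild_loss(text_loss, image_loss, w=0.5):
    """L_ViLD = L_ViLD-text + w * L_ViLD-image (w=0.5 is a placeholder value)."""
    return text_loss + w * image_loss
```

In a real detector both terms would be averaged over many proposals per batch; the sketch keeps a single region so that each term of the objective stays visible.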

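Experimental Results

Research Questions

RQ1: Does knowledge distillation from an open-vocabulary image classifier enable effective open-vocabulary object detection?
RQ2: How do the text-based and image-based distillation signals complement each other in detecting novel categories?
RQ3: How does a stronger teacher model (e.g., ALIGN) affect open-vocabulary detection performance?
RQ4: How well does a detector trained with ViLD transfer to other detection datasets without fine-tuning?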

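Key Results

With a ResNet-50 backbone, ViLD reaches 16.1 mask AP_r on LVIS novel (rare) categories, 3.8 AP_r above its supervised counterpart.
With the stronger teacher model ALIGN, ViLD reaches 26.3 AP_r on LVIS novel categories.
ViLD transfers directly, without fine-tuning, to PASCAL VOC (72.2 AP50), COCO (36.6 AP), and Objects365 (11.8 AP).
On COCO, ViLD outperforms the previous state of the art by 4.8 novel AP and 11.4 overall AP.
ViLD-text with CLIP text embeddings substantially improves novel-category AP_r over GloVe text embeddings (10.1 vs. 3.0).
Combining text-based and image-based distillation (ViLD-text + ViLD-image) improves novel-category performance.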
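
The novel-category and transfer results rest on the fact that, at inference, the classifier is just a set of category text embeddings, so adding novel categories or moving to another dataset amounts to changing that set. The Python sketch below illustrates this under stated assumptions: `classify_region` is a hypothetical helper, the 3-dimensional embeddings are made up, and the temperature is a placeholder. In a real system the embeddings would come from the teacher's text encoder and the detector's region head.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def classify_region(region_emb, category_embs, temperature=0.01):
    """Return (best_name, probabilities) for one region embedding.

    category_embs maps a category name to its text embedding; scores are
    a softmax over temperature-scaled cosine similarities.
    """
    r = l2_normalize(region_emb)
    names = list(category_embs)
    logits = []
    for name in names:
        t = l2_normalize(category_embs[name])
        logits.append(sum(a * b for a, b in zip(r, t)) / temperature)
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Toy embeddings, invented for illustration only.
base = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}   # C_B
novel = {"zebra": [0.0, 0.0, 1.0]}                         # C_N
open_vocab = {**base, **novel}                             # C_B ∪ C_N

region = [0.1, 0.2, 0.9]
best, probs = classify_region(region, open_vocab)
base_only_best, _ = classify_region(region, base)
```

With the toy values above, the region is assigned to the novel category once its text embedding is included, while restricting the vocabulary to base categories forces the nearest base label. No detector weights change between the two calls, only the set of text embeddings.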

This review was generated by AI and reviewed by a human editor.