QUICK REVIEW

[Paper Review] MiniVLM: A Smaller and Faster Vision-Language Model

Jianfeng Wang, Xiaowei Hu | arXiv (Cornell University) | 2020-12-13

Multimodal Machine Learning Applications | References: 57 | Citations: 29

One-Line Summary

MiniVLM is a small, efficient vision-language model that retains 94–97% of the accuracy of a state-of-the-art model such as OSCAR${}_{\text{B}}$ while cutting model size by 73% and FLOPs by 99%. It uses a Two-stage Efficient feature Extractor (TEE) for fast visual feature extraction and a MiniLM-based transformer whose pre-training is improved with pseudo-labeled Open Images data and high-quality image tags.

ABSTRACT

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. The large models are used offline without adding any overhead in fine-tuning and inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while being able to retain $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the state-of-the-art VL research for on-the-edge applications.
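The percentage reductions quoted in the abstract can be easier to compare when read as "N times smaller or faster" factors. The following is a minimal sketch of that conversion; the helper name `reduction_to_factor` is hypothetical and only the three percentages come from the abstract.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into an 'N times smaller/faster' factor.

    A reduction of r percent leaves (100 - r) percent of the original cost,
    so the factor is 100 / (100 - r).
    """
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 100.0 / (100.0 - reduction_pct)


# Percentages as quoted in the abstract; the dictionary keys are informal labels.
abstract_reductions = {
    "visual feature extraction time": 95.0,
    "model size": 73.0,
    "inference time": 94.0,
}

for name, pct in abstract_reductions.items():
    factor = reduction_to_factor(pct)
    print(f"{name}: -{pct:.0f}% is roughly a {factor:.1f}x reduction")
```

Read this way, a 95% cut in feature extraction time corresponds to about a 20x speedup, and a 73% cut in model size to a model about 3.7x smaller.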

Motivation and Goals

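Develop a lightweight vision-language model that can be deployed on resource-constrained devices.
Reduce the computational cost of visual feature extraction without hurting downstream task performance.
Improve the pre-training of the small model by leveraging large models and large-scale datasets.
Reach high accuracy with a minimal parameter count and inference cost so that deployment on edge devices becomes practical.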

Proposed Method

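Design a Two-stage Efficient feature Extractor (TEE), inspired by EfficientDet, that cuts the cost of visual feature extraction by 99% relative to Faster R-CNN (the abstract reports a 95% reduction in time cost against its baseline).
Use the MiniLM architecture for the vision-language transformer to minimize computation while preserving performance.
Pre-train MiniVLM with 7 million Open Images samples that are pseudo-labeled by a state-of-the-art captioning model.
Incorporate high-quality image tags from a strong tagging model during pre-training to improve cross-modal alignment.
Use the large models only offline, for pre-training data generation and distillation, so that they add no overhead during fine-tuning or inference.
Optimize the vision module by simplifying the region head and replacing standard convolutions with depthwise and pointwise convolutions.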
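To illustrate why swapping a standard convolution for a depthwise plus pointwise pair shrinks a vision module, here is a minimal parameter-count sketch. The helper names and the layer shape (256 input channels, 256 output channels, 3x3 kernel) are illustrative assumptions and are not taken from the paper; bias terms are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, k = 256, 256, 3

std = standard_conv_params(c_in, c_out, k)
sep = separable_conv_params(c_in, c_out, k)
print(f"standard:  {std} weights")
print(f"separable: {sep} weights")
print(f"ratio:     {std / sep:.2f}x fewer weights")
```

For this shape the separable form needs 67,840 weights instead of 589,824. In general the saving approaches a factor of k squared as the number of output channels grows.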

Experimental Results

Research Questions

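RQ1: Can a substantially smaller and faster vision-language model be built while retaining most of the performance of a large model?
RQ2: How effective is a lightweight two-stage detector at visual feature extraction for vision-language tasks?
RQ3: How much do pre-training on pseudo-labeled data and high-quality tags contribute to the performance of a small model?
RQ4: What is the best trade-off among model size, FLOPs, and accuracy in a vision-language model?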

Key Results

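On COCO image captioning, MiniVLM reaches 97% of the CIDEr score of OSCAR${}_{\text{B}}$ (119.8 vs. 123.7) with only 27% of its parameters.
Across several downstream tasks it retains 94–97% of the accuracy while cutting FLOPs by 99%, to 1% of those of OSCAR${}_{\text{B}}$.
Using high-quality image tags during pre-training improves the CIDEr score by more than 2 points and VQA accuracy by more than 1 point compared with pre-training without tags.
TEE-0, whose backbone resembles EfficientDet-D0, is 3.7× smaller and 99× faster than the R101 Faster R-CNN while achieving comparable detection mAP on Visual Genome.
The MiniLM-based transformer offers a better speed-accuracy trade-off on vision-language tasks than other compact BERT variants.
Initializing the transformer randomly performs on par with initializing it from text-pretrained weights, which suggests that the small model can learn effectively from its own pre-training.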
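The "97% of CIDEr with 27% of the parameters" result can be checked with a little arithmetic. The sketch below recomputes the retained fraction from the two CIDEr scores quoted above; the helper name `retained_pct` is hypothetical.

```python
def retained_pct(small_score: float, large_score: float) -> float:
    """Percentage of the large model's score that the small model retains."""
    return 100.0 * small_score / large_score


# CIDEr scores as quoted in the key results above.
minivlm_cider = 119.8
oscar_b_cider = 123.7

cider_retained = retained_pct(minivlm_cider, oscar_b_cider)
print(f"CIDEr retained: {cider_retained:.1f}%")

# Keeping 27% of the parameters is the same statement as a 73% size reduction.
params_kept_pct = 27
print(f"Model size reduction: {100 - params_kept_pct}%")
```

The ratio comes out at about 96.8%, which rounds to the 97% quoted in the results.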

This review was generated by AI and checked by a human editor.