QUICK REVIEW

[Paper Review] MiniVLM: A Smaller and Faster Vision-Language Model

Jianfeng Wang, Xiaowei Hu|arXiv (Cornell University)|Dec 13, 2020

Multimodal Machine Learning Applications | References: 57 | Cited by: 29

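One-Sentence Summary

MiniVLM is a compact, efficient vision-language model that retains 94–97% of the accuracy of SOTA models such as OSCAR${}_{\text{B}}$ while using 73% fewer parameters and 99% fewer FLOPs. It uses a Two-stage Efficient feature Extractor (TEE) for fast visual feature extraction and a MiniLM-based transformer for vision-language fusion, and it improves pre-training with pseudo-labeled Open Images data and high-quality image tags.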

ABSTRACT

Recent vision-language (VL) studies have shown remarkable progress by learning generic representations from massive image-text pairs with transformer models and then fine-tuning on downstream VL tasks. While existing research has been focused on achieving high accuracy with large pre-trained models, building a lightweight model is of great value in practice but is less explored. In this paper, we propose a smaller and faster VL model, MiniVLM, which can be finetuned with good performance on various downstream tasks like its larger counterpart. MiniVLM consists of two modules, a vision feature extractor and a transformer-based vision-language fusion module. We design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, to significantly reduce the time cost of visual feature extraction by $95\%$, compared to a baseline model. We adopt the MiniLM structure to reduce the computation cost of the transformer module after comparing different compact BERT models. In addition, we improve the MiniVLM pre-training by adding $7M$ Open Images data, which are pseudo-labeled by a state-of-the-art captioning model. We also pre-train with high-quality image tags obtained from a strong tagging model to enhance cross-modality alignment. The large models are used offline without adding any overhead in fine-tuning and inference. With the above design choices, our MiniVLM reduces the model size by $73\%$ and the inference time cost by $94\%$ while being able to retain $94-97\%$ of the accuracy on multiple VL tasks. We hope that MiniVLM helps ease the use of the state-of-the-art VL research for on-the-edge applications.

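Motivation and Objectives

Develop a lightweight vision-language model suitable for deployment on resource-constrained devices.
Reduce the computational cost of visual feature extraction while giving up as little downstream-task performance as possible.
Improve the pre-training of small models by leveraging large models and large-scale datasets.
Achieve high accuracy with a very small parameter count and inference cost, so that the model can run on edge devices.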

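Proposed Method

Design a Two-stage Efficient feature Extractor (TEE), inspired by the one-stage EfficientDet network, that cuts the time cost of visual feature extraction by 95% relative to the Faster R-CNN baseline.
Adopt the MiniLM architecture for the vision-language transformer, chosen after comparing compact BERT variants, to minimize computation while preserving performance.
Pre-train MiniVLM with 7M additional Open Images samples that are pseudo-labeled by a SOTA captioning model.
Add high-quality image tags, produced by a strong tagging model, to pre-training in order to strengthen cross-modality alignment.
Decouple the large models from fine-tuning and inference by using them only offline, for pre-training data generation and distillation.
Streamline the vision module by simplifying the region head and replacing standard convolutions with depthwise and pointwise convolutions.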
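
The saving from the last item, replacing a standard convolution with a depthwise convolution followed by a pointwise one, can be illustrated with a simple parameter count. This is a minimal sketch under stated assumptions: the layer size (256 input and output channels, 3×3 kernel) and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Number of weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k + (c_in if bias else 0)   # one k x k filter per input channel
    pointwise = c_in * c_out + (c_out if bias else 0)  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer size only (an assumption, not a value from the paper).
c_in, c_out, k = 256, 256, 3

standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1/c_out + 1/k**2 when bias is off

print(f"standard conv:       {standard:>8d} weights")
print(f"depthwise+pointwise: {separable:>8d} weights")
print(f"ratio:               {ratio:.3f}")
```

Without bias terms the ratio is 1/c_out + 1/k², so for a 3×3 kernel the separable form needs roughly one ninth of the weights once the channel count is large. The same factor applies to multiply-accumulate operations per output position.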

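Experimental Results

Research Questions

RQ1: Can model size be cut substantially and inference made much faster while retaining most of the large model's performance?
RQ2: How effective is a lightweight two-stage detector at extracting visual features for vision-language tasks?
RQ3: How much does pre-training with pseudo-labeled data and high-quality tags improve a small model?
RQ4: What is the best trade-off among model size, FLOPs, and accuracy in a vision-language model?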

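Key Findings

MiniVLM reaches a CIDEr score of 119.8 on COCO image captioning, 97% of the score of OSCAR${}_{\text{B}}$ (123.7), while using only 27% of its parameters.
Across multiple downstream tasks the model retains 94–97% of the accuracy while cutting FLOPs by 99% (about 1% of those of OSCAR${}_{\text{B}}$).
Adding high-quality image tags during pre-training improves CIDEr by more than 2 points and VQA accuracy by more than 1 point over the setting without tags.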
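TEE-0, whose backbone is similar to that of EfficientDet-D0, has roughly 1/3.7 of the parameters of the R101 Faster R-CNN extractor and cuts the time cost of feature extraction by about 95%, while reaching a comparable detection mAP on Visual Genome.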
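Among the compact BERT variants compared, the MiniLM-based transformer offers the best speed-accuracy trade-off on vision-language tasks.
A randomly initialized transformer performs on par with one initialized from text pre-trained weights, suggesting that small models can learn effectively from self-supervised pre-training.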
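
The headline ratios quoted in the abstract and in the findings follow from simple arithmetic, which the sketch below reproduces. It uses only numbers stated in this review (CIDEr 119.8 vs. 123.7, a 73% size reduction, and the 95% and 94% time-cost reductions from the abstract). The helper names are hypothetical, and the conversion assumes that "time cost reduced by r" means the new time is (1 − r) times the old one.

```python
def retained_fraction(small_score: float, large_score: float) -> float:
    """Fraction of the large model's score that the small model keeps."""
    return small_score / large_score


def speedup_from_time_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (e.g. 0.95) into a speedup factor."""
    return 1.0 / (1.0 - reduction)


# CIDEr on COCO captioning: MiniVLM vs. OSCAR_B (numbers from the findings).
cider_ratio = retained_fraction(119.8, 123.7)

# A 73% reduction in model size leaves 27% of the parameters.
param_fraction = 1.0 - 0.73

# Time-cost reductions reported in the abstract.
extractor_speedup = speedup_from_time_reduction(0.95)   # feature extraction
end_to_end_speedup = speedup_from_time_reduction(0.94)  # overall inference

print(f"CIDEr retained:       {cider_ratio:.1%}")
print(f"parameters remaining: {param_fraction:.0%}")
print(f"extractor speedup:    {extractor_speedup:.1f}x")
print(f"end-to-end speedup:   {end_to_end_speedup:.1f}x")
```

Read this way, a 95% reduction in time cost corresponds to a speedup of about 20×, and a 94% reduction to about 16.7×. A 99% reduction in FLOPs is a different measure and corresponds to a 100× drop in operation count.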

This review was generated by AI and checked by a human editor.