QUICK REVIEW

[Paper Review] Florence: A New Foundation Model for Computer Vision

Lu Yuan, Dongdong Chen|arXiv (Cornell University)|Nov 22, 2021

Multimodal Machine Learning Applications|References 55|Cited by 340

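One-Sentence Summary

Florence is a large-scale vision-language foundation model that expands representations from scene to object, from image to video, and from RGB to multiple modalities, achieving state-of-the-art transfer performance and adapting to a wide range of tasks.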

ABSTRACT

Automated visual understanding of our diverse and open world demands computer vision models to generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale dataset and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image caption, video retrieval and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general purpose vision tasks. Florence achieves new state-of-the-art results in majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and the top-5 accuracy of 97.18, 62.4 mAP on COCO fine tuning, 80.36 on VQA, and 87.8 on Kinetics-600.

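Research Motivation and Objectives

Define a computer vision foundation model as a pretrained model plus multi-task adapters spanning the space, time, and modality axes.
Build a unified, Web-scale image-text pretraining framework with a two-tower architecture.
Develop adapters for object-level, video, and vision-language tasks to enable broad transfer.
Optimize the training infrastructure so that pretraining on large-scale datasets scales more efficiently.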

Proposed Method

Curate FLD-900M, a filtered dataset of 900 million image-text pairs, and train on it with UniCL (unified image-text contrastive learning).
Pretrain a two-tower Florence model with an image encoder (CoSwin/Hierarchical ViT) and a language encoder (12-layer transformer) using UniCL in an image-label-description space.
Extend representations to object-level via Dynamic Head adapters and FLOD-9M for object detection pretraining.
Incorporate V+L capabilities using METER adapter for fine-grained fusion and pretraining with ITM and MLM losses.
Adapt to video with Video CoSwin adapter by converting 2D to 3D tokens and adjusting attention/positional embeddings.
Demonstrate scalable training techniques (ZeRO, activation checkpointing, mixed precision, gradient cache) to enable large-batch, large-scale training.
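
The contrastive objective named in the steps above can be sketched as follows. This is a minimal pure-Python illustration, assuming the bidirectional supervised contrastive form usually associated with UniCL: every text whose label matches an image's label counts as a positive for that image, and vice versa. The function name `unicl_loss`, the temperature multiplier `tau`, and the averaging over the batch are assumptions made for illustration, not details taken from this review.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def unicl_loss(image_feats, text_feats, labels, tau=1.0):
    """Bidirectional label-aware contrastive loss over one batch (illustrative).

    image_feats[i] and text_feats[i] form a pair that shares labels[i].
    All pairs with the same label are treated as positives for each other.
    """
    u = [l2_normalize(f) for f in image_feats]
    v = [l2_normalize(f) for f in text_feats]
    n = len(labels)

    # sim[i][j] = tau * cosine similarity between image i and text j
    sim = [[tau * sum(a * b for a, b in zip(u[i], v[j])) for j in range(n)]
           for i in range(n)]

    def one_direction(logits):
        total = 0.0
        for i in range(n):
            row = logits[i]
            # Numerically stable log-sum-exp for the softmax denominator
            m = max(row)
            log_denom = m + math.log(sum(math.exp(s - m) for s in row))
            positives = [k for k in range(n) if labels[k] == labels[i]]
            # Average log-probability assigned to the positives of sample i
            total -= sum(row[k] - log_denom for k in positives) / len(positives)
        return total / n  # batch averaging is an assumption of this sketch

    sim_t = [list(col) for col in zip(*sim)]  # text-to-image logits
    return one_direction(sim) + one_direction(sim_t)
```

When every pair carries a unique label, as with free-form Web captions, the positive set shrinks to the matched pair and the loss reduces to a CLIP-style image-text contrastive loss.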

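Experimental Results

Research Questions

RQ1: What constitutes a true computer vision foundation model across space, time, and modality?
RQ2: Can a single pretrained model combined with lightweight adapters reach state-of-the-art performance on diverse CV tasks (classification, retrieval, detection, VQA, captioning, video tasks) under zero-shot, few-shot, and full fine-tuning settings?
RQ3: How do Web-scale image-text data and a unified learning objective affect transfer across vision tasks and modalities?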

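Key Findings

Florence sets new state-of-the-art results on the majority of 44 representative benchmarks, including ImageNet-1K zero-shot classification with 83.74 top-1 and 97.18 top-5 accuracy.
COCO fine-tuning reaches 62.4 mAP; the VQA score reaches 80.36; Kinetics-600 accuracy reaches 87.8%.
Zero-shot transfer wins on 9 of 12 classification tasks, and linear probing leads on 9 of the 11 datasets in the evaluation set.
In zero-shot image-text retrieval on Flickr30K and MSCOCO, Florence is competitive with or better than existing methods and surpasses previous zero-shot approaches.
Object detection with FLOD-9M and Dynamic Head achieves strong AP on COCO and other detection benchmarks (e.g., COCO AP 62.0 after fine-tuning).
Florence shows strong cross-domain few-shot results on the CD-FSL benchmark, surpassing previous single-model baselines in several settings.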
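
To make the zero-shot top-1 and top-5 figures above concrete, the sketch below shows how such accuracies are typically computed for a two-tower model: each class name is embedded by the text encoder, each image by the image encoder, and the classes are ranked by cosine similarity. It assumes the common CLIP-style protocol; the embeddings are toy placeholders, and the helper names `zero_shot_topk` and `topk_accuracy` are hypothetical rather than taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_topk(image_emb, class_text_embs, k):
    """Return the indices of the k classes most similar to the image embedding."""
    scores = [cosine(image_emb, t) for t in class_text_embs]
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return ranked[:k]


def topk_accuracy(image_embs, targets, class_text_embs, k):
    """Fraction of images whose true class appears among the top-k predictions."""
    hits = sum(
        1
        for emb, target in zip(image_embs, targets)
        if target in zero_shot_topk(emb, class_text_embs, k)
    )
    return hits / len(targets)


# Toy stand-ins for encoder outputs: three classes, three images.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4]]
targets = [0, 1, 2]

top1 = topk_accuracy(image_embs, targets, class_text_embs, k=1)
top2 = topk_accuracy(image_embs, targets, class_text_embs, k=2)
```

No labeled training images from the target dataset are used here; only the class names enter, through the text embeddings, which is what makes the evaluation zero-shot.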

This review was generated by AI and reviewed by a human editor.