QUICK REVIEW

[Paper Review] Scaling Vision Transformers to 22 Billion Parameters

Mostafa Dehghani, Josip Djolonga|arXiv (Cornell University)|Feb 10, 2023

Multimodal Machine Learning Applications | Cited by 118

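One-Sentence Summary

This paper presents ViT-22B, a 22B-parameter Vision Transformer with architectural changes for training stability and efficiency, which achieves state-of-the-art or competitive results on classification, zero-shot transfer, dense prediction, video, and fairness/robustness benchmarks.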

ABSTRACT

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

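Research Motivation and Goals

Demonstrate scalable training techniques that make it possible to train a 22B-parameter Vision Transformer (ViT-22B).
Evaluate ViT-22B on a diverse set of tasks: image classification, zero-shot transfer, dense prediction, and video.
Analyze how increasing model scale affects fairness, robustness, calibration, and alignment with human perception.
Show that large-scale ViTs can serve as effective teachers when distilled into smaller backbones.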

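Proposed Method

Introduces parallel layers and QK normalization, and omits biases, to stabilize and speed up training at scale.
Uses an asynchronous model-parallel approach with 2D mesh sharding on TPUv4 to maximize throughput (1.15k tokens/sec/core).
Shards both model parameters and activations to fit the large model and large batches, while overlapping computation with communication.
Pre-trains on a JFT-derived dataset of about 4B images, with 256 tokens per image, on a 177k-step schedule.
Evaluates on multiple downstream tasks using linear probing, locked-image tuning (LiT), and end-to-end fine-tuning.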
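
The parallel-layer and QK-normalization changes listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`layer_norm`, `attention_logit`, `parallel_block`) are hypothetical, the LayerNorm here has no learnable scale or offset, and vectors are plain lists rather than sharded tensors.

```python
import math


def layer_norm(vec, eps=1e-6):
    """LayerNorm over one vector, without learnable parameters (a simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def attention_logit(q, k, qk_norm=True):
    """Scaled dot-product logit for a single query/key pair.

    With qk_norm=True, the query and key are layer-normalized first, so the
    logit magnitude no longer depends on how large q and k have grown.
    """
    if qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def parallel_block(x, attn_fn, mlp_fn):
    """Parallel layer: attention and MLP both read the same normalized input,
    and their outputs are added to the residual stream in one step."""
    y = layer_norm(x)
    attn_out = attn_fn(y)
    mlp_out = mlp_fn(y)
    return [xi + ai + mi for xi, ai, mi in zip(x, attn_out, mlp_out)]


# Illustrative values only: scale a query/key pair by 1000x and compare logits.
q = [0.5, -1.0, 2.0, 0.1]
k = [1.5, 0.3, -0.7, 2.2]
big_q = [1000.0 * v for v in q]
big_k = [1000.0 * v for v in k]
print("without QK norm:", attention_logit(big_q, big_k, qk_norm=False))
print("with QK norm:   ", attention_logit(big_q, big_k, qk_norm=True))
```

In this simplified form, a normalized query or key has squared norm at most d, so the scaled logit is bounded by sqrt(d) in absolute value regardless of input scale. That bound is one way to see why normalizing queries and keys can help keep attention logits from growing uncontrollably in very large models.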

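Experimental Results

Research Questions

RQ1: Can ViT training at this scale, with the architectural changes, be made stable and efficient for ViT-22B?
RQ2: How does ViT-22B compare with prior ViT and LiT-based approaches on standard and out-of-distribution image classification tasks?
RQ3: Does scaling up ViT improve zero-shot transfer, cross-domain robustness, fairness, and human-alignment metrics?
RQ4: When used as a frozen backbone, does ViT-22B provide strong feature representations for dense prediction and video tasks?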

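Main Findings

As a frozen feature extractor, ViT-22B performs strongly on ImageNet (89.5% accuracy), and it reaches 85.9% zero-shot ImageNet accuracy when paired with a matching text tower.
Distilling ViT-22B into ViT-B/16 and ViT-L/16 yields state-of-the-art ImageNet accuracy for these two smaller models (88.6% and 89.6%, respectively).
Zero-shot results on ObjectNet improve with model scale, and ViT-22B sets a new state of the art on this challenging dataset.
ViT-22B aligns better with human shape preference (87% shape bias) and shows better fairness/robustness trade-offs on subgroup and calibration metrics.
Dense prediction transfer (few-shot ADE20k) and monocular depth estimation benefit from ViT-22B features, which outperform the ViT-L and ViT-G baselines.
Video evaluation with a frozen ViT-22B backbone gives competitive results relative to the prior 4B-parameter model, though full fine-tuning still leaves room for improvement.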
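
The distillation finding above depends on training a small student to match the teacher's predicted class distribution. A common way to express that objective is the KL divergence between the teacher's and the student's softmax outputs. The sketch below assumes that form and a temperature parameter, neither of which is specified in this review, and the function names are hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two class distributions.

    Assumed objective for illustration; it is zero when the student
    reproduces the teacher's distribution exactly and positive otherwise.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Illustrative logits only (not taken from the paper).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 1.5, 0.0]
print("loss vs. mismatched student:", distillation_loss(teacher, student))
print("loss vs. identical student: ", distillation_loss(teacher, teacher))
```

Because the loss is computed on distributions rather than hard labels, the student also learns from the relative probabilities the teacher assigns to incorrect classes.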

This review was generated by AI and checked by a human editor.