QUICK REVIEW

[Paper Review] Efficient Training of Visual Transformers with Small Datasets

Yahui Liu, Enver Sangineto|arXiv (Cornell University)|Jun 7, 2021

Advanced Neural Network Applications | Cited by 84

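One-Sentence Summary

This paper analyzes Visual Transformers (VTs) on small datasets and proposes a self-supervised dense relative localization loss that regularizes VT training and improves accuracy when data are limited. The loss yields consistent, and sometimes dramatic, gains across multiple VT architectures and datasets.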

ABSTRACT

Visual Transformers (VTs) are emerging as an architectural paradigm alternative to Convolutional networks (CNNs). Differently from CNNs, VTs can capture global relations between image elements and they potentially have a larger representation capacity. However, the lack of the typical convolutional inductive bias makes these models more data-hungry than common CNNs. In fact, some local properties of the visual domain which are embedded in the CNN architectural design, in VTs should be learned from samples. In this paper, we empirically analyse different VTs, comparing their robustness in a small training-set regime, and we show that, despite having a comparable accuracy when trained on ImageNet, their performance on smaller datasets can be largely different. Moreover, we propose a self-supervised task which can extract additional information from images with only a negligible computational overhead. This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data are scarce. Our task is used jointly with the standard (supervised) training and it does not depend on specific architectural choices, thus it can be easily plugged in the existing VTs. Using an extensive evaluation with different VTs and datasets, we show that our method can improve (sometimes dramatically) the final accuracy of the VTs. Our code is available at: https://github.com/yhlleo/VTs-Drloc.

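Motivation and Objectives

Compare the robustness of different second-generation Visual Transformers when they are trained from scratch or with limited data.
Introduce a self-supervised auxiliary task that regularizes VT training without requiring additional annotations.
Evaluate the proposed method on diverse datasets and training protocols to quantify the gains.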

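Proposed Method

Represent an image as the final k×k grid of embeddings produced by the VT, and attach a lightweight MLP that predicts the relative distance between pairs of embeddings.
Define a dense relative localization loss that samples pairs of embeddings and regresses the MLP's prediction toward a target offset, namely the pair's distance on the normalized 2D grid.
Combine L_drloc with the standard cross-entropy loss into a multi-task objective with a fixed weight lambda.
Use a 7×7 grid for the localization task to ensure stable convergence across architectures.
Apply the localization MLP to the final token embeddings, leaving the base VT architecture unchanged.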
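The steps above can be illustrated with a small, framework-free sketch. It is not the authors' implementation (their code is at the repository linked in the abstract). It only shows the shape of the computation under stated assumptions: the target offset is taken to be the absolute grid displacement divided by k, the regression error is an L1 distance, and `predictor` is a hypothetical stand-in for the lightweight MLP. The function names and the value of `lam` are illustrative, not taken from the paper.

```python
import random


def sample_pairs(k, m, rng):
    """Sample m random pairs of positions ((i, j), (p, h)) on a k x k grid."""
    return [
        ((rng.randrange(k), rng.randrange(k)),
         (rng.randrange(k), rng.randrange(k)))
        for _ in range(m)
    ]


def target_offset(pos_a, pos_b, k):
    """Normalized 2D offset between two grid positions.

    Assumption: the target is |displacement| / k along each axis.
    """
    (i, j), (p, h) = pos_a, pos_b
    return (abs(i - p) / k, abs(j - h) / k)


def drloc_loss(grid, pairs, predictor, k):
    """Mean L1 error between predicted and target offsets over sampled pairs.

    grid[i][j] is the embedding (a list of floats) at grid position (i, j).
    predictor is a hypothetical stand-in for the small MLP: it takes the
    concatenation of two embeddings and returns a pair (d_u, d_v).
    """
    total = 0.0
    for pos_a, pos_b in pairs:
        e_a = grid[pos_a[0]][pos_a[1]]
        e_b = grid[pos_b[0]][pos_b[1]]
        d_u, d_v = predictor(e_a + e_b)  # list concatenation of the pair
        t_u, t_v = target_offset(pos_a, pos_b, k)
        total += abs(d_u - t_u) + abs(d_v - t_v)
    return total / len(pairs)


def total_loss(ce_loss, loc_loss, lam):
    """Multi-task objective: cross-entropy plus the weighted localization term."""
    return ce_loss + lam * loc_loss


if __name__ == "__main__":
    k = 7                      # 7x7 grid, as in the summary above
    rng = random.Random(0)
    # Toy embeddings that directly encode their own normalized position.
    grid = [[[i / k, j / k] for j in range(k)] for i in range(k)]
    pairs = sample_pairs(k, 64, rng)

    # A predictor that outputs zeros ignores spatial layout and is penalized.
    blind = lambda x: (0.0, 0.0)
    loc = drloc_loss(grid, pairs, blind, k)
    print("localization loss:", loc)
    print("total loss:", total_loss(ce_loss=1.0, loc_loss=loc, lam=0.5))
```

In a real training loop the embeddings would come from the VT's final layer and the predictor would be trained jointly with the classifier, so that minimizing the localization term pushes the embeddings to retain spatial information.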

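Experimental Results

Research Questions

RQ1: How do different second-generation Visual Transformers perform on small or medium-sized datasets, compared with one another and with ResNets?
RQ2: Can a self-supervised auxiliary task improve VT training when data are scarce or under domain shift?
RQ3: Is the proposed dense relative localization loss broadly compatible with various VT architectures and training protocols (training from scratch or fine-tuning)?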

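Key Findings

VTs differ widely in performance on small datasets, even though their ImageNet results are similar.
Across several datasets, CvT tends to be more robust than Swin or T2T in the small-dataset regime.
Adding the dense relative localization loss (L_drloc) consistently improves VT accuracy across architectures and datasets, in some cases by a large margin (up to 45 points).
L_drloc provides significant regularization when training from scratch or with a limited number of epochs, and it also yields modest gains for ResNets.
The method plugs easily into existing VTs and does not rely on additional annotations.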


This review was generated by AI and reviewed by human editors.