QUICK REVIEW

[Paper Review] Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

Jaesin Ahn, Jiuk Hong|arXiv (Cornell University)|Nov 18, 2021

Machine Learning and ELM | Cited by 2

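Brief Summary

Under tight parameter constraints, this paper proposes three query (Q), key (K), and value (V) embedding structures for vision transformers (ViT): non-linear, shared, and code-based. Replacing the standard linear projections with learnable non-linear mappings, in particular shared layers with trainable code parameters, improves image classification accuracy while keeping the parameter count small: with only 3.1M parameters, the method reaches 71.4% top-1 accuracy on ImageNet-1k, compared with 69.9% for the original XCiT-N12.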

ABSTRACT

A vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, there remains the problem of finding better transformer networks. For example, conventional transformer-based models usually use a projection layer for each query (Q), key (K), and value (V) embedding before multi-head self-attention. Insufficient consideration of semantic $Q, K$, and $V$ embedding may lead to a performance drop. In this paper, we propose three types of structures for $Q$, $K$, and $V$ embedding. The first structure utilizes two layers with ReLU, which is a non-linear embedding for $Q, K$, and $V$. The second involves sharing one of the non-linear layers to share knowledge among $Q, K$, and $V$. The third proposed structure shares all non-linear layers with code parameters. The codes are trainable, and the values determine the embedding process to be performed among $Q$, $K$, and $V$. Hence, we demonstrate the superior image classification performance of the proposed approaches in experiments compared to several state-of-the-art approaches. The proposed method achieved $71.4\%$ with a few parameters (of $3.1M$) on the ImageNet-1k dataset compared to that required by the original transformer model of XCiT-N12 ($69.9\%$). Additionally, the method achieved $93.3\%$ with only $2.9M$ parameters in transfer learning on average for the CIFAR-10, CIFAR-100, Stanford Cars datasets, and STL-10 datasets, which is better than the accuracy of $92.2\%$ obtained via the original XCiT-N12 model.

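Motivation and Objectives

Address the performance drop in tiny vision transformers caused by insufficient consideration of semantic Q, K, and V embedding.
Improve ViT performance under strict parameter constraints by rethinking the Q, K, and V embedding mechanism.
Explore knowledge sharing among Q, K, and V through shared non-linear layers and learnable code parameters.
Show that non-linear and shared QKV embedding structures improve both ImageNet classification and transfer learning performance.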

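Proposed Method

Introduce a non-linear embedding built from two layers with ReLU, which maps input tokens into separate non-linear spaces for Q, K, and V.
Propose a structure in which one of the non-linear layers is shared among Q, K, and V to promote knowledge sharing.
Design a structure in which both non-linear layers are shared, with trainable code parameters (Cq, Ck, Cv) that determine which embedding (Q, K, or V) is computed.
Train the code parameters jointly with the network by backpropagation, minimizing the ViT classification loss.
Use t-SNE visualization to analyze the similarity and orthogonality of the codes, confirming that the codes learn distinct, task-agnostic features.
Evaluate performance under parameter constraints on ImageNet-1k and on transfer learning benchmarks (CIFAR-10, CIFAR-100, Stanford Cars, STL-10).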
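
The three embedding structures can be compared side by side with a small sketch. The example below is illustrative only and is not the authors' implementation: the helper names (make_linear, build, embed_all, param_counts), the hidden width h, and the toy sizes are all hypothetical. In particular, this summary does not say how the codes enter the shared layers; concatenating each code to the input token is an assumption made here purely for illustration.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create a (weights, bias) pair with small random weights."""
    w = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim

def linear(layer, x):
    """Apply y = W x + b to a single token vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def count_params(layers, codes=()):
    """Count weights and biases of the layers plus any code entries."""
    n = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
    return n + sum(len(c) for c in codes)

def build(d, h, c, seed=0):
    """Build the linear baseline and three variants for token dim d,
    hidden dim h and code size c (all illustrative values)."""
    rng = random.Random(seed)
    names = ("q", "k", "v")
    # Baseline: one linear projection each for Q, K, V.
    base = {n: make_linear(d, d, rng) for n in names}
    # Variant 1: a separate two-layer ReLU embedding each for Q, K, V.
    sep = {n: (make_linear(d, h, rng), make_linear(h, d, rng)) for n in names}
    # Variant 2: first layer shared, second layer separate.
    shared1 = make_linear(d, h, rng)
    heads = {n: make_linear(h, d, rng) for n in names}
    # Variant 3: both layers shared; a trainable code per Q, K, V is
    # concatenated to the token (ASSUMPTION, not confirmed by the source).
    full = (make_linear(d + c, h, rng), make_linear(h, d, rng))
    codes = {n: [rng.gauss(0.0, 1.0) for _ in range(c)] for n in names}
    return base, sep, (shared1, heads), (full, codes)

def embed_all(x, model):
    """Compute Q, K, V for one token x under each structure."""
    base, sep, (shared1, heads), (full, codes) = model
    out = {"linear": {n: linear(base[n], x) for n in base}}
    out["separate"] = {n: linear(l2, relu(linear(l1, x)))
                       for n, (l1, l2) in sep.items()}
    hidden = relu(linear(shared1, x))  # computed once, reused by Q, K, V
    out["one_shared"] = {n: linear(heads[n], hidden) for n in heads}
    l1, l2 = full
    out["code_shared"] = {n: linear(l2, relu(linear(l1, x + codes[n])))
                          for n in codes}
    return out

def param_counts(model):
    """Parameter count of the QKV embedding under each structure."""
    base, sep, (shared1, heads), (full, codes) = model
    return {
        "linear": count_params(base.values()),
        "separate": count_params([l for pair in sep.values() for l in pair]),
        "one_shared": count_params([shared1, *heads.values()]),
        "code_shared": count_params(full, codes.values()),
    }

model = build(d=8, h=8, c=4)
qkv = embed_all([0.1] * 8, model)
print(param_counts(model))
```

With these toy sizes, only the code-shared variant uses fewer embedding parameters than the linear baseline. The real counts depend on the hidden width and code size, which this summary does not state for the first two structures.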

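Experimental Results

Research Questions

RQ1: Under tiny-model parameter constraints, does replacing the linear QKV projections with non-linear mappings improve ViT performance?
RQ2: Does sharing a non-linear layer among Q, K, and V improve feature learning and classification accuracy?
RQ3: Can learnable code parameters that jointly define the Q, K, and V embeddings outperform independent projections?
RQ4: How do the proposed QKV embedding structures compare with state-of-the-art models such as XCiT-N12 on ImageNet and transfer learning tasks?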

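Key Findings

With only 3.1M parameters, the proposed method reaches 71.4% top-1 accuracy on ImageNet-1k, ahead of the original XCiT-N12 at 69.9%.
In transfer learning, the method averages 93.3% accuracy across CIFAR-10, CIFAR-100, Stanford Cars, and STL-10 with only 2.9M parameters, compared with 92.2% for the original XCiT-N12.
The code-based shared structure, with its trainable code parameters, performs especially well on CIFAR-100 and STL-10, suggesting that it learns features effectively across tasks.
t-SNE visualization indicates that the learned codes (Cq, Ck, Cv) are approximately orthogonal, suggesting that they learn independent, task-agnostic representations.
The l2-norms of the codes are consistent across ImageNet, Stanford Cars, and STL-10 but differ on CIFAR-10 and CIFAR-100, suggesting that the codes adapt to the dataset.
For the nano model a code size of 8 performs best, while for the tiny model a code size of 16 performs best, suggesting that the code size should scale with the embedding dimension.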
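
The orthogonality and l2-norm observations above can also be checked numerically once code vectors are available. The sketch below uses pairwise cosine similarity, which is a different and simpler tool than the t-SNE visualization reported in the paper. The function names and the code values are hypothetical, chosen only to exercise the functions; they are not learned codes from the paper.

```python
import math

def l2_norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    """Cosine similarity; values near 0 mean near-orthogonal vectors."""
    return sum(a * b for a, b in zip(u, v)) / (l2_norm(u) * l2_norm(v))

def code_report(codes):
    """Return per-code l2-norms and pairwise cosine similarities."""
    names = sorted(codes)
    norms = {n: l2_norm(codes[n]) for n in names}
    sims = {(a, b): cosine(codes[a], codes[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
    return norms, sims

# Hypothetical code values (NOT taken from the paper).
codes = {
    "Cq": [1.0, 0.0, 0.0, 0.0],
    "Ck": [0.0, 2.0, 0.0, 0.0],
    "Cv": [0.0, 0.0, 0.0, 3.0],
}
norms, sims = code_report(codes)
print(norms)
print(sims)
```

Comparing the norms dictionary across checkpoints fine-tuned on different datasets would be one way to look for the dataset-dependent differences described above, under the assumption that the codes can be read out of the model as plain vectors.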


This review was generated by AI and reviewed by a human editor.