[Paper Summary] Sequencer: Deep LSTM for Image Classification
Sequencer proposes an LSTM-based architecture as an alternative to ViT. Its Sequencer2D-L variant, with 54M parameters, reaches 84.6% top-1 accuracy on ImageNet-1K.
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and the robust resolution adaptability on double resolution-band.
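Motivation and Objectives
- Encourage exploration of inductive biases beyond self-attention in computer vision.
- Introduce Sequencer, a deep LSTM architecture for image classification.
- Propose a 2D Sequencer module, composed of vertical and horizontal LSTMs, to capture long-range dependencies.
- Demonstrate competitive performance on ImageNet-1K and discuss transferability and resolution robustness.
Proposed Method
- Model long-range dependencies with LSTMs instead of self-attention.
- In the 2D Sequencer (Sequencer2D), decompose the LSTM into vertical and horizontal components to improve performance.
- Report a model variant (Sequencer2D-L) with 54M parameters that reaches 84.6% top-1 accuracy when trained on ImageNet-1K alone.
- Evaluate transferability across datasets and the robustness of resolution adaptability on the double resolution band.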
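The vertical/horizontal decomposition described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the names `TinyLSTM`, `bidirectional`, and `sequencer2d_mixing` are hypothetical, the weights are random and untrained, and the use of bidirectional LSTMs with per-position concatenation is an assumption made for illustration. The sketch treats each row of a feature map as a sequence for a horizontal LSTM and each column as a sequence for a vertical LSTM, then concatenates the two outputs at every spatial position.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Minimal single-layer LSTM over a list of feature vectors (illustrative only)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        n = input_dim + hidden_dim
        # One weight matrix and bias per gate: input, forget, candidate, output.
        # Each gate acts on the concatenation [x_t, h_{t-1}].
        self.w = [[[rng.uniform(-0.1, 0.1) for _ in range(n)]
                   for _ in range(hidden_dim)] for _ in range(4)]
        self.b = [[0.0] * hidden_dim for _ in range(4)]

    def _gate(self, k, z):
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(self.w[k], self.b[k])]

    def run(self, seq):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in seq:
            z = list(x) + h
            i = [sigmoid(v) for v in self._gate(0, z)]
            f = [sigmoid(v) for v in self._gate(1, z)]
            g = [math.tanh(v) for v in self._gate(2, z)]
            o = [sigmoid(v) for v in self._gate(3, z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


def bidirectional(fwd, bwd, seq):
    """Run one LSTM forward and one backward; concatenate outputs per step."""
    out_f = fwd.run(seq)
    out_b = bwd.run(seq[::-1])[::-1]
    return [a + b for a, b in zip(out_f, out_b)]


def sequencer2d_mixing(fmap, hidden_dim, seed=0):
    """fmap: H x W grid (nested lists) of C-dimensional feature vectors.

    Returns an H x W grid whose entries have 4 * hidden_dim features:
    vertical (forward + backward) followed by horizontal (forward + backward).
    """
    rng = random.Random(seed)
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    v_fwd, v_bwd, h_fwd, h_bwd = (
        TinyLSTM(channels, hidden_dim, rng) for _ in range(4)
    )
    # Horizontal pass: every row is a sequence of length W.
    horiz = [bidirectional(h_fwd, h_bwd, fmap[r]) for r in range(height)]
    # Vertical pass: every column is a sequence of length H.
    vert = [bidirectional(v_fwd, v_bwd, [fmap[r][c] for r in range(height)])
            for c in range(width)]
    # Concatenate vertical and horizontal features at each position.
    return [[vert[c][r] + horiz[r][c] for c in range(width)]
            for r in range(height)]
```

For example, a 3 x 4 feature map with 3 channels and `hidden_dim=2` yields a 3 x 4 grid of 8-dimensional vectors. A practical block would likely follow the concatenation with a learned projection back to the original channel width; that step is omitted here to keep the sketch short.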
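Experimental Results
Research Questions
- RQ1: Can an LSTM-based architecture match Vision Transformers and MLP-Mixer on image classification?
- RQ2: Does the 2D Sequencer module, with vertical and horizontal LSTMs, improve image classification performance over a plain LSTM?
- RQ3: What are the transferability and resolution-adaptability characteristics of Sequencer models on standard benchmarks?
- RQ4: What is the trade-off between parameter count and accuracy for Sequencer2D-L on ImageNet-1K?
Key Findings
- Sequencer offers an effective LSTM-based alternative to ViT for image classification.
- Sequencer2D-L reaches 84.6% top-1 accuracy on ImageNet-1K with 54M parameters.
- The model shows good transferability and robust resolution adaptability on the double resolution band.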
This summary was generated by AI and reviewed by a human editor.