QUICK REVIEW

[Paper Review] Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck|arXiv (Cornell University)|Jun 6, 2024

Infrared Target Detection Methodologies|Cited by 20

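One-Sentence Summary

Vision-LSTM (ViL) applies the xLSTM architecture to vision: alternating mLSTM blocks traverse the patch tokens row by row in opposite directions, giving a generic backbone with near-linear complexity that shows competitive results on ImageNet, ADE20K, and VTAB-1K.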

ABSTRACT

Transformers are widely used as generic backbones in computer vision, despite initially introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture - the xLSTM - which overcomes long-standing LSTM limitations via exponential gating and parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaption of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as new generic backbone for computer vision architectures.

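Motivation and Goals

Motivate and explore a generic computer-vision backbone built on xLSTM, an architecture originally designed for language modeling.
Adapt xLSTM to process image patch tokens in alternating traversal directions, to handle non-autoregressive visual inputs.
Evaluate ViL on ImageNet-1K pre-training, ADE20K semantic segmentation, and VTAB-1K transfer classification to gauge its competitiveness with existing backbones.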

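Proposed Method

Split the image into non-overlapping patches and linearly project them; add a learnable positional embedding to obtain the patch tokens.
Build ViL as a stack of alternating mLSTM blocks: odd blocks traverse the patch tokens from top-left to bottom-right, even blocks from bottom-right to top-left.
Within each mLSTM block, use a matrix memory with a covariance update rule, which allows fully parallelized computation.
For classification, concatenate the first and last patch tokens (bilateral concatenation); a CLS token is not required.
Adapt xLSTM to vision by replacing the causal 1D convolution with a 2D convolution, and optionally include biases in the projections and layer normalization to improve stability and accuracy.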
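
The matrix memory with covariance update can be illustrated with the recurrent form of the mLSTM cell as given in the xLSTM paper: C_t = f_t C_{t-1} + i_t v_t k_t^T, n_t = f_t n_{t-1} + i_t k_t, h_t = C_t q_t / max(|n_t^T q_t|, 1). The sketch below is a simplification under stated assumptions: the gates are passed in as already-activated positive scalars, and the output gate, key scaling, and exponential-gate stabilization are left out. The function name `mlstm_step` is hypothetical and is not taken from the ViL code.

```python
def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One simplified recurrent mLSTM update (scalar gates, no output gate)."""
    d_k, d_v = len(k), len(v)
    # Covariance-style update: decay the old memory, add the outer product v k^T.
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d_k)]
             for r in range(d_v)]
    # The normalizer state accumulates the gated keys.
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d_k)]
    # Retrieval: h = C q / max(|n^T q|, 1).
    numerator = [sum(C_new[r][c] * q[c] for c in range(d_k)) for r in range(d_v)]
    denominator = max(abs(sum(n_new[c] * q[c] for c in range(d_k))), 1.0)
    h = [x / denominator for x in numerator]
    return C_new, n_new, h


# Store the pair (k, v), then query with the same key.
C0 = [[0.0, 0.0], [0.0, 0.0]]
n0 = [0.0, 0.0]
C1, n1, h1 = mlstm_step(C0, n0, q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
                        i_gate=1.0, f_gate=1.0)
# Store a second pair under an orthogonal key; the first value is still retrievable.
C2, n2, h2 = mlstm_step(C1, n1, q=[1.0, 0.0], k=[0.0, 1.0], v=[5.0, 7.0],
                        i_gate=1.0, f_gate=1.0)
```

In this toy run, querying with the first key returns the first value both times, which is the associative-memory behavior the outer-product update is meant to provide. The loop-based recurrent form is shown only for readability; it is not the parallel formulation used for training.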

Figure 1: Schematic overview of Vision-LSTM (ViL). Following ViT [ 18 ] , an input image is split into patches and linearly projected. Then, a learnable vector is added per position to the patches, producing a sequence of patch tokens. This sequence is then processed by alternating mLSTM blocks wher
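
The pipeline in the Figure 1 caption can be sketched in a few lines of plain Python. This is a toy illustration only: the linear projection and positional embedding are omitted, the image is a single-channel grid, and `causal_running_mean` is a hypothetical stand-in for an mLSTM block (any operator in which each output depends only on earlier tokens would do). The point is the traversal order: odd blocks read the row-major token sequence as is, even blocks read it reversed and then restore the original order.

```python
def patchify(image, patch):
    """Split a 2D grid into non-overlapping patch x patch tiles (row-major order)."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + r][left + c]
                           for r in range(patch) for c in range(patch)])
    return tokens


def causal_running_mean(tokens):
    """Hypothetical stand-in for an mLSTM block: output t averages tokens 0..t."""
    out, acc = [], [0.0] * len(tokens[0])
    for t, tok in enumerate(tokens, start=1):
        acc = [a + x for a, x in zip(acc, tok)]
        out.append([a / t for a in acc])
    return out


def alternating_stack(tokens, depth):
    """Odd blocks go top-left to bottom-right; even blocks go the opposite way."""
    for block_index in range(1, depth + 1):
        if block_index % 2 == 0:
            tokens = causal_running_mean(tokens[::-1])[::-1]
        else:
            tokens = causal_running_mean(tokens)
    return tokens


image = [[4 * r + c for c in range(4)] for r in range(4)]  # 4x4 grid, values 0..15
tokens = patchify(image, patch=2)                           # 4 tokens of length 4
features = alternating_stack(tokens, depth=2)
```

Reversing the sequence before an even block and flipping it back afterwards keeps every token at its original position while changing which tokens it can see.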

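Experimental Results

Research Questions

RQ1: Can xLSTM-based blocks serve as a generic backbone for vision tasks, beyond language modeling?
RQ2: Which architectural design choices (traversal directions, parameter sharing, pooling/classification design) give ViL the best performance on standard vision benchmarks?
RQ3: How does ViL perform on ImageNet-1K, ADE20K, and VTAB-1K compared with optimized ViTs and other vision backbones?
RQ4: What are ViL's computational characteristics (FLOPs, runtime) at different scales relative to other backbones?

Key Findings

ViL achieves competitive results with ImageNet-1K pre-training: it outperforms several optimized ViT recipes at the tiny and small scales and remains strong at larger scales.
On ADE20K, ViL-S and ViL-B reach higher mIoU and ACC than several baselines, with ViL-B matching or exceeding some DeiT variants.
On VTAB-1K transfer, ViL's average performance over the natural, specialized, and structured datasets exceeds that of several baselines, and it stands out on the structured datasets.
The alternating bidirectional block design improves performance while remaining more compute-efficient than blocks that traverse more directions; a four-direction variant gives higher accuracy at a significant runtime cost.
The classification design is robust to the pooling strategy; bilateral concatenation (first + last token) works well without relying on a CLS token.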
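
As a rough illustration of bilateral concatenation, the sketch below pools a token sequence by joining its first and last token and feeds the result to a plain linear classifier. The names `bilateral_concat` and `linear_head`, the toy token values, and the weights are all hypothetical. One plausible reading of why this pooling suits ViL is that, under directional traversal, the two ends of the sequence are the positions where a forward or backward pass has seen every patch.

```python
def bilateral_concat(tokens):
    """Pool a token sequence by concatenating its first and last token."""
    return list(tokens[0]) + list(tokens[-1])


def linear_head(features, weights, bias):
    """Plain linear classifier: one score per class (one weight row per class)."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]     # 3 tokens, feature size d = 2
pooled = bilateral_concat(tokens)                  # feature size 2 * d = 4
weights = [[1.0, 0.0, 0.0, 0.0],                   # toy 2-class head
           [0.0, 0.0, 0.0, 1.0]]
logits = linear_head(pooled, weights, bias=[0.0, 0.5])
```

Note that the pooled vector is twice the token width, so the classifier head takes 2d inputs rather than d.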

Figure 2: Performance overview of ImageNet-1K pre-trained models in relation to pre-training compute. ViL shows strong performances across classification and semantic segmentation tasks.


This review was generated by AI and reviewed by a human editor.