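[Paper Explained] SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration
SAM-Lightening redesigns SAM's image encoder with Dilated Flash Attention and dynamic layer-wise distillation, achieving roughly 30× faster inference and a much smaller memory footprint while preserving segmentation quality.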
The Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to its zero-shot generalization ability. However, broader application of SAM to real-world practice has been restricted by its low inference speed and high memory demands, which mainly stem from the attention mechanism. Existing work has concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which leaves room for further improvement. In response, we introduce SAM-Lightening, a variant of SAM that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency, but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it achieves an inference speed of 7 milliseconds (ms) per image for images of size 1024×1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times faster than the state-of-the-art. Moreover, it takes only 244MB memory, which is 3.5% of the vanilla SAM. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/.
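Motivation and Goals
- Address the computational bottleneck that limits practical deployment of the Segment Anything Model (SAM).
- Introduce an efficient image encoder built on Dilated Flash Attention to speed up inference and reduce memory usage.
- Propose Dynamic Layer-Wise Distillation (DLD) to transfer knowledge from the vanilla SAM without training from scratch.
- Show that SAM-Lightening stays competitive with the baseline's segmentation performance on COCO and LVIS while being substantially more efficient.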
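Proposed Method
- Design a Dilated Flash Attention mechanism to replace vanilla self-attention, enabling higher parallelism and sparsification within segments.
- Apply Dynamic Layer-Wise Distillation (DLD) to progressively transfer knowledge from SAM to the lightweight encoder.
- Use decoupled feature distillation, focusing on aligning the deeper features close to the output with the teacher model's representations.
- Fine-tune the decoder so that it aligns with the lightweight encoder on prompt-based segmentation tasks.
- Train on 1% of the SA-1B data, cache the SAM encoder outputs to speed up distillation, and evaluate on standard benchmarks.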
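The segment-and-sparsify idea behind the attention redesign can be sketched in a few lines. This is a minimal illustration, assuming a LongNet-style scheme in which the token sequence is cut into fixed-length segments and only every `dilation`-th token in each segment attends to the others. The function names and the `segment_len` and `dilation` parameters are hypothetical and are not taken from the paper's code. The dense attention inside each group is written in plain Python here; a real implementation would presumably hand each dense sub-block to a FlashAttention kernel.

```python
import math

def dilated_groups(seq_len, segment_len, dilation):
    """Split positions into segments, then keep every `dilation`-th
    position inside each segment (assumed LongNet-style sparsification)."""
    groups = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        groups.append(list(range(start, end, dilation)))
    return groups

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(q, k, v):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def dilated_attention(q, k, v, segment_len, dilation):
    """Run dense attention independently inside each sparsified segment.
    Positions skipped by the dilation keep a zero output in this sketch."""
    dim = len(v[0])
    out = [[0.0] * dim for _ in q]
    for idx in dilated_groups(len(q), segment_len, dilation):
        sub = dense_attention([q[i] for i in idx],
                              [k[i] for i in idx],
                              [v[i] for i in idx])
        for i, row in zip(idx, sub):
            out[i] = row
    return out
```

With 8 tokens, `segment_len=4` and `dilation=2`, the groups are `[[0, 2], [4, 6]]`, so attention touches 2 × (2 × 2) = 8 query-key pairs instead of the 64 needed by dense attention. Each group is independent of the others, which is what makes the computation easy to parallelize.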
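Experimental Results
Research Questions
- RQ1: Can the redesigned attention mechanism (Dilated Flash Attention) accelerate SAM's encoder without sacrificing accuracy?
- RQ2: Can dynamic layer-wise distillation effectively transfer knowledge from SAM to a lightweight encoder?
- RQ3: What is the trade-off between speed, memory, and segmentation performance for SAM-Lightening on COCO and LVIS?
- RQ4: How does SAM-Lightening compare with state-of-the-art SAM variants under Box, 1P, and 3P prompts and in Anything mode?
Key Findings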
| Model | Encoder (ms) | Decoder (ms) | Total (ms) | Speedup | Memory |
|---|---|---|---|---|---|
| SAM-ViT-H | 216.1 | 3.8 | 219.9 | 1.0× | 5.7GB |
| SAMFast | 23.2 | 3.8 | 27.0 | 8.5× | 4.1GB |
| FastSAM | 20.7 | 3.4 | 24.1 | 9.1× | 2.6GB |
| EfficientSAM | 22.3 | 3.8 | 26.1 | 8.3× | 309MB |
| MobileSAM | 8.1 | 3.8 | 11.9 | 18.5× | 309MB |
| SAM-Lightening | 3.5 | 3.4 | 6.9 | 30.1× | 224MB |
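- SAM-Lightening reaches about 7 ms per image on 1024×1024 inputs, which is 30.1× faster than the vanilla SAM and 2.1× faster than the previous state of the art.
- Memory usage drops to 224 MB, about 3.5% of the vanilla SAM.
- Inference latency and memory efficiency are better than those of competing lightweight SAM variants across prompts and datasets.
- SAM-Lightening keeps segmentation performance comparable to the vanilla SAM under Box and Point prompts on COCO and LVIS.
- Dynamic Layer-Wise Distillation achieves efficient knowledge transfer through staged, layer-by-layer weighting and a focus on the deeper features near the output.
- The decoder is fine-tuned with point and box prompts so that it stays aligned with the lightweight encoder.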
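To make the layer-weighting idea concrete, here is a small sketch of a feature-distillation loss whose per-layer weights change as training progresses. The schedule below (weights proportional to `1 + progress * layer_index`, then normalized) is a hypothetical choice made only for illustration; the summary does not state the paper's actual schedule, and the function names are likewise assumptions. The teacher features would come from the cached SAM encoder outputs mentioned earlier.

```python
def layer_weights(progress, num_layers):
    """Hypothetical schedule: uniform weights at progress=0.0,
    increasingly favouring deeper layers as progress approaches 1.0."""
    raw = [1.0 + progress * layer for layer in range(num_layers)]
    total = sum(raw)
    return [r / total for r in raw]

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(student_feats, teacher_feats, progress):
    """Weighted sum of per-layer feature MSE between student and teacher.
    Both arguments are lists with one feature vector per layer."""
    weights = layer_weights(progress, len(student_feats))
    return sum(w * mse(s, t)
               for w, s, t in zip(weights, student_feats, teacher_feats))
```

For example, `layer_weights(0.0, 4)` gives equal weight to all four layers, while at `progress=1.0` the deepest layer carries the largest weight, mirroring the emphasis on features close to the output.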
This explainer was generated by AI and reviewed by human editors.