QUICK REVIEW

[Paper Review] Head-Free Lightweight Semantic Segmentation with Linear Transformer

Bo Dong, Pichao Wang|arXiv (Cornell University)|Jan 11, 2023
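Advanced Neural Network Applications | Cited by 9
One-Sentence Summary

AFFormer is a head-free, lightweight semantic segmentation architecture. It uses a parallel heterogeneous design that combines prototype learning with adaptive frequency filtering, and reaches state-of-the-art accuracy at very low FLOPs (41.8 mIoU at 4.6 GFLOPs on ADE20K, 78.7 mIoU at 34.4 GFLOPs on Cityscapes).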

ABSTRACT

Existing semantic segmentation works have been mainly focused on designing effective decoders; however, the computational load introduced by the overall structure has long been ignored, which hinders their applications on resource-constrained hardwares. In this paper, we propose a head-free lightweight architecture specifically for semantic segmentation, named Adaptive Frequency Transformer. It adopts a parallel architecture to leverage prototype representations as specific learnable local descriptions which replaces the decoder and preserves the rich image semantics on high-resolution features. Although removing the decoder compresses most of the computation, the accuracy of the parallel structure is still limited by low computational resources. Therefore, we employ heterogeneous operators (CNN and Vision Transformer) for pixel embedding and prototype representations to further save computational costs. Moreover, it is very difficult to linearize the complexity of the vision Transformer from the perspective of spatial domain. Due to the fact that semantic segmentation is very sensitive to frequency information, we construct a lightweight prototype learning block with adaptive frequency filter of complexity $O(n)$ to replace standard self attention with $O(n^{2})$. Extensive experiments on widely adopted datasets demonstrate that our model achieves superior accuracy while retaining only 3M parameters. On the ADE20K dataset, our model achieves 41.8 mIoU and 4.6 GFLOPs, which is 4.4 mIoU higher than Segformer, with 45% less GFLOPs. On the Cityscapes dataset, our model achieves 78.7 mIoU and 34.4 GFLOPs, which is 2.5 mIoU higher than Segformer with 72.5% less GFLOPs. Code is available at https://github.com/dongbo811/AFFormer.
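
The complexity claim in the abstract can be made concrete with a standard cost count. This is a generic illustration rather than the paper's own derivation: the symbols $Q, K, V \in \mathbb{R}^{n \times d}$ are assumed names, $n$ is the number of tokens, $d$ is the channel dimension, and the second expression assumes a similarity that can be applied without a row-wise softmax over the $n \times n$ matrix.

$$
\underbrace{(Q K^{\top})}_{n \times n}\, V \;\Rightarrow\; O(n^{2} d),
\qquad
Q\, \underbrace{(K^{\top} V)}_{d \times d} \;\Rightarrow\; O(n d^{2}).
$$

For a fixed $d$, the second ordering grows linearly in $n$, which is why avoiding the explicit $n \times n$ matrix matters so much on high-resolution feature maps.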

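Motivation and Objectives

  • Reduce the computational cost of semantic segmentation by removing the heavy decoder head.
  • Propose a parallel heterogeneous architecture that preserves high-resolution semantics without dense self-attention.
  • Introduce prototype-based local descriptions and adaptive frequency filtering as a replacement for standard self-attention.
  • Show that a linear-complexity approach can outperform lightweight-decoder designs on standard benchmarks.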

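Proposed Method

  • Introduces the Adaptive Frequency Transformer (AFFormer), whose parallel architecture replaces the decoder with prototype features and pixel descriptors to restore high-resolution semantics.
  • Uses Transformer-based prototype learning (PL) to update the clustered prototype centers G′, and a CNN-based pixel descriptor (PD) to restore the features F′, preserving high-resolution detail.
  • Replaces standard self-attention with an adaptive frequency filter (AFF) made up of a frequency similarity kernel (FSK), a dynamic low-pass filter (DLF), and a dynamic high-pass filter (DHF), giving complexity that is linear in the input resolution.
  • Shares weights across the frequency extraction and enhancement modules to cut cost, and adds depthwise separable convolutions to the FFN to fuse features efficiently.
  • Applies a single convolutional classification layer (CLS) to single-scale features, making the segmentation pipeline nearly as simple as image classification.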
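
The low-pass and high-pass split behind DLF and DHF can be sketched on a 1-D signal. This is a minimal illustration under stated assumptions, not the authors' implementation: the low-pass filter is a plain moving average, the high-pass band is the residual, and the fixed gains stand in for the weights that the paper's dynamic filters would learn. The function names `low_pass`, `split_bands`, and `recombine` are hypothetical.

```python
def low_pass(signal, window=3):
    """Moving-average low-pass filter with edge replication (window must be odd)."""
    half = window // 2
    n = len(signal)
    out = []
    for i in range(n):
        # Clamp indices at the borders so every output averages `window` samples.
        taps = [signal[min(max(j, 0), n - 1)] for j in range(i - half, i + half + 1)]
        out.append(sum(taps) / window)
    return out


def split_bands(signal, window=3):
    """Return (low, high), where high is the residual signal - low."""
    low = low_pass(signal, window)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def recombine(low, high, low_gain=1.0, high_gain=1.0):
    """Weighted sum of the two bands; the gains are stand-ins for learned weights."""
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]


signal = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]  # flat region with one sharp spike
low, high = split_bands(signal)
sharpened = recombine(low, high, high_gain=2.0)  # boost edges and fine detail
```

With a fixed window, each pass touches every sample a constant number of times, so the cost is linear in the signal length. With unit gains the two bands sum back to the input, and raising `high_gain` amplifies the spike while leaving flat regions unchanged.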

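Experimental Results

Research Questions

  • RQ1: Can a head-free, lightweight Transformer architecture deliver high-accuracy semantic segmentation at very low computational cost?
  • RQ2: How can prototype-based representations and adaptive frequency processing replace the conventional decoder and self-attention in segmentation models?
  • RQ3: How do the parallel heterogeneous design of PD (pixel descriptor) and PL (prototype learning) and the adaptive frequency filter affect accuracy and efficiency across datasets (ADE20K, Cityscapes, COCO-stuff)?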

Key Findings

| Model | #Param. | FLOPs | mIoU |
|---|---|---|---|
| AFFormer-tiny | 1.6M | 2.8G | 38.7 |
| AFFormer-small | 2.3M | 3.6G | 40.2 |
| AFFormer-base | 3.0M | 4.6G | 41.8 |
| SegFormer | 3.8M | 8.4G | 37.4 |
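  • On ADE20K (512×512), AFFormer reaches 41.8 mIoU with 4.6 GFLOPs and 3M parameters, 4.4 mIoU higher than SegFormer with 45% fewer FLOPs.
  • On Cityscapes, AFFormer reaches 78.7 mIoU with 34.4 GFLOPs, 2.5 mIoU higher than SegFormer with 72.5% fewer FLOPs.
  • AFFormer-tiny, AFFormer-small, and AFFormer-base use markedly fewer parameters and FLOPs than SegFormer and other lightweight competitors while offering a favorable speed-accuracy trade-off.
  • The parallel architecture, which drops the decoder and combines PD with PL, gives a better balance of accuracy and compute than a plain pyramid design or a ViT-only design.
  • The adaptive frequency prototype learning module, with FSK, DLF, and DHF, contributes substantially to segmentation quality across datasets; ablations show the best performance when all components are combined.
  • On high-resolution Cityscapes, AFFormer runs markedly faster (22 FPS versus 12 FPS for SegFormer) while also achieving higher mIoU.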
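
The reported percentages can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in this review; the SegFormer Cityscapes baseline is not quoted here, so the values labeled "implied" are derived from the stated gaps and are an assumption rather than reported figures. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


# ADE20K FLOPs (in G) as listed in the results table.
ade_flops_reduction = relative_reduction(baseline=8.4, ours=4.6)

# Cityscapes baseline implied by the stated gaps (derived, not quoted).
implied_city_miou = 78.7 - 2.5
implied_city_gflops = 34.4 / (1 - 0.725)

# FPS figures reported for high-resolution Cityscapes.
speedup = 22 / 12

print(f"ADE20K FLOPs reduction: {ade_flops_reduction:.1%}")
print(f"Implied SegFormer on Cityscapes: {implied_city_miou:.1f} mIoU, "
      f"{implied_city_gflops:.1f} GFLOPs")
print(f"Cityscapes speedup: {speedup:.2f}x")
```

The ADE20K figure comes out at roughly 45%, matching the stated reduction, and the 22 versus 12 FPS comparison corresponds to a speedup of a little over 1.8x.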


This review was generated by AI and checked by human editors.