QUICK REVIEW

[Paper Review] Fully Transformer Networks for Semantic Image Segmentation

Sitong Wu, Tianyi Wu|arXiv (Cornell University)|Jun 8, 2021

Advanced Neural Network Applications|References 60|Cited by 32

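One-Sentence Summary

This paper proposes Fully Transformer Networks (FTN), which pair a Pyramid Group Transformer (PGT) encoder with a Feature Pyramid Transformer (FPT) decoder and, without using any CNN, achieve state-of-the-art semantic segmentation results on PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ.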

ABSTRACT

Transformers have shown impressive performance in various natural language processing and computer vision tasks, due to the capability of modeling long-range dependencies. Recent progress has demonstrated that combining such Transformers with CNN-based semantic image segmentation models is very promising. However, it is not well studied yet on how well a pure Transformer based approach can achieve for image segmentation. In this work, we explore a novel framework for semantic image segmentation, which is encoder-decoder based Fully Transformer Networks (FTN). Specifically, we first propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, meanwhile reducing the computation complexity of the standard Visual Transformer (ViT). Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation. Surprisingly, this simple baseline can achieve better results on multiple challenging semantic segmentation and face parsing benchmarks, including PASCAL Context, ADE20K, COCOStuff, and CelebAMask-HQ. The source code will be released on https://github.com/BR-IDL/PaddleViT.

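Motivation and Objectives

Explore whether a pure Transformer architecture, with no CNN components, can handle pixel-level semantic segmentation.
Introduce a hierarchical Transformer encoder (PGT) that learns multi-scale representations with a controllable receptive field.
Propose a Transformer-based decoder (FPT) that fuses semantic and spatial information across levels.
Demonstrate state-of-the-art performance on standard segmentation benchmarks.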

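Proposed Method

Define the Pyramid Group Transformer (PGT) as a four-stage encoder that combines patch transform layers with Pyramid Group Multi-Head Self-Attention (PG-MSA) to learn hierarchical features.
Control the receptive field by computing attention within non-overlapping groups whose size grows from stage to stage.
Introduce the Feature Pyramid Transformer (FPT), a top-down multi-level fusion decoder that uses lateral connections and Transformer blocks to build high-resolution predictions.
Train FTN as a Transformer-based encoder-decoder framework and evaluate it on PASCAL Context, ADE20K, COCO-Stuff, and CelebAMask-HQ.
Pre-train PGT on ImageNet-1K and fine-tune it on the segmentation benchmarks, applying standard data augmentation and training schedules.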
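The grouped attention described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: it assumes square groups, a single attention head, and identity query/key/value projections, and the names `group_attention` and `attention_pairs` are hypothetical helpers introduced here for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def group_attention(feature_map, group_size):
    """Run self-attention independently inside each non-overlapping
    group_size x group_size window of an H x W grid of token vectors."""
    h, w = len(feature_map), len(feature_map[0])
    assert h % group_size == 0 and w % group_size == 0
    out = [[None] * w for _ in range(h)]
    for top in range(0, h, group_size):
        for left in range(0, w, group_size):
            coords = [(r, c)
                      for r in range(top, top + group_size)
                      for c in range(left, left + group_size)]
            attended = self_attention([feature_map[r][c] for r, c in coords])
            for (r, c), vec in zip(coords, attended):
                out[r][c] = vec
    return out


def attention_pairs(h, w, group_size):
    """Query-key pairs scored: (number of groups) * (tokens per group) ** 2."""
    n_groups = (h // group_size) * (w // group_size)
    return n_groups * (group_size * group_size) ** 2


# A 4 x 4 grid of 2-dimensional tokens (toy values).
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
local = group_attention(grid, group_size=2)   # four independent 2 x 2 groups
print(attention_pairs(4, 4, 4))  # one group of 16 tokens: 256 pairs
print(attention_pairs(4, 4, 2))  # four groups of 4 tokens: 64 pairs
```

Splitting N tokens into G equal groups reduces the number of scored query-key pairs from N^2 to N^2 / G, which is why smaller groups in the early, high-resolution stages keep the cost manageable while larger groups later widen the receptive field.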

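Experimental Results

Research Questions

RQ1: Can a pure Transformer encoder-decoder framework match or surpass CNN-based segmentation models on standard benchmarks?
RQ2: Does a pyramid/grouped self-attention encoder combined with a Transformer-based decoder effectively capture multi-scale context for pixel-level prediction?
RQ3: How do the choice of encoder and decoder and the multi-scale fusion strategy affect segmentation accuracy?
RQ4: How does FTN compare with state-of-the-art Transformer and CNN-based segmentation methods in accuracy and efficiency?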

Key Findings

Method	Backbone	PASCAL Context mIoU (%)	ADE20K mIoU (%)	COCO-Stuff mIoU (%)
FTN-T (ours)	PGT-T	51.15	47.12	41.57
FTN-S (ours)	PGT-S	53.09	48.68	43.63
FTN-B (ours)	PGT-B	54.93	50.88	44.82
FTN-L (ours)	PGT-L	56.05	51.36	45.89
UperNet	Swin-B	52.57	49.72	42.20
SETR-MLA	ViT-L/16	55.83	50.28	-
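For readers unfamiliar with the metric in the table, the following is a minimal sketch of mean intersection-over-union (mIoU) on flattened label lists. The function name `mean_iou` and the `ignore_index=255` default are assumptions made for illustration; benchmark toolkits typically accumulate intersections and unions over the whole validation set rather than per image, and this sketch does not reproduce any particular toolkit.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred, target: equal-length sequences of integer class labels.
    Pixels whose target equals ignore_index are skipped.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts once in both intersection and union.
            inter[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(round(100 * score, 2))  # mIoU as a percentage
```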

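With PGT and FPT, FTN reaches state-of-the-art or competitive mIoU on the main benchmarks: FTN-L scores 56.05% on PASCAL Context, 51.36% on ADE20K, and 45.89% on COCO-Stuff.
FTN-T, FTN-S, FTN-B, and FTN-L outperform comparable backbones (PVT, Swin, ViT) at similar computational cost; in some settings FTN-L surpasses ViT-L/16.
The Pyramid Group Transformer (PGT) learns hierarchical features and uses less computation and memory than a global-attention ViT, which makes dense prediction practical.
The Feature Pyramid Transformer (FPT) effectively fuses multi-level semantic and spatial information and yields consistent gains over alternative decoders.
The FTN variants also perform well on CelebAMask-HQ face parsing: FTN-L reaches a mean F1 score of 87.4 and surpasses several baselines.
ImageNet-1K pre-training is sufficient for competitive results; larger backbones and multi-scale inference bring further gains.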
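The top-down, multi-level fusion attributed to the FPT decoder can be sketched roughly as follows. This is an illustrative simplification, not the paper's decoder: it assumes single-channel 2D grids, nearest-neighbour upsampling, fusion by element-wise addition, and a placeholder `block` callable standing in for the Transformer blocks. All function names are hypothetical.

```python
def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a 2D grid of floats."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Element-wise sum of two equally sized 2D grids."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(levels, block=lambda g: g):
    """Fuse encoder levels from coarse to fine.

    levels: grids ordered fine-to-coarse, each half the resolution of the
            previous one (as in a feature pyramid).
    block:  placeholder for the per-level Transformer blocks (assumption:
            identity by default).
    Returns the fused grids in the same fine-to-coarse order.
    """
    fused = [None] * len(levels)
    fused[-1] = block(levels[-1])
    for i in range(len(levels) - 2, -1, -1):
        lateral = levels[i]                    # lateral connection
        top_down = upsample2x(fused[i + 1])    # coarser, more semantic signal
        fused[i] = block(add(lateral, top_down))
    return fused


# Three toy pyramid levels: 4 x 4, 2 x 2, and 1 x 1.
levels = [
    [[1.0] * 4 for _ in range(4)],
    [[10.0] * 2 for _ in range(2)],
    [[100.0]],
]
fused = top_down_fuse(levels)
print(fused[0][0][0])  # finest level now carries information from every level
```

The finest fused grid keeps its original resolution while mixing in context from every coarser level, which is the property a segmentation head needs for high-resolution prediction.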

This review was generated by AI and reviewed by a human editor.