QUICK REVIEW

[Paper Review] IDRNet: Intervention-Driven Relation Network for Semantic Segmentation

Zhenchao Jin, Xiaowei Hu | arXiv (Cornell University) | Oct 16, 2023

Multimodal Machine Learning Applications | Cited by 14

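One-Sentence Summary

IDRNet introduces an intervention-driven paradigm that uses deletion diagnostics to build semantic-level relations and strengthen pixel representations; its lightweight, framework-compatible module improves segmentation on multiple benchmarks.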

ABSTRACT

Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks, which inspires the development of numerous context modeling paradigms, e.g., multi-scale-driven and similarity-driven context schemes. Despite the impressive results, these existing paradigms often suffer from inadequate or ineffective contextual information aggregation due to reliance on large amounts of predetermined priors. To alleviate the issues, we propose a novel Intervention-Driven Relation Network (IDRNet), which leverages a deletion diagnostics procedure to guide the modeling of contextual relations among different pixels. Specifically, we first group pixel-level representations into semantic-level representations with the guidance of pseudo labels and further improve the distinguishability of the grouped representations with a feature enhancement module. Next, a deletion diagnostics procedure is conducted to model relations of these semantic-level representations via perceiving the network outputs and the extracted relations are utilized to guide the semantic-level representations to interact with each other. Finally, the interacted representations are utilized to augment original pixel-level representations for final predictions. Extensive experiments are conducted to validate the effectiveness of IDRNet quantitatively and qualitatively. Notably, our intervention-driven context scheme brings consistent performance improvements to state-of-the-art segmentation frameworks and achieves competitive results on popular benchmark datasets, including ADE20K, COCO-Stuff, PASCAL-Context, LIP, and Cityscapes. Code is available at https://github.com/SegmentationBLWX/sssegmentation.

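Motivation and Objectives

Address the limitations of existing context modules for semantic segmentation, which rely on large amounts of predetermined priors.
Propose an intervention-driven paradigm that models semantic-level relations to guide pixel interactions.
Develop a deletion diagnostics mechanism that updates the semantic relation matrix to improve segmentation.
Demonstrate compatibility and performance gains when integrated with popular segmentation backbones and frameworks.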

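Proposed Method

Group pixel-level features into semantic-level representations using pseudo labels.
Improve the semantic-level features with a discriminative feature enhancement module.
Build and update a semantic-level relation matrix via deletion diagnostics, enabling interaction between classes.
Let the semantic-level representations interact to produce enhanced features that augment the pixel representations.
Fuse the enhanced features with the original pixel representations and apply self-attention before the final prediction.
Train with a joint objective that combines cross-entropy losses on the pseudo labels and on the final predictions.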
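
The grouping, interaction, and augmentation steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, grouping is taken to be a per-class average, fusion is taken to be concatenation, and the feature enhancement module and self-attention are left out.

```python
def group_by_pseudo_label(pixel_feats, pseudo_labels, num_classes):
    """Average pixel features per pseudo-label class (assumed grouping rule)."""
    dim = len(pixel_feats[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feat, label in zip(pixel_feats, pseudo_labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += feat[d]
    # Classes that do not occur in the image are marked with None.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def interact(semantic_reps, relation):
    """Mix the classes present in the image, weighted by relation[i][j]."""
    present = [c for c, rep in enumerate(semantic_reps) if rep is not None]
    dim = len(semantic_reps[present[0]])
    interacted = {}
    for i in present:
        weights = [relation[i][j] for j in present]
        total = sum(weights) or 1.0  # avoid dividing by zero
        interacted[i] = [
            sum(w * semantic_reps[j][d] for w, j in zip(weights, present)) / total
            for d in range(dim)
        ]
    return interacted


def augment_pixels(pixel_feats, pseudo_labels, interacted):
    """Concatenate each pixel feature with its class's interacted representation."""
    return [feat + interacted[label] for feat, label in zip(pixel_feats, pseudo_labels)]


# Toy input: three pixels with 2-D features, three classes, class 1 absent.
pixel_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
pseudo_labels = [0, 0, 2]
relation = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

semantic = group_by_pseudo_label(pixel_feats, pseudo_labels, num_classes=3)
interacted = interact(semantic, relation)
augmented = augment_pixels(pixel_feats, pseudo_labels, interacted)
```

In this toy input the relation matrix links classes 0 and 2 with equal weight, so each augmented pixel carries an even blend of both class representations next to its own feature.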

Figure 1: Diagram of our intervention-driven relation network. Deletion diagnostics is leveraged to build relations between semantic-level representations. With the built relation matrix and semantic-level representations, pixel representations can be augmented for pixel prediction.
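
To make the idea of deletion diagnostics concrete, the sketch below removes one semantic-level representation at a time, re-scores the rest, and folds the observed change into a relation matrix with a moving average. It is only an illustration under assumptions: `toy_score` is a hypothetical stand-in for the network output, and the absolute-difference effect measure and the `momentum` parameter are choices made for this example, not details confirmed by the review.

```python
def toy_score(reps):
    """Hypothetical stand-in for the network output: one scalar score per class."""
    total = sum(sum(rep) for rep in reps.values())
    return {c: sum(rep) + 0.5 * (total - sum(rep)) for c, rep in reps.items()}


def deletion_diagnostics(semantic_reps, score_fn):
    """Return effects[(i, j)]: change in class i's score when class j is deleted."""
    classes = sorted(semantic_reps)
    baseline = score_fn(semantic_reps)
    effects = {}
    for j in classes:
        # Intervention: drop class j and look at the output again.
        remaining = {c: rep for c, rep in semantic_reps.items() if c != j}
        perturbed = score_fn(remaining)
        for i in classes:
            if i != j:
                effects[(i, j)] = abs(baseline[i] - perturbed[i])
    return effects


def update_relation(relation, effects, momentum=0.9):
    """Blend measured effects into the relation matrix (assumed moving average)."""
    for (i, j), effect in effects.items():
        relation[i][j] = momentum * relation[i][j] + (1.0 - momentum) * effect
    return relation


# Two classes present in a three-class problem.
reps = {0: [2.0, 0.0], 2: [0.0, 4.0]}
effects = deletion_diagnostics(reps, toy_score)
relation = update_relation([[0.0] * 3 for _ in range(3)], effects, momentum=0.5)
```

No gradient flows into the relation matrix here: every entry comes from comparing outputs before and after an intervention, which is what separates this style of update from a backpropagation-based one.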

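Experimental Results

Research Questions

RQ1: Can deletion diagnostics effectively guide the construction of pixel relations by focusing on semantic-level interactions?
RQ2: Does the intervention-driven context scheme consistently improve segmentation accuracy across datasets and backbones?
RQ3: How do accuracy and efficiency change when IDRNet is integrated into existing frameworks such as FCN, PSPNet, DeeplabV3, and UPerNet?
RQ4: Is the semantic-level relation approach robust in cross-domain segmentation tasks?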

Key Findings

Context module	Params	FLOPs	Time	GPU memory	mIoU (%) on ADE20K (train/val)
OCR	15.12M	242.48G	16.58ms	617.24M	42.47
ASPP	--	674.47G	41.98ms	976.06M	43.19
PPM	23.07M	309.45G	21.45ms	960.63M	42.64
UPerNet	34.75M	500.76G	36.51ms	1429.18M	43.02
ANN	22.42M	369.62G	26.58ms	1445.75M	41.75
CCNet	23.92M	397.38G	30.92ms	986.28M	42.48
DNL	24.12M	395.25G	51.38ms	2381.04M	43.50
IDRNet	10.79M	155.89G	20.52ms	365.66M	43.61
PPM+IDRNet	23.65M	349.23G	32.64ms	1034.28M	44.02
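
The trade-offs in the table are easier to compare as relative figures. The short script below copies the FLOPs, time, and mIoU columns from the table and reports the mIoU gain and FLOPs ratio between any two rows; the `compare` helper is a hypothetical name introduced for this illustration.

```python
# (FLOPs in G, time in ms, mIoU in %) copied from the table above.
TABLE = {
    "OCR": (242.48, 16.58, 42.47),
    "ASPP": (674.47, 41.98, 43.19),
    "PPM": (309.45, 21.45, 42.64),
    "UPerNet": (500.76, 36.51, 43.02),
    "ANN": (369.62, 26.58, 41.75),
    "CCNet": (397.38, 30.92, 42.48),
    "DNL": (395.25, 51.38, 43.50),
    "IDRNet": (155.89, 20.52, 43.61),
    "PPM+IDRNet": (349.23, 32.64, 44.02),
}


def compare(module, baseline, table=TABLE):
    """Return (mIoU gain in points, FLOPs ratio) of `module` relative to `baseline`."""
    m_flops, _, m_miou = table[module]
    b_flops, _, b_miou = table[baseline]
    return round(m_miou - b_miou, 2), round(m_flops / b_flops, 3)


for name in TABLE:
    if name != "IDRNet":
        gain, ratio = compare("IDRNet", name)
        print(f"IDRNet vs {name}: {gain:+.2f} mIoU at {ratio:.3f}x the FLOPs")

# Which single row is cheapest in FLOPs, and which is fastest?
cheapest = min(TABLE, key=lambda k: TABLE[k][0])
fastest = min(TABLE, key=lambda k: TABLE[k][1])
```

On these numbers IDRNet has the lowest FLOPs of any row, while OCR remains the fastest in wall-clock time, so the efficiency advantage depends on which column is being compared.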

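IDRNet and its variant IDRNet+ deliver consistent performance gains on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, LIP, and PASCAL-Context.
On ADE20K, IDRNet with a baseline backbone achieves notable mIoU gains over several context schemes (for example, IDRNet alone reaches 43.61% mIoU on ADE20K in the ablation study, and IDRNet+ shows notable gains when paired with UPerNet).
The method achieves competitive or better results with a comparatively lightweight context module: IDRNet uses fewer parameters and FLOPs than many competitors (10.79M parameters, 155.89G FLOPs, 20.52 ms, and 365.66M GPU memory for 43.61% mIoU on ADE20K).
Deletion diagnostics (DD) outperforms backpropagation (BP) for updating the relation matrix M_r (BD/DD-driven updates show improvements; for example, a DD-driven M_r raises ADE20K mIoU by 3.26% over BP).
Balanced deletion samples rare classes more often and improves performance on datasets such as ADE20K, PASCAL-Context, and COCO-Stuff.
Cross-domain gains are also observed: DeeplabV3+IDRNet trained on Cityscapes improves mIoU by 3.63 and 1.94 when transferred to Dark Zurich and Nighttime Driving, respectively.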
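
The balanced deletion finding above can be illustrated with a small sampling sketch. The review does not describe the exact balancing rule, so the inverse-count weighting below is an assumption made for this example, and `balanced_choice` is a hypothetical helper name.

```python
import random


def balanced_choice(present_classes, deletion_counts, rng=random):
    """Pick a class to delete, favouring classes selected less often so far.

    The inverse-count weighting is an assumed rule for illustration only.
    """
    weights = [1.0 / (1 + deletion_counts.get(c, 0)) for c in present_classes]
    choice = rng.choices(present_classes, weights=weights, k=1)[0]
    deletion_counts[choice] = deletion_counts.get(choice, 0) + 1
    return choice


# Class 0 is common and has already been deleted many times; class 1 is rare.
rng = random.Random(0)
counts = {0: 1000}
picks = [balanced_choice([0, 1], counts, rng) for _ in range(200)]
```

With uniform sampling both classes would be chosen about equally often whenever they co-occur; the weighting shifts most selections toward the class that has been intervened on less.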

Figure 2: Illustration of our intervention-driven relation network (IDRNet). We first extract pixel representations $R_{p}$ using a backbone network $\mathcal{F}_{B}$, e.g., ResNet [30] or SwinTransformer [15]. Then, $R_{p}$ is grouped into semantic-level representations $R_{sl}$ based on a c


This review was generated by AI and reviewed by a human editor.