QUICK REVIEW

[Paper Review] Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision

Wanxin Tian, Zixuan Wang | arXiv (Cornell University) | Nov 20, 2018

Face recognition and analysis | References: 46 | Cited by: 33

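One-Sentence Summary

This paper proposes DF²S², a single-shot face detector that improves feature learning through a novel feature fusion pyramid and a self-supervised segmentation branch. It fuses high-level semantics with low-level details using spatial and channel-wise attention and uses weakly supervised segmentation to learn discriminative features, reaching state-of-the-art mAP of 95.6% (Easy), 94.7% (Medium), and 89.8% (Hard) on the WIDER FACE dataset while keeping a real-time inference speed of 26.45 FPS.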

ABSTRACT

The performance of face detectors has been largely improved with the development of convolutional neural network. However, it remains challenging for face detectors to detect tiny, occluded or blurry faces. Besides, most face detectors can't locate face's position precisely and can't achieve high Intersection-over-Union (IoU) scores. We assume that problems inside are inadequate use of supervision information and imbalance between semantics and details at all level feature maps in CNN even with Feature Pyramid Networks (FPN). In this paper, we present a novel single-shot face detection network, named DF$^2$S$^2$ (Detection with Feature Fusion and Segmentation Supervision), which introduces a more effective feature fusion pyramid and a more efficient segmentation branch on ResNet-50 to handle mentioned problems. Specifically, inspired by FPN and SENet, we apply semantic information from higher-level feature maps as contextual cues to augment low-level feature maps via a spatial and channel-wise attention style, preventing details from being covered by too much semantics and making semantics and details complement each other. We further propose a semantic segmentation branch to best utilize detection supervision information meanwhile applying attention mechanism in a self-supervised manner. The segmentation branch is supervised by weak segmentation ground-truth (no extra annotation is required) in a hierarchical manner, deprecated in the inference time so it wouldn't compromise the inference speed. We evaluate our model on WIDER FACE dataset and achieved state-of-art results.

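Research Motivation and Goals

Address the challenge of detecting tiny, occluded, or blurry faces in real-world scenes.
Improve feature representation by balancing semantic information and detail information across the levels of the feature pyramid.
Make better use of the supervision signal through a segmentation branch, overcoming limitations of anchor-based detection.
Raise detection accuracy while keeping real-time inference speed through an efficient network design.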

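Proposed Method

Proposes a feature fusion mechanism based on spatial and channel-wise attention, in which high-level semantic features act as contextual cues that augment the low-level feature maps.
Introduces a self-supervised semantic segmentation branch trained on weak segmentation ground truth derived from the bounding-box annotations, guiding feature learning without extra annotation.
Applies hierarchical supervision in the segmentation branch to make the features more discriminative without affecting inference speed.
Uses transposed convolution to upsample features, preserving spatial resolution during fusion and minimizing information loss.
Adopts a multi-task training strategy that combines the detection and segmentation losses, with adaptive weighting to balance the optimization.
Deploys the segmentation branch only during training and removes it at inference to avoid any slowdown.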
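
The attention-style fusion in the first item above can be illustrated with a small, parameter-free sketch. This is not the authors' implementation: the names `upsample_nearest` and `fuse` are hypothetical, nearest-neighbour upsampling stands in for the learned transposed convolution, and the gates are plain sigmoids of pooled activations rather than learned layers. It only shows the idea of letting the higher-level map produce a per-channel gate and a per-location gate that reweight the lower-level map.

```python
import math
from typing import List

# A feature map as nested lists: channels x height x width.
Feature = List[List[List[float]]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(feat: Feature, factor: int) -> Feature:
    """Nearest-neighbour upsampling (assumed stand-in for a transposed convolution)."""
    out = []
    for channel in feat:
        rows = []
        for i in range(len(channel) * factor):
            src = channel[i // factor]
            rows.append([src[j // factor] for j in range(len(src) * factor)])
        out.append(rows)
    return out


def fuse(low: Feature, high: Feature, factor: int = 2) -> Feature:
    """Reweight `low` with channel and spatial gates computed from `high`."""
    up = upsample_nearest(high, factor)
    c, h, w = len(low), len(low[0]), len(low[0][0])
    if (len(up), len(up[0]), len(up[0][0])) != (c, h, w):
        raise ValueError("upsampled high-level map must match the low-level shape")

    # Channel gate: global average pool per channel, then sigmoid.
    channel_gate = [sigmoid(sum(map(sum, up[k])) / (h * w)) for k in range(c)]

    # Spatial gate: mean over channels at each location, then sigmoid.
    spatial_gate = [
        [sigmoid(sum(up[k][i][j] for k in range(c)) / c) for j in range(w)]
        for i in range(h)
    ]

    # The low-level details are scaled, not replaced, by the semantic cues.
    return [
        [[low[k][i][j] * channel_gate[k] * spatial_gate[i][j] for j in range(w)]
         for i in range(h)]
        for k in range(c)
    ]
```

Because the gates only multiply the low-level activations, the semantics modulate the details instead of being added on top of them, which is one way to read the stated goal of keeping details from being covered by semantics.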

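Experimental Results

Research Questions

RQ1: How can feature fusion in a face detector better balance semantic richness and spatial detail without suppressing fine-grained features?
RQ2: Can a self-supervised segmentation branch improve feature learning for face detection without relying on extra annotation?
RQ3: How much does integrating segmentation supervision improve performance on challenging benchmarks such as WIDER FACE?
RQ4: How does the method compare with state-of-the-art single-shot face detectors in accuracy and speed?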

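Key Findings

With ResNet-50 as the backbone, DF²S² achieves state-of-the-art mAP of 95.6%, 94.7%, and 89.8% on the Easy, Medium, and Hard subsets of the WIDER FACE validation set.
With ResNet-101 as the backbone, the model reaches mAP of 96.9%, 95.9%, and 91.2% on the Easy, Medium, and Hard subsets, showing that the design scales to a larger backbone.
Compared with PyramidBox, the model improves detection performance on the Hard subset by 0.9 percentage points, indicating greater robustness to occlusion and scale variation.
The best segmentation loss weight is λ₂ = 0.05; other values degrade performance only marginally, which indicates a stable training process.
On a Tesla P40 GPU with 640×512 input, the model keeps a real-time inference speed of 26.45 FPS, confirming its efficiency even with the added components.
Ablation experiments show that the attention-based fusion and the segmentation branch each contribute gains independently, with the segmentation branch being especially effective on hard samples.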
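
To make the role of the segmentation loss weight more concrete, the sketch below shows one way weak segmentation targets could be rasterized from bounding boxes at several resolutions and combined with a detection loss. Everything here is an assumption for illustration: the function names, the cell-center rule for filling masks, the choice of binary cross-entropy, and the averaging over levels are not taken from the paper; only the default weight of 0.05 comes from the findings above.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in input-image pixels


def weak_mask(boxes: Sequence[Box], height: int, width: int, stride: int) -> List[List[int]]:
    """Binary mask at 1/stride resolution: 1 where a cell center lies inside a box."""
    h, w = -(-height // stride), -(-width // stride)  # ceiling division
    mask = [[0] * w for _ in range(h)]
    for x1, y1, x2, y2 in boxes:
        for r in range(h):
            cy = (r + 0.5) * stride
            if not (y1 <= cy < y2):
                continue
            for c in range(w):
                cx = (c + 0.5) * stride
                if x1 <= cx < x2:
                    mask[r][c] = 1
    return mask


def hierarchical_masks(boxes: Sequence[Box], height: int, width: int,
                       strides: Sequence[int] = (4, 8, 16)) -> Dict[int, List[List[int]]]:
    """One weak mask per pyramid level (the stride values are hypothetical)."""
    return {s: weak_mask(boxes, height, width, s) for s in strides}


def bce(pred: List[List[float]], target: List[List[int]]) -> float:
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    eps = 1e-7
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def total_loss(det_loss: float,
               seg_preds: Dict[int, List[List[float]]],
               seg_targets: Dict[int, List[List[int]]],
               lam_seg: float = 0.05) -> float:
    """Detection loss plus the weighted segmentation loss averaged over levels."""
    seg = sum(bce(seg_preds[s], seg_targets[s]) for s in seg_targets) / len(seg_targets)
    return det_loss + lam_seg * seg
```

In this arrangement the segmentation term only appears in the training objective, so dropping the branch at inference removes its cost without changing the detection outputs.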

This review was generated by AI and reviewed by a human editor.