QUICK REVIEW

[Paper Review] Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition

Kai Wang, Xiaojiang Peng|arXiv (Cornell University)|May 10, 2019

Emotion and Mood Recognition | References: 69 | Cited by: 28

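ONE-SENTENCE SUMMARY

This paper proposes a Region Attention Network (RAN) for pose- and occlusion-robust facial expression recognition: an adaptive attention mechanism focuses on the key facial regions, and a region biased loss prioritizes the regions tied to facial action units. The method achieves state-of-the-art results on FERPlus, AffectNet, RAF-DB, and SFEW, with accuracy of up to 89.16% on FERPlus and 59.5% on AffectNet (with oversampling).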

ABSTRACT

Occlusion and pose variations, which can change facial appearance significantly, are two major obstacles for automatic Facial Expression Recognition (FER). Though automatic FER has made substantial progresses in the past few decades, occlusion-robust and pose-invariant issues of FER have received relatively less attention, especially in real-world scenarios. This paper addresses the real-world pose and occlusion robust FER problem with three-fold contributions. First, to stimulate the research of FER under real-world occlusions and variant poses, we build several in-the-wild facial expression datasets with manual annotations for the community. Second, we propose a novel Region Attention Network (RAN), to adaptively capture the importance of facial regions for occlusion and pose variant FER. The RAN aggregates and embeds varied number of region features produced by a backbone convolutional neural network into a compact fixed-length representation. Last, inspired by the fact that facial expressions are mainly defined by facial action units, we propose a region biased loss to encourage high attention weights for the most important regions. We validate our RAN and region biased loss on both our built test datasets and four popular datasets: FERPlus, AffectNet, RAF-DB, and SFEW. Extensive experiments show that our RAN and region biased loss largely improve the performance of FER with occlusion and variant pose. Our method also achieves state-of-the-art results on FERPlus, AffectNet, RAF-DB, and SFEW. Code and the collected test data will be publicly available.

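MOTIVATION AND OBJECTIVES

Address the lack of annotated real-world datasets for facial expression recognition under occlusion and pose variation.
Develop a deep learning model that adaptively weights facial regions to improve robustness to occlusion and pose variation.
Design a region biased loss that encourages the model to assign higher attention weights to the regions containing key facial action units.
Demonstrate state-of-the-art performance on multiple benchmark datasets under challenging real-world conditions.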

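PROPOSED METHOD

Annotate in-the-wild datasets (FERPlus, AffectNet, RAF-DB, SFEW) with pose and occlusion attributes to build new benchmark test sets.
Propose a Region Attention Network (RAN) that uses self-attention and relation-attention modules to aggregate the features of multiple facial regions into a fixed-length representation.
Integrate a region biased loss (RB-Loss) that encourages higher attention weights for the facial regions associated with key action units.
Extract region features with a backbone convolutional neural network (e.g., ResNet18, VGG16), then learn dynamic attention weights through the RAN end to end.
Augment the data through region cropping and rescaling to improve feature learning on rare or hard samples.
Fine-tune models pre-trained on face data (e.g., VGGFace, MS-Celeb-1M) with RAN and RB-Loss to improve generalization on real-world FER tasks.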
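The aggregation and loss steps listed above can be illustrated with a minimal NumPy sketch. This summary does not give the exact formulation, so everything here is an assumption made for illustration: sigmoid-scored attention, the names `ran_aggregate`, `region_biased_loss`, `q0`, `q1`, the convention that row 0 holds the whole-face feature, and the margin value. It is not the authors' implementation. The point it shows is that any number of region features pools into a vector of the same fixed length.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ran_aggregate(features, q0, q1):
    """Pool K region features of size D into one vector of size 2*D.

    features: (K, D) backbone features, one row per region (hypothetical layout)
    q0: (D,) self-attention parameters; q1: (2*D,) relation-attention parameters
    """
    # Self-attention: one scalar weight per region.
    mu = sigmoid(features @ q0)                                # (K,)
    # Coarse global feature: attention-weighted mean of the regions.
    f_m = (mu[:, None] * features).sum(axis=0) / mu.sum()      # (D,)
    # Relation-attention: score each region together with the global feature.
    paired = np.concatenate(
        [features, np.broadcast_to(f_m, features.shape)], axis=1
    )                                                          # (K, 2*D)
    nu = sigmoid(paired @ q1)                                  # (K,)
    w = mu * nu
    pooled = (w[:, None] * paired).sum(axis=0) / w.sum()       # (2*D,)
    return pooled, mu

def region_biased_loss(mu, margin=0.02):
    """Hinge-style penalty, assuming mu[0] belongs to the whole face.

    The loss is zero once some cropped region's weight exceeds the
    whole-face weight by at least `margin` (an illustrative value).
    """
    return float(max(0.0, margin - (mu[1:].max() - mu[0])))

rng = np.random.default_rng(0)
D = 8
q0, q1 = rng.normal(size=D), rng.normal(size=2 * D)
for k in (3, 6):  # different numbers of regions, same output length
    pooled, mu = ran_aggregate(rng.normal(size=(k, D)), q0, q1)
    print(k, pooled.shape, region_biased_loss(mu))
```

In a real model `q0` and `q1` would be learned jointly with the backbone, and the region biased term would be added to the classification loss; both are fixed random vectors here only to keep the sketch self-contained.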

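EXPERIMENTAL RESULTS

Research Questions

RQ1: To what extent do occlusion and pose variation degrade the performance of existing facial expression recognition models on real-world datasets?
RQ2: Can a learnable attention mechanism focused on facial regions make FER models more robust to occlusion and pose variation?
RQ3: To what extent does the region biased loss strengthen the model's attention to key facial action units and thereby improve expression recognition performance?
RQ4: Does the proposed RAN framework achieve state-of-the-art performance across multiple benchmark datasets under real-world occlusion and pose conditions?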

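Key Findings

The proposed RAN reaches 89.16% accuracy on FERPlus, surpassing the previous state-of-the-art methods.
On AffectNet, RAN reaches 59.5% accuracy with oversampling, outperforming previous methods that use larger networks or additional datasets.
On RAF-DB, RAN reaches 86.90% accuracy, 2.77 and 1.83 percentage points higher than DLP-CNN and gACNN, respectively.
On SFEW, a single RAN model reaches 54.19% on the validation set, the best single-model result reported to date.
An ensemble of RAN-ResNet18 and RAN-VGG16 reaches 56.4% on SFEW, outperforming previous ensemble methods.
RAN increases inference time to 0.025 seconds per image (versus 0.006 seconds for the baseline) but remains efficient under GPU parallel processing.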
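As a quick check on the figures above, the short sketch below derives what they imply: the RAF-DB accuracies of the two compared methods and the relative per-image latency. The variable names are illustrative, and the derived values follow only from the numbers quoted in this review, not from any additional source.

```python
# Figures quoted in the findings above.
ran_raf_db = 86.90                              # RAN accuracy on RAF-DB (%)
margins = {"DLP-CNN": 2.77, "gACNN": 1.83}      # reported gaps on RAF-DB
ran_latency, baseline_latency = 0.025, 0.006    # seconds per image

# Accuracies implied for the compared methods.
implied = {name: round(ran_raf_db - gap, 2) for name, gap in margins.items()}

# Relative cost of RAN per image when images are processed one at a time.
slowdown = ran_latency / baseline_latency

print(implied)
print(f"RAN is about {slowdown:.1f}x slower per image")
```

The slowdown is a per-image latency ratio; throughput under batched GPU processing is a separate measurement that these two numbers alone do not determine.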


This review was generated by AI and checked by a human editor.