QUICK REVIEW

[Paper Review] Robust Facial Expression Recognition with Convolutional Visual Transformers

Fuyan Ma, Bin Sun|arXiv (Cornell University)|Mar 31, 2021

Emotion and Mood Recognition | References: 56 | Cited by: 31

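One-Sentence Summary

The paper proposes Convolutional Visual Transformers, which selectively fuse multi-scale CNN features through attention and model the resulting visual tokens with global self-attention, to achieve robust facial expression recognition in real-world conditions. It reports state-of-the-art results on RAF-DB (88.14%), FERPlus (88.81%), and AffectNet (61.85%), demonstrating strong generalization and robustness under real-world challenges such as occlusion and pose variation.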

ABSTRACT

Facial Expression Recognition (FER) in the wild is extremely challenging due to occlusions, variant head poses, face deformation and motion blur under unconstrained conditions. Although substantial progress has been made in automatic FER in the past few decades, previous studies are mainly designed for lab-controlled FER. Real-world occlusions, variant head poses and other issues definitely increase the difficulty of FER on account of these information-deficient regions and complex backgrounds. Different from previous pure CNN-based methods, we argue that it is feasible and practical to translate facial images into sequences of visual words and perform expression recognition from a global perspective. Therefore, we propose Convolutional Visual Transformers to tackle FER in the wild by two main steps. First, we propose an attentional selective fusion (ASF) for leveraging the feature maps generated by two-branch CNNs. The ASF captures discriminative information by fusing multiple features with global-local attention. The fused feature maps are then flattened and projected into sequences of visual words. Second, inspired by the success of Transformers in natural language processing, we propose to model relationships between these visual words with global self-attention. The proposed method is evaluated on three public in-the-wild facial expression datasets (RAF-DB, FERPlus and AffectNet). Under the same settings, extensive experiments demonstrate that our method shows superior performance over other methods, setting new state of the art on RAF-DB with 88.14%, FERPlus with 88.81% and AffectNet with 61.85%. We also conduct cross-dataset evaluation on CK+ to show the generalization capability of the proposed method.

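Research Motivation and Objectives

Address the challenge of facial expression recognition in unconstrained real-world environments, where occlusion, pose variation, and motion blur degrade performance.
Overcome the limitations of earlier CNN-based methods trained on lab-controlled data by introducing a global sequence-modeling approach.
Achieve robust expression recognition by converting facial features into sequences of visual words and modeling long-range dependencies with self-attention.
Demonstrate the model's generalization capability through cross-dataset evaluation on CK+, showing that it transfers beyond the training domain.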

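Proposed Method

Use a two-branch CNN architecture to extract multi-scale feature maps from facial images, capturing both local and global representations.
Introduce an attentional selective fusion (ASF) module that adaptively fuses the features with global-local attention to highlight discriminative regions.
Flatten the fused feature maps and project them into a sequence of visual tokens, treating each token as a learnable visual word.
Apply a Transformer encoder with multi-head self-attention to model long-range dependencies among the visual tokens and improve expression classification.
Train the model end to end with cross-entropy loss on standard in-the-wild expression recognition datasets.
Use positional encoding to preserve spatial relationships within the visual token sequence.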
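
The data flow described above (fuse two branches, flatten into tokens, add positions, apply self-attention, classify) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the gate formula in `selective_fusion`, the single attention head, the mean pooling before the classifier, the random weights, and all sizes and names (`C`, `H`, `W`, `D`, `NUM_CLASSES`) are hypothetical choices made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def selective_fusion(a, b, w_local, w_global):
    """Fuse two (C, H, W) feature maps with a global-local attention gate.

    Assumption: the gate form below is a generic attentional-fusion pattern,
    not necessarily the exact ASF formulation from the paper.
    """
    s = a + b
    # Local context: per-position channel mixing (equivalent to a 1x1 convolution).
    local = np.einsum("oc,chw->ohw", w_local, s)
    # Global context: channel mixing of the spatially averaged descriptor.
    glob = (w_global @ s.mean(axis=(1, 2)))[:, None, None]
    gate = sigmoid(local + glob)            # shape (C, H, W), values in (0, 1)
    return gate * a + (1.0 - gate) * b      # element-wise convex combination

def to_tokens(fmap, w_proj, pos):
    """Flatten (C, H, W) into H*W tokens, project to D dims, add positions."""
    c, h, w = fmap.shape
    tokens = fmap.reshape(c, h * w).T       # (H*W, C): one token per location
    return tokens @ w_proj + pos            # (H*W, D)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N), rows sum to 1
    return weights @ v, weights

# Hypothetical sizes: 8 channels, 4x4 feature maps, 16-dim tokens, 7 classes.
C, H, W, D, NUM_CLASSES = 8, 4, 4, 16, 7

branch_a = rng.standard_normal((C, H, W))   # stand-in for CNN branch 1 output
branch_b = rng.standard_normal((C, H, W))   # stand-in for CNN branch 2 output

w_local = rng.standard_normal((C, C)) * 0.1
w_global = rng.standard_normal((C, C)) * 0.1
fused = selective_fusion(branch_a, branch_b, w_local, w_global)

w_proj = rng.standard_normal((C, D)) * 0.1
pos = rng.standard_normal((H * W, D)) * 0.1  # stand-in for learned positional encoding
tokens = to_tokens(fused, w_proj, pos)

w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
attended, attn = self_attention(tokens, w_q, w_k, w_v)

# Assumption: mean-pool the tokens, then apply a linear classifier.
w_cls = rng.standard_normal((D, NUM_CLASSES)) * 0.1
probs = softmax(attended.mean(axis=0) @ w_cls)
```

Each row of `attn` weights every spatial location against every other one, which is what lets a token from an occluded region draw on evidence from distant, visible regions. A real model would use trained weights, multiple heads, and stacked encoder layers.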

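Experimental Results

Research Questions

RQ1: Can modeling sequences of visual tokens with self-attention improve the robustness of facial expression recognition in unconstrained environments?
RQ2: Compared with standard CNN feature fusion, how much does attention-based feature fusion improve discriminative representation learning?
RQ3: To what extent does the proposed method generalize across datasets with different data distributions?
RQ4: Does the Transformer's global modeling capability outperform CNNs based on local receptive fields on real-world FER benchmarks?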

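Key Findings

Under the standard evaluation setting, the method achieves a state-of-the-art accuracy of 88.14% on RAF-DB.
On FERPlus, it reaches 88.81% accuracy, outperforming previous methods in the in-the-wild setting.
On the large-scale AffectNet dataset, the model achieves 61.85% accuracy, demonstrating its effectiveness under high variability and real-world noise.
Cross-dataset evaluation on CK+ shows strong generalization, suggesting that the model learns robust and disentangled facial expression representations.
The attentional selective fusion mechanism effectively strengthens feature representations by focusing on discriminative facial regions.
Combining visual tokens with self-attention significantly improves performance over pure-CNN baselines on all three in-the-wild datasets.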


This review was generated by AI and reviewed by human editors.