QUICK REVIEW

[Paper Review] Cross-Task Multi-Branch Vision Transformer for Facial Expression and Mask Wearing Classification

Armando Zhu, Keqin Li|arXiv (Cornell University)|Apr 22, 2024

Industrial Vision Systems and Defect Detection|Cited by 13

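One-Sentence Summary

A unified dual-branch Vision Transformer that performs facial expression recognition (FER) and mask wearing classification simultaneously, with a cross-task fusion phase that exchanges information between the two tasks via cross attention.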

ABSTRACT

With wearing masks becoming a new cultural norm, facial expression recognition (FER) while taking masks into account has become a significant challenge. In this paper, we propose a unified multi-branch vision transformer for facial expression recognition and mask wearing classification tasks. Our approach extracts shared features for both tasks using a dual-branch architecture that obtains multi-scale feature representations. Furthermore, we propose a cross-task fusion phase that processes tokens for each task with separate branches, while exchanging information using a cross attention module. Our proposed framework reduces the overall complexity compared with using separate networks for both tasks by the simple yet effective cross-task fusion phase. Extensive experiments demonstrate that our proposed model performs better than or on par with different state-of-the-art methods on both facial expression recognition and facial mask wearing classification task.

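Motivation and Objectives

Treat FER under mask-wearing conditions as a single unified problem.
Exploit both shared and task-specific features through a dual-branch architecture with multi-scale representations.
Introduce a cross-task fusion phase that lowers model complexity relative to separate networks while maintaining performance.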

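Proposed Method

Use a dual-branch Vision Transformer to extract shared multi-scale features for FER and mask wearing classification.
Process each task's tokens in its own branch while still exchanging information between the branches.
Introduce a cross-task fusion phase with a cross attention module to share information across tasks.
Aim to reduce overall complexity relative to separate networks while maintaining performance.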
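
The cross-task fusion phase described above can be illustrated with a minimal, dependency-free sketch of cross attention between two token sets. This is not the authors' implementation: the function names (`cross_attention`, `cross_task_fusion`), the residual connection, the choice to let every token of one task query every token of the other, and the omission of learned query/key/value projections are all assumptions made to keep the example short. A real model would use a deep learning framework with learned projections and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query_tokens, context_tokens):
    """Scaled dot-product attention where queries come from one task
    and keys/values come from the other.

    Assumption: identity projections (no learned W_q, W_k, W_v).
    """
    dim = len(query_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in query_tokens:
        weights = softmax([dot(q, k) * scale for k in context_tokens])
        mixed = [
            sum(w * k[i] for w, k in zip(weights, context_tokens))
            for i in range(dim)
        ]
        outputs.append(mixed)
    return outputs


def cross_task_fusion(fer_tokens, mask_tokens):
    """Each task keeps its own token stream and adds what it reads
    from the other task (hypothetical residual-style fusion)."""
    fer_ctx = cross_attention(fer_tokens, mask_tokens)
    mask_ctx = cross_attention(mask_tokens, fer_tokens)
    fer_out = [[a + b for a, b in zip(t, c)] for t, c in zip(fer_tokens, fer_ctx)]
    mask_out = [[a + b for a, b in zip(t, c)] for t, c in zip(mask_tokens, mask_ctx)]
    return fer_out, mask_out


# Toy inputs: 2 FER tokens and 3 mask tokens, each of dimension 2.
fer_tokens = [[1.0, 0.0], [0.0, 1.0]]
mask_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fer_out, mask_out = cross_task_fusion(fer_tokens, mask_tokens)
```

Each task's output keeps the same number of tokens and the same dimension as its input, so the two branches can continue to be processed separately after fusion.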

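Experimental Results

Research Questions

RQ1: Does a unified multi-branch architecture improve FER under mask wearing relative to task-specific models?
RQ2: Does cross-task fusion via cross attention improve performance on both FER and mask wearing classification?
RQ3: Is the proposed cross-task architecture more efficient than using two separate networks for the two tasks?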

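Key Findings

The proposed model performs robustly on par with existing methods on both FER and mask wearing classification.
Cross-task fusion with cross attention facilitates information exchange and improves recognition performance under occlusion.
Compared with using a separate network for each task, the framework reduces overall complexity.
Experiments show that the model matches or outperforms several baseline methods on both tasks.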
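
To see why sharing a backbone can reduce complexity relative to two separate networks, the following back-of-the-envelope sketch counts parameters of standard transformer encoder blocks. Every number and name here is hypothetical: the layer sizes, block counts, and the split into "backbone" and "task-specific" blocks are illustrative assumptions, not figures from the paper, and patch embeddings and classification heads are ignored.

```python
def encoder_block_params(d_model, d_ff):
    """Parameters of one standard transformer encoder block."""
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V, output projections with biases
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers
    norms = 2 * (2 * d_model)  # two LayerNorms, each with scale and shift
    return attn + mlp + norms


def cross_attention_params(d_model):
    """Parameters of one cross attention module (Q, K, V, output projections)."""
    return 4 * (d_model * d_model + d_model)


def separate_networks_params(d_model, d_ff, n_backbone, n_task_blocks, n_tasks=2):
    """Each task gets its own full network: backbone plus task blocks."""
    block = encoder_block_params(d_model, d_ff)
    return n_tasks * (n_backbone + n_task_blocks) * block


def unified_network_params(d_model, d_ff, n_backbone, n_task_blocks, n_tasks=2):
    """One shared backbone, then per-task blocks plus a cross attention module."""
    block = encoder_block_params(d_model, d_ff)
    shared = n_backbone * block
    per_task = n_task_blocks * block + cross_attention_params(d_model)
    return shared + n_tasks * per_task


# Hypothetical configuration, chosen only for illustration.
separate = separate_networks_params(d_model=384, d_ff=1536, n_backbone=9, n_task_blocks=3)
unified = unified_network_params(d_model=384, d_ff=1536, n_backbone=9, n_task_blocks=3)
savings = separate - unified
```

Under these assumptions the saving equals one copy of the backbone minus the added cross attention modules, so it grows with the fraction of the network that is shared.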


This review was generated by AI and checked by human editors.