QUICK REVIEW

[Paper Digest] Multimodal Engagement Analysis from Facial Videos in the Classroom

Ömer Sümer, Patricia Goldberg|arXiv (Cornell University)|Jan 11, 2021

Online Learning and Analytics | Cited by 2

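One-Sentence Summary

This study proposes a computer-vision system that automatically analyzes student engagement in the classroom from facial videos. It trains deep models for head pose estimation (Attention-Net) and facial expression recognition (Affect-Net), then trains several engagement classifiers (SVM, Random Forest, MLP, LSTM) on the resulting features. The best classifiers reach AUCs of .620 (Grade 8) and .720 (Grade 12); score-level fusion improves on or matches the best single modality, and personalization raises AUC by .084 on average.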

ABSTRACT

Student engagement is a key construct for learning and teaching. While most of the literature explored the student engagement analysis on computer-based settings, this paper extends that focus to classroom instruction. To best examine student visual engagement in the classroom, we conducted a study utilizing the audiovisual recordings of classes at a secondary school over one and a half month's time, acquired continuous engagement labeling per student (N=15) in repeated sessions, and explored computer vision methods to classify engagement levels from faces in the classroom. We trained deep embeddings for attentional and emotional features, training Attention-Net for head pose estimation and Affect-Net for facial expression recognition. We additionally trained different engagement classifiers, consisting of Support Vector Machines, Random Forest, Multilayer Perceptron, and Long Short-Term Memory, for both features. The best performing engagement classifiers achieved AUCs of .620 and .720 in Grades 8 and 12, respectively. We further investigated fusion strategies and found score-level fusion either improves the engagement classifiers or is on par with the best performing modality. We also investigated the effect of personalization and found that using only 60-seconds of person-specific data selected by margin uncertainty of the base classifier yielded an average AUC improvement of .084. Our main aim with this work is to provide the technical means to facilitate the manual data analysis of classroom videos in research on teaching quality and in the context of teacher training.

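Motivation and Objectives

Develop automated methods, based on facial video data, for assessing student engagement in real classroom settings.
Explore the feasibility of estimating engagement from visual cues such as attention and emotion using computer vision and deep learning.
Evaluate how feature fusion and personalization affect engagement classification performance.
Support research on teaching quality and teacher training by making the analysis of classroom video data more scalable and efficient.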

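Proposed Method

Record audiovisual data of classes at a secondary school over one and a half months, with continuous engagement labels for each student (N=15, Grades 8 and 12) across repeated sessions.
Train deep neural networks: Attention-Net for head pose estimation and Affect-Net for facial expression recognition.
Use the deep embeddings from these networks to extract attentional and emotional features from the facial videos.
Train several engagement classifiers on these features: SVM, Random Forest, Multilayer Perceptron, and LSTM.
Apply score-level fusion to combine the predictions of the attention-based and affect-based classifiers.
Select 60 seconds of person-specific data by the margin uncertainty of the base classifier and use it to personalize the model for each student.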
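
The score-level fusion step can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_scores` and `auc`, the equal weighting of the two modalities, and the toy scores are all assumptions made for illustration. It assumes each modality's classifier outputs one engagement score per sample, averages the two scores, and evaluates the result with a rank-based AUC.

```python
from typing import List, Sequence


def fuse_scores(attention_scores: Sequence[float],
                affect_scores: Sequence[float],
                weight: float = 0.5) -> List[float]:
    """Score-level fusion as a weighted average of two per-sample score lists.

    `weight` is the share given to the attention modality (hypothetical default).
    """
    if len(attention_scores) != len(affect_scores):
        raise ValueError("score lists must have the same length")
    return [weight * a + (1.0 - weight) * e
            for a, e in zip(attention_scores, affect_scores)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up toy values, not data from the paper.
    labels = [1, 0, 1, 0]                 # 1 = engaged, 0 = not engaged
    attention = [0.9, 0.6, 0.4, 0.2]      # scores from the head-pose branch
    affect = [0.5, 0.3, 0.8, 0.6]         # scores from the expression branch
    fused = fuse_scores(attention, affect)
    print("attention AUC:", auc(labels, attention))
    print("affect AUC:   ", auc(labels, affect))
    print("fused AUC:    ", auc(labels, fused))
```

The toy scores were chosen so that each modality misranks a different pair of samples, which is the situation in which averaging helps. They say nothing about the magnitude of the effect reported in the paper.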

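Experimental Results

Research Questions

RQ1: Can computer-vision analysis of facial videos accurately estimate student engagement in a real classroom setting?
RQ2: How do different engagement classifiers (SVM, Random Forest, MLP, LSTM) perform when using attentional and emotional features from facial videos?
RQ3: Does score-level fusion improve engagement classification beyond the best single modality?
RQ4: To what extent does personalization with short, high-uncertainty data segments improve classifier performance?
RQ5: How do limited sample size and class imbalance affect the detection of low-engagement states?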

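Key Findings

The best-performing engagement classifiers reached an AUC of .720 in Grade 12 and .620 in Grade 8, indicating modest to moderate discriminative performance.
Score-level fusion either improved on or matched the best single modality, which suggests that the attention and affect modalities carry complementary information.
Personalization with only 60 seconds of person-specific data, selected by the margin uncertainty of the base classifier, raised AUC by .084 on average, showing the value of a minimal personalization strategy.
The models performed poorly at detecting low engagement, possibly because of class imbalance and the limited amount of data, which points to a key limitation of the current data collection.
The study suggests that automated engagement analysis from facial videos is feasible and scalable for classroom research, particularly when combined with personalization and fusion.
Ethical deployment requires deleting the raw video data immediately and aggregating the results to protect student privacy.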
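
Selection by margin uncertainty, as used in the personalization finding above, can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes a binary base classifier that outputs one probability of "engaged" per one-second segment, so a budget of 60 segments corresponds to 60 seconds. The names `margin` and `select_most_uncertain` are hypothetical, and how the selected segments are then used to adapt the model is left out.

```python
from typing import List, Sequence


def margin(prob_engaged: float) -> float:
    """Gap between the two class probabilities of a binary classifier.

    For p and 1 - p this is |2p - 1|; a small margin means an uncertain prediction.
    """
    return abs(2.0 * prob_engaged - 1.0)


def select_most_uncertain(probs: Sequence[float], budget: int) -> List[int]:
    """Return the indices of the `budget` segments with the smallest margin.

    Indices are returned in temporal order so they can be used to slice a recording.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    ranked = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Made-up per-segment probabilities from a hypothetical base classifier.
    probs = [0.95, 0.52, 0.10, 0.45, 0.70]
    # With one-second segments, budget=60 would select 60 seconds of data.
    print(select_most_uncertain(probs, budget=2))
```

Segments near a probability of 0.5 are the ones the base classifier is least sure about, so labeling them for a given student is a plausible way to spend a small annotation budget.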


This digest was generated by AI and reviewed by a human editor.