QUICK REVIEW

[Paper Review] Depression Scale Recognition from Audio, Visual and Text Analysis

Shubham Dham, Anirudh Sharma|arXiv (Cornell University)|Sep 18, 2017

Emotion and Mood Recognition | References: 13 | Cited by: 37

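One-Sentence Summary

This paper proposes a multimodal approach that uses audio, visual, and text features to predict depression scale scores from the clinical interviews in the DAIC-WOZ dataset. It applies Gaussian Mixture Model clustering and Fisher vector encoding to facial landmarks, combines them with low-level audio and text features, and fuses the per-modality predictions at the decision level by averaging or by taking the maximum output. On the validation set, it lowers RMSE relative to the provided baseline by 24.5% with video features and by 17% with audio features.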

ABSTRACT

Depression is a major mental health disorder that is rapidly affecting lives worldwide. Depression not only impacts emotional but also physical and psychological state of the person. Its symptoms include lack of interest in daily activities, feeling low, anxiety, frustration, loss of weight and even feeling of self-hatred. This report describes work done by us for Audio Visual Emotion Challenge (AVEC) 2017 during our second year BTech summer internship. With the increase in demand to detect depression automatically with the help of machine learning algorithms, we present our multimodal feature extraction and decision level fusion approach for the same. Features are extracted by processing on the provided Distress Analysis Interview Corpus-Wizard of Oz (DAIC-WOZ) database. Gaussian Mixture Model (GMM) clustering and Fisher vector approach were applied on the visual data; statistical descriptors on gaze, pose; low level audio features and head pose and text features were also extracted. Classification is done on fused as well as independent features using Support Vector Machine (SVM) and neural networks. The results obtained were able to cross the provided baseline on validation data set by 17% on audio features and 24.5% on video features.

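Motivation and Objectives

Develop an automated system that recognizes depression scale scores from multimodal clinical interview data.
Surpass the challenge baseline in AVEC 2017 through feature engineering and fusion strategies.
Examine how effective Fisher vector encoding and Gaussian Mixture Model clustering are on facial motion and pose features for depression detection.
Evaluate Support Vector Machines and neural networks on single-modality and fused-modality features.
Tune decision-level fusion (average and maximum output) to improve regression performance on depression severity scores.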

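Proposed Method

Extract 2D and 3D facial landmarks, gaze direction, head pose, and action units from the DAIC-WOZ dataset for visual feature engineering.
Apply Gaussian Mixture Model (GMM) clustering and Fisher vector encoding to the relative distances between facial regions to capture the temporal dynamics of facial expressions.
Compute statistical descriptors of gaze, head pose, and blink rate to model non-verbal behavioral cues.
Extract low-level audio features (e.g., voice, cepstral, and spectral features) to detect depression-related patterns.
Process the interview transcripts with word-level features (e.g., negative-word frequency and arousal-valence scores).
Train Support Vector Machines (SVMs) and feed-forward neural networks on single-modality and fused feature sets, with the networks optimized by Adam against RMSE and MAE losses.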
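The Fisher vector step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a diagonal-covariance GMM has already been fitted to per-frame descriptors (for example, relative distances between facial regions), and it applies the power and L2 normalization of the "improved" Fisher vector formulation, which the summary does not confirm the paper used. The function name, the toy GMM parameters, and the random data are all hypothetical.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Encode descriptors X (T, D) against a diagonal GMM with K components.

    Returns a vector of length 2*K*D: gradients with respect to the
    component means followed by gradients with respect to the variances.
    """
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)      # (K,)
    means = np.asarray(means, dtype=float)          # (K, D)
    sigma = np.sqrt(np.asarray(variances, float))   # (K, D) standard deviations
    T, D = X.shape

    # Whitened differences between every frame and every component: (T, K, D)
    diff = (X[:, None, :] - means[None, :, :]) / sigma[None, :, :]

    # Log of w_k * N(x_t | mu_k, diag(var_k)), computed in the log domain
    log_prob = (np.log(weights)[None, :]
                - 0.5 * np.sum(diff ** 2, axis=2)
                - np.sum(np.log(sigma), axis=1)[None, :]
                - 0.5 * D * np.log(2.0 * np.pi))
    log_prob -= log_prob.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)        # posteriors, (T, K)

    # Normalized gradients with respect to means and variances: (K, D) each
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(weights)[:, None])
    g_sigma = ((gamma[:, :, None] * (diff ** 2 - 1.0)).sum(axis=0)
               / (T * np.sqrt(2.0 * weights)[:, None]))

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization (assumed)
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv             # L2 normalization (assumed)

# Toy example: 200 frames of 3-dimensional descriptors, 2-component GMM.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 3))
gmm_weights = np.array([0.5, 0.5])
gmm_means = np.array([[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]])
gmm_variances = np.ones((2, 3))

fv = fisher_vector(frames, gmm_weights, gmm_means, gmm_variances)
print(fv.shape)  # (12,)
```

Each interview, whatever its length, maps to one fixed-length vector of size 2*K*D, which is what lets a standard regressor such as an SVM or a feed-forward network consume variable-length video.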

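Experimental Results

Research Questions

RQ1: Does Fisher vector encoding of facial motion features improve depression severity regression compared with raw statistical features?
RQ2: How much do audio, visual, and text features improve depression scale regression when used individually and in combination?
RQ3: Does decision-level fusion, by averaging or taking the maximum of the predictions from multiple modalities, generalize better and yield lower error than single-modality models?
RQ4: How do different fusion weight configurations affect the final regression performance on the AVEC 2017 validation set?
RQ5: Can neural networks trained on Fisher vector features predict depression severity scores more accurately than the baseline model?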

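Key Findings

On the validation set, audio features lower RMSE by 17% relative to the baseline.
Video features lower RMSE by 24.5% relative to the baseline, supporting the effectiveness of the Fisher vector and head motion features.
Fusing audio and text features with equal weights (0.5 each) gives an RMSE of 5.593 and an MAE of 4.3714 on the development set.
Fusing Fisher vector and head pose features with equal weights gives an RMSE of 5.744 and an MAE of 4.3714, indicating solid performance on the visual modality.
Fusing all four modalities (audio, text, Fisher vector, head pose) with equal weights (0.25 each) gives an RMSE of 5.4143 and an MAE of 4.1714 on the validation set, the lowest MAE among the fusion results listed here.
Max-output fusion across all four modalities gives an RMSE of 5.3586 and an MAE of 4.3714, a slightly lower RMSE than average fusion but a higher MAE. The RMSE gain is attributed to more confident predictions improving robustness, although it does not carry over to MAE.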
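The two decision-level fusion rules and the two error metrics compared above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, and the predictions and ground-truth scores are made-up toy values, not results from DAIC-WOZ.

```python
import math

def fuse_average(predictions, weights=None):
    """Weighted average of per-modality predictions, one value per participant.

    `predictions` is a list of equal-length lists, one per modality.
    With `weights=None`, every modality gets the same weight.
    """
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    return [sum(w * p for w, p in zip(weights, column))
            for column in zip(*predictions)]

def fuse_max(predictions):
    """Keep the highest score predicted by any modality for each participant."""
    return [max(column) for column in zip(*predictions)]

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical depression scores for three participants (toy values).
y_true = [5.0, 12.0, 4.0]
audio_pred = [4.0, 10.0, 6.0]
text_pred = [6.0, 12.0, 2.0]

avg_pred = fuse_average([audio_pred, text_pred])   # equal weights, 0.5 each
max_pred = fuse_max([audio_pred, text_pred])

print(avg_pred)  # [5.0, 11.0, 4.0]
print(max_pred)  # [6.0, 12.0, 6.0]
print(rmse(y_true, avg_pred), mae(y_true, avg_pred))
print(rmse(y_true, max_pred), mae(y_true, max_pred))
```

Because RMSE penalizes large errors more heavily than MAE, one fusion rule can score better on RMSE and worse on MAE, so the two metrics are best read together.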

This review was generated by AI and reviewed by a human editor.