QUICK REVIEW

[Paper Review] Measuring Depression Symptom Severity from Spoken Language and 3D Facial Expressions

Albert Haque, Michelle Guo|arXiv (Cornell University)|Nov 21, 2018

Mental Health via Writing|References: 36|Cited by: 98

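ONE-SENTENCE SUMMARY

The paper develops a multimodal deep learning model that combines audio, 3D facial expressions, and text to predict PHQ scores and detect major depressive disorder (MDD), reaching on the DAIC-WOZ dataset an average error of 3.67 for PHQ regression and 83.3% sensitivity with 82.6% specificity for MDD detection.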

ABSTRACT

With more than 300 million people depressed worldwide, depression is a global problem. Due to access barriers such as social stigma, cost, and treatment availability, 60% of mentally-ill adults do not receive any mental health services. Effective and efficient diagnosis relies on detecting clinical symptoms of depression. Automatic detection of depressive symptoms would potentially improve diagnostic accuracy and availability, leading to faster intervention. In this work, we present a machine learning method for measuring the severity of depressive symptoms. Our multi-modal method uses 3D facial expressions and spoken language, commonly available from modern cell phones. It demonstrates an average error of 3.67 points (15.3% relative) on the clinically-validated Patient Health Questionnaire (PHQ) scale. For detecting major depressive disorder, our model demonstrates 83.3% sensitivity and 82.6% specificity. Overall, this paper shows how speech recognition, computer vision, and natural language processing can be combined to assist mental health patients and practitioners. This technology could be deployed to cell phones worldwide and facilitate low-cost universal access to mental health care.

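MOTIVATION AND OBJECTIVES

Enable scalable, accessible assessment of depression severity from cues in modalities that ordinary smartphones can capture.
Integrate audio, visual, and linguistic signals to predict PHQ scores and classify MDD.
Evaluate the proposed multimodal model against existing methods on a clinically validated dataset (DAIC-WOZ).
Show that sentence-level embeddings learned within the C-CNN framework can outperform some hand-crafted or pretrained embeddings.
Discuss limitations, bias considerations, and potential impact for real-world deployment.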

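PROPOSED METHOD

Input modalities are audio (log-mel spectrograms), 3D facial landmarks (68 points), and text transcripts.
A multimodal sentence-level embedding is learned and fed to a causal convolutional network (C-CNN) for regression (PHQ score) and classification (MDD).
The model uses a 10-layer causal convolutional network with kernel size 5 and 128 channels per layer, trained with dropout and the Adam optimizer.
Baselines include SVM and CNN+LSTM models, along with different modality combinations (A = audio, V = visual, L = linguistic, AVL = all three).
Ablation studies compare hand-crafted with learned sentence-level embeddings across input features (Log-Mel, MFCC, 3D faces, Word2Vec, Doc2Vec, universal sentence embeddings).
The dataset is DAIC-WOZ: 50 hours of data from 189 interviews (142 patients), evaluated with PHQ-8 scores, with a training/validation split of 107/35 patients.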
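
The causal convolution at the core of the C-CNN can be illustrated with a minimal, single-channel sketch in plain Python. This is not the authors' implementation: the function names are hypothetical, the real model uses 128 channels per layer and learned weights, and the receptive-field formula assumes undilated layers, which this summary does not confirm.

```python
from typing import List, Sequence


def causal_conv1d(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Single-channel causal convolution.

    output[t] = sum_j kernel[j] * x[t - j], so each output step sees only
    the current and earlier inputs, never future ones.
    """
    k = len(kernel)
    # Left-pad with zeros: output keeps the input length and stays causal.
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
        for t in range(len(x))
    ]


def stack_causal(x: Sequence[float], kernels: Sequence[Sequence[float]]) -> List[float]:
    """Apply several causal layers in sequence, with a ReLU after each one."""
    out = list(x)
    for kernel in kernels:
        out = [max(0.0, v) for v in causal_conv1d(out, kernel)]
    return out


def receptive_field(num_layers: int, kernel_size: int, dilation: int = 1) -> int:
    """Input steps visible to one output step (assumes equal dilation per layer)."""
    return 1 + num_layers * (kernel_size - 1) * dilation


if __name__ == "__main__":
    # Assumption: no dilation. 10 layers with kernel size 5 then span 41 steps.
    print(receptive_field(num_layers=10, kernel_size=5))
```

Under the no-dilation assumption, ten layers with kernel size 5 let each output step see 41 consecutive input steps. Changing a later input never alters an earlier output, which is the property that makes the network "causal".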

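EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a multimodal model (audio, 3D facial expressions, and text) accurately estimate depression severity as a PHQ score?
RQ2: How do sentence-level embeddings in the C-CNN compare with word-level or phoneme-level embeddings and with prior methods for depression analysis?
RQ3: How well does the model detect major depressive disorder on the DAIC-WOZ dataset, in terms of sensitivity and specificity?
RQ4: How do modality combinations (A, V, L, AVL) affect predictive performance?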

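KEY FINDINGS

For PHQ regression, the C-CNN with AVL modalities achieves an average error of 3.67 points (15.3% relative).
For MDD detection, the C-CNN with AVL modalities achieves 83.3% sensitivity and 82.6% specificity.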
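Compared with the baselines, the proposed multimodal C-CNN with learned sentence-level embeddings performs competitively while working from raw modalities rather than engineered features.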
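Ablation studies indicate that sentence-level embeddings learned within the model (via an LSTM or the C-CNN) outperform some hand-crafted or pretrained sentence embeddings.
The method does not depend on interview context and can process sentence-level inputs without contextual metadata.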
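
The reported metrics can be reproduced arithmetically with a short sketch. Several points are assumptions rather than details taken from the paper: "average error" is treated as mean absolute error, the relative figure is taken as the error divided by the 24-point PHQ-8 range (which is consistent with 3.67 / 24 ≈ 15.3%), and the confusion counts in the example are illustrative values chosen only because they match the reported percentages on a 35-patient split.

```python
from typing import Sequence, Tuple

PHQ8_MAX = 24  # PHQ-8 has eight items, each scored 0 to 3


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between true and predicted PHQ scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_error(abs_error: float, scale_max: float = PHQ8_MAX) -> float:
    """Error as a fraction of the scale range (assumed normalization)."""
    return abs_error / scale_max


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, 1 = MDD.

    Both classes must be present in y_true, otherwise a ratio is undefined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both positive and negative cases")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    print(f"{relative_error(3.67):.1%}")

    # Hypothetical counts: 12 depressed (10 detected), 23 non-depressed (19 cleared).
    y_true = [1] * 12 + [0] * 23
    y_pred = [1] * 10 + [0] * 2 + [0] * 19 + [1] * 4
    sens, spec = sensitivity_specificity(y_true, y_pred)
    print(f"sensitivity={sens:.1%} specificity={spec:.1%}")
```

Sensitivity is the share of truly depressed patients the model flags, and specificity is the share of non-depressed patients it correctly clears; reporting both matters because the classes in a screening dataset are usually imbalanced.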


This review was generated by AI and checked by human editors.