QUICK REVIEW

[Paper Review] Combining Acoustics, Content and Interaction Features to Find Hot Spots in Meetings

Dave Makhervaks, William Hinthorn|arXiv (Cornell University)|Oct 24, 2019

Topic Modeling|References 18|Cited by 3

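One-Sentence Summary

This paper proposes a machine learning approach that detects involvement hot spots in meetings by combining acoustic-prosodic, lexical (BERT-based), and interaction features. Experiments on the ICSI meeting corpus show that word embeddings are the most informative, with prosodic and interaction features adding incremental gains; with all features combined, the unweighted average recall (UAR) reaches 72.6%.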

ABSTRACT

Involvement hot spots have been proposed as a useful concept for meeting analysis and studied off and on for over 15 years. These are regions of meetings that are marked by high participant involvement, as judged by human annotators. However, prior work was either not conducted in a formal machine learning setting, or focused on only a subset of possible meeting features or downstream applications (such as summarization). In this paper we investigate to what extent various acoustic, linguistic and pragmatic aspects of the meetings, both in isolation and jointly, can help detect hot spots. In this context, the openSMILE toolkit is used to extract features based on acoustic-prosodic cues, BERT word embeddings are used for encoding the lexical content, and a variety of statistics based on speech activity are used to describe the verbal interaction among participants. In experiments on the annotated ICSI meeting corpus, we find that the lexical model is the most informative, with incremental contributions from interaction and acoustic-prosodic model components.

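Motivation and Objectives

Automatically detect regions of high participant involvement (hot spots) in meetings using machine learning
Assess the relative contributions of acoustic-prosodic, lexical content, and speaker interaction features to hot spot detection
Investigate feature fusion strategies that improve detection beyond any single feature set
Validate the approach on the ICSI meeting corpus, using human-annotated hot spot regions as the reference
Assess the effect of laughter, a strong cue that may not transfer to meetings in general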

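Proposed Method

Extract acoustic-prosodic features with the openSMILE toolkit to capture prosodic cues such as pitch and energy
Generate contextual word embeddings with BERT from automatic speech recognition (ASR) transcripts to capture lexical content
Compute interaction features from speech activity patterns, including the proportion of speaker overlap, the number of unique speakers, and the number of turn switches
Train a logistic regression model on the fused feature representation to classify sliding time windows as "hot spot" or "non-hot spot"
Use leave-one-out analysis to assess the importance and complementarity of each feature set
Evaluate performance by unweighted average recall (UAR) on a held-out test set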
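
To make the interaction features concrete, the following is a minimal sketch of how overlap proportion, unique speaker count, and turn switches could be computed for one window of speech activity. It is not the paper's implementation: the `Segment` structure, the function names, the 0.1-second sampling grid, and the window width and hop in the usage lines are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One stretch of speech activity (hypothetical structure, not from the paper)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def sliding_windows(total, width, hop):
    """Yield (start, end) windows that fit entirely inside [0, total]."""
    start = 0.0
    while start + width <= total:
        yield (start, start + width)
        start += hop


def interaction_features(segments, win_start, win_end, step=0.1):
    """Compute simple interaction statistics for one time window.

    The window is sampled on a grid of `step` seconds. This is an
    illustrative approximation, not the paper's exact procedure.
    """
    # Keep only segments that intersect the window.
    active = [s for s in segments if s.end > win_start and s.start < win_end]

    n_frames = max(1, round((win_end - win_start) / step))
    speech_frames = 0
    overlap_frames = 0
    for i in range(n_frames):
        t = win_start + (i + 0.5) * step  # frame midpoint
        talkers = {s.speaker for s in active if s.start <= t < s.end}
        if talkers:
            speech_frames += 1
        if len(talkers) > 1:
            overlap_frames += 1

    # Turn switches: consecutive segment onsets that belong to different speakers.
    ordered = sorted(active, key=lambda s: s.start)
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a.speaker != b.speaker)

    return {
        # Share of speech frames in which more than one person is talking.
        "overlap_ratio": overlap_frames / speech_frames if speech_frames else 0.0,
        "unique_speakers": len({s.speaker for s in active}),
        "turn_switches": switches,
    }


# Usage with made-up segments; window width and hop are arbitrary here.
segments = [Segment("A", 0, 4), Segment("B", 3, 6), Segment("A", 7, 9)]
for w_start, w_end in sliding_windows(total=10, width=10, hop=5):
    print(w_start, w_end, interaction_features(segments, w_start, w_end))
```

In a full pipeline, one such feature dictionary per window would be concatenated with the prosodic and lexical features before classification.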

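Experimental Results

Research Questions

RQ1: What roles do acoustic-prosodic, lexical, and interaction features each play in detecting meeting hot spots?
RQ2: To what extent are the three feature types complementary when fused in a machine learning model?
RQ3: How does including laughter as a feature affect detection performance, and does it transfer to meetings that are not informal?
RQ4: Do contextual word embeddings such as BERT outperform traditional approaches such as TF-IDF for involvement classification?
RQ5: Can a simple logistic regression model effectively combine multiple feature types for hot spot detection?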

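Key Findings

The lexical model with BERT word embeddings achieves the highest UAR of any single feature set (70.5%), substantially higher than TF-IDF (59.8%)
Acoustic-prosodic features alone yield a UAR of 62.0%, a moderate but meaningful contribution to hot spot detection
Interaction features (such as turn switch counts and speaker overlap) add incremental gains and reach a UAR of 66.6% on their own
The joint model combining all three feature types reaches a UAR of 72.6%, indicating that their contributions are complementary rather than redundant
Adding laughter features raises UAR to 77.5%, but laughter is considered less transferable to other meeting types such as business meetings
Leave-one-out analysis shows that removing the word embeddings causes the largest performance drop, confirming their dominant role in the fused model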
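
All figures above are reported as UAR, which is the mean of the per-class recalls, so the rarer "hot spot" class counts as much as the majority class. The sketch below shows that definition in plain Python; the function name and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally regardless of size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no labels given")

    hits = defaultdict(int)    # correct predictions per true class
    totals = defaultdict(int)  # number of examples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1

    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: 8 "non-hot spot" windows (0) and 2 "hot spot" windows (1).
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10

# Plain accuracy would be 0.8 here, but the hot spot class is never recalled.
print(unweighted_average_recall(y_true, always_negative))  # 0.5
```

A classifier that always predicts the majority class therefore scores 50% UAR on a two-class task, which is the chance level against which the reported percentages can be read.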

This review was generated by AI and reviewed by human editors.