QUICK REVIEW

[Paper Review] Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings

Weixin Liu, Bowen Qu|arXiv (Cornell University)|Mar 18, 2026
Cleft Lip and Palate Research | Cited by 0
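One-Sentence Summary

The paper proposes a two-stage method: it first learns a nasality-focused representation through supervised contrastive learning, then uses the frozen encoder with lightweight classifiers to screen robustly for velopharyngeal dysfunction in uncontrolled, real-world audio. It shows better out-of-domain performance than strong baselines.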

ABSTRACT

Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.

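Research Motivation and Goals

  • Enable scalable VPD screening outside high-income regions, where standardized recordings are scarce.
  • Address domain shift when speech-based VPD screening is deployed on consumer devices.
  • Propose a two-stage framework that learns a nasality-focused representation before clinical classification.
  • Evaluate in-domain and out-of-domain robustness under a fixed threshold, with no target-domain adaptation.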

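Proposed Method

  • Nasality representation pre-training with supervised contrastive learning (SupCon) on an auxiliary dataset, using phoneme alignments to form oral-context versus nasal-context supervision.
  • A sampling strategy that pairs same-speaker, same-vowel samples for contrastive learning, to suppress speaker and phonological confounds.
  • A Wav2Vec2-based encoder with layer fusion and partial unfreezing that projects the representation to a 256-dimensional embedding.
  • VPD screening with the encoder frozen: 256-dimensional embeddings of 0.5 s chunks are fed to lightweight classifiers (LR/SVM/MLP/XGBoost), and chunk probabilities are averaged per recording and compared against a fixed threshold.
  • Training and evaluation on in-domain clinical data and out-of-domain public Internet audio, with no adaptation to the target domain.
  • Comparison against baselines built on MFCC and on large pretrained speech representations, under the same evaluation protocol.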
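The two stages listed above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the temperature of 0.1, the threshold of 0.5, and the choice to drop a trailing partial chunk are all assumptions made here, and the Wav2Vec2 encoder and the classifier are left out entirely. The loss follows the standard supervised contrastive formulation, with labels standing in for oral-context (0) versus nasal-context (1) supervision.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss over one batch (sketch).

    embeddings: (N, D) array; labels: (N,) ints, e.g. 0 = oral, 1 = nasal.
    temperature=0.1 is an assumed value, not taken from the paper.
    Assumes at least one sample in the batch has a same-label partner.
    """
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    # Cosine similarities as logits; each anchor is excluded from its own row.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives: other samples in the batch that share the anchor's label.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_sum[has_pos] / n_pos[has_pos]).mean())

def chunk_waveform(wave, sample_rate, chunk_seconds=0.5):
    """Split a 1-D waveform into fixed-length chunks (trailing remainder dropped)."""
    size = int(round(sample_rate * chunk_seconds))
    n_chunks = len(wave) // size
    return wave[: n_chunks * size].reshape(n_chunks, size)

def recording_decision(chunk_probs, threshold=0.5):
    """Average chunk-level VPD probabilities and apply one fixed threshold."""
    score = float(np.mean(chunk_probs))
    return score, int(score >= threshold)
```

In use, each row returned by `chunk_waveform` would pass through the frozen encoder and a lightweight classifier to yield one probability per chunk, and `recording_decision` would turn those probabilities into a single screening label. The threshold stays the same for in-domain and out-of-domain recordings, which is what makes the comparison a test of robustness and not of per-domain tuning.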

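Experimental Results

Research Questions

  • RQ1: Does a nasality representation learned with SupCon make VPD screening more robust to domain shift?
  • RQ2: How does the SupCon nasality representation compare with MFCC and large pretrained models on in-domain and out-of-domain data?
  • RQ3: Under a fixed decision threshold, can a frozen encoder with lightweight classifiers reach the best out-of-domain performance?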

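Key Findings

  • In-domain, the SupCon nasality method achieves perfect recording-level screening (accuracy and macro-F1 both 1.000).
  • Out-of-domain, the SupCon nasality method reaches a macro-F1 of 0.679 and an accuracy of 0.695, exceeding the strongest baseline by 0.067 in macro-F1 and 0.054 in accuracy.
  • MFCC+SVM remains a strong out-of-domain baseline (macro-F1 0.612, accuracy 0.641), whereas large pretrained representations degrade under domain shift.
  • Relative to all baselines, SupCon nasality embeddings perform robustly when paired with any of LR, SVM, MLP, or XGBoost; among the SupCon variants, MLP attains the highest out-of-domain accuracy.
  • UMAP visualization shows partial separation between oral-core segments and more strongly nasal segments across vowels, suggesting that the nasality representation captures meaningful production-related structure.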
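For readers less familiar with the reported metrics, the short sketch below shows how macro-F1 and accuracy are computed from recording-level labels, and rechecks the reported margins over the MFCC baseline. The toy labels are invented for illustration and are not data from the paper; the function names are likewise hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of recordings labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (illustrative only): 1 = VPD, 0 = control.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_macro_f1 = macro_f1(toy_true, toy_pred)
toy_accuracy = accuracy(toy_true, toy_pred)

# Margins over the strongest baseline, from the figures reported above.
macro_f1_gain = round(0.679 - 0.612, 3)
accuracy_gain = round(0.695 - 0.641, 3)
```

Macro-F1 weights both classes equally regardless of how many recordings each contains, so it is the more informative of the two numbers when the classes are imbalanced.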

This review was generated by AI and checked by a human editor.