QUICK REVIEW

[Paper Review] Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings

Weixin Liu, Bowen Qu|arXiv (Cornell University)|Mar 18, 2026

Cleft Lip and Palate Research | Cited by: 0

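One-Sentence Summary

The paper proposes a two-stage method: it first learns a nasality-focused representation through supervised contrastive learning, then uses the frozen encoder with lightweight classifiers to screen robustly for velopharyngeal dysfunction in uncontrolled real-world audio. It shows better out-of-domain performance than strong baselines.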

ABSTRACT

Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.

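Motivation and Objectives

Enable scalable VPD screening outside high-income regions, where standardized recordings are scarce.
Address domain shift when speech-based VPD screening is deployed on consumer devices.
Propose a two-stage framework that learns a nasality-focused representation before clinical classification.
Evaluate in-domain and out-of-domain robustness under a fixed threshold, with no target-domain adaptation.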

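Proposed Method

Nasality representation pre-training with supervised contrastive learning (SupCon) on an auxiliary dataset, using phoneme alignments to form oral-context versus nasal-context supervision.
A sampling strategy that pairs samples from the same speaker and the same vowel for contrastive learning, suppressing speaker and phonological confounds.
A Wav2Vec2-based encoder with layer fusion and partial unfreezing, projecting the representation to a 256-dimensional embedding.
VPD screening with the encoder frozen: 256-dimensional embeddings of 0.5 s chunks are fed to lightweight classifiers (LR/SVM/MLP/XGBoost), and chunk probabilities are averaged at the recording level and compared against a fixed threshold.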
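Training and evaluation cover in-domain clinical data and out-of-domain public Internet audio, with no adaptation to the target domain.
Comparison against baselines built on MFCC and large pretrained speech representations, under the same evaluation protocol.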
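
The contrastive pre-training step in the list above can be illustrated with a minimal NumPy sketch of the standard supervised contrastive loss (the "L_out" form from Khosla et al., 2020). This is not the authors' implementation: the function name `supcon_loss`, the temperature of 0.1, and the toy batch are assumptions for illustration, and the summary does not state which loss variant, temperature, or batch construction the paper uses. Labels stand for oral context (0) versus nasal context (1).

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    embeddings: (n, d) array; labels: (n,) array of class ids.
    temperature=0.1 is an assumed value, not taken from the paper.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    # L2-normalise so dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Pairwise similarities; the anchor itself is excluded from the softmax.
    logits = np.where(not_self, z @ z.T / temperature, -np.inf)
    # Numerically stable log-softmax over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denom
    # Positives share the anchor's label and are not the anchor itself.
    positives = (y[:, None] == y[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        raise ValueError("batch contains no positive pairs")
    # Mean log-probability of the positives per anchor; anchors without positives are skipped.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).mean())

# Toy batch (hypothetical): two tight clusters in 2-D.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
aligned = supcon_loss(toy, [0, 0, 1, 1])    # labels follow the clusters
shuffled = supcon_loss(toy, [0, 1, 0, 1])   # labels cut across the clusters
print(aligned < shuffled)
```

The loss is small when same-label embeddings sit close together and grows when labels cut across clusters. The same-speaker, same-vowel pairing described above would be handled when batches are assembled, which this sketch does not attempt.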

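Experimental Results

Research Questions

RQ1: Does a nasality representation learned with SupCon make VPD screening more robust to domain shift?
RQ2: How does the SupCon nasality representation compare with MFCC and large pretrained models on in-domain and out-of-domain data?
RQ3: Under a fixed decision threshold, can a frozen encoder with lightweight classifiers achieve the best out-of-domain performance?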

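Key Findings

In-domain, the SupCon nasality method achieves perfect recording-level screening (accuracy and macro-F1 both 1.000).
Out-of-domain, the SupCon nasality method reaches a macro-F1 of 0.679 and an accuracy of 0.695, exceeding the strongest baseline by 0.067 macro-F1 and 0.054 accuracy.
MFCC+SVM remains a strong out-of-domain baseline (macro-F1 0.612, accuracy 0.641), while large pretrained representations degrade under domain shift.
Relative to all baselines, the SupCon nasality embeddings perform robustly with any of LR, SVM, MLP, or XGBoost; among the SupCon variants, MLP achieves the highest out-of-domain accuracy.
UMAP visualization shows partial separation between oral nuclei and more strongly nasalized segments across vowels, suggesting that the nasality representation captures meaningful production-related structure.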
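
The recording-level figures above depend on two steps: averaging per-chunk probabilities into one decision per recording, and scoring those decisions with accuracy and macro-F1. The sketch below shows both under stated assumptions. The threshold of 0.5, the helper names `recording_decision` and `macro_f1`, and the chunk probabilities are hypothetical; the summary says only that the threshold is fixed, and none of the numbers below come from the paper.

```python
import numpy as np

def recording_decision(chunk_probs, threshold=0.5):
    """Average per-chunk VPD probabilities and apply a fixed threshold.

    threshold=0.5 is an assumed value; the source does not state it.
    Returns (predicted label, recording-level score).
    """
    score = float(np.mean(chunk_probs))
    return int(score >= threshold), score

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Hypothetical per-chunk probabilities for four recordings (not data from the paper).
recordings = {
    "rec_a": [0.9, 0.8, 0.7],
    "rec_b": [0.2, 0.4, 0.3],
    "rec_c": [0.6, 0.3, 0.2],
    "rec_d": [0.1, 0.2, 0.1],
}
y_true = [1, 0, 1, 0]
y_pred = [recording_decision(p)[0] for p in recordings.values()]
accuracy = float(np.mean(np.array(y_pred) == np.array(y_true)))
print(y_pred, accuracy, macro_f1(y_true, y_pred))
```

Because macro-F1 weights both classes equally, a missed positive such as `rec_c` lowers it more than it lowers accuracy, which is one reason to report both metrics for screening.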

This review was generated by AI and checked by a human editor.