QUICK REVIEW

[Paper Review] The PartialSpoof Database and Countermeasures for the Detection of Short Fake Speech Segments Embedded in an Utterance

Lin Zhang, Xin Wang|arXiv (Cornell University)|Apr 11, 2022

Speech Recognition and Synthesis | Cited by 4

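One-sentence summary

This paper presents the PartialSpoof database and a new countermeasure (CM) for detecting short spoofed speech segments embedded in bona fide utterances, a new spoofing scenario called "Partial Spoof" (PS). The CM uses self-supervised learning (SSL) models as enhanced feature extractors and is trained jointly on segment-level labels at multiple temporal resolutions (20 to 640 ms) and on utterance-level labels, reaching state-of-the-art utterance-level equal error rates (EERs) of 0.77% on PartialSpoof (PS) and 0.90% on ASVspoof 2019 logical access (LA).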

ABSTRACT

Automatic speaker verification is susceptible to various manipulations and spoofing, such as text-to-speech synthesis, voice conversion, replay, tampering, adversarial attacks, and so on. We consider a new spoofing scenario called "Partial Spoof" (PS) in which synthesized or transformed speech segments are embedded into a bona fide utterance. While existing countermeasures (CMs) can detect fully spoofed utterances, there is a need for their adaptation or extension to the PS scenario. We propose various improvements to construct a significantly more accurate CM that can detect and locate short-generated spoofed speech segments at finer temporal resolutions. First, we introduce newly developed self-supervised pre-trained models as enhanced feature extractors. Second, we extend our PartialSpoof database by adding segment labels for various temporal resolutions. Since the short spoofed speech segments to be embedded by attackers are of variable length, six different temporal resolutions are considered, ranging from as short as 20 ms to as large as 640 ms. Third, we propose a new CM that enables the simultaneous use of the segment-level labels at different temporal resolutions as well as utterance-level labels to execute utterance- and segment-level detection at the same time. We also show that the proposed CM is capable of detecting spoofing at the utterance level with low error rates in the PS scenario as well as in a related logical access (LA) scenario. The equal error rates of utterance-level detection on the PartialSpoof database and ASVspoof 2019 LA database were 0.77 and 0.90%, respectively.

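Motivation and objectives

Address a new spoofing threat, "Partial Spoof", in which only short segments of speech are synthesized or converted and then embedded into bona fide speech.
Develop a countermeasure that can detect and locate these short spoofed segments at fine temporal resolutions.
Extend the PartialSpoof database with segment-level labels at six temporal resolutions (20 to 640 ms) to support research on fine-grained spoofing detection.
Improve detection performance by using enhanced SSL-based feature extractors and by training jointly on utterance-level labels and segment-level labels at multiple resolutions.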
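The six label resolutions mentioned above can be illustrated with a small sketch. It assumes, purely for illustration, that labels are stored per 20 ms frame with 1 meaning spoofed and 0 meaning bona fide, and that a coarser segment counts as spoofed if any 20 ms frame inside it is spoofed. The function names and the labeling convention are hypothetical and are not taken from the PartialSpoof release.

```python
from typing import Dict, List

BASE_MS = 20  # finest resolution considered in the paper
RESOLUTIONS_MS = [20, 40, 80, 160, 320, 640]


def coarsen_labels(frame_labels: List[int], factor: int) -> List[int]:
    """Group consecutive 20 ms labels into blocks of `factor` frames.

    Assumed rule: a block is spoofed (1) if any frame in it is spoofed.
    """
    return [
        int(any(frame_labels[i:i + factor]))
        for i in range(0, len(frame_labels), factor)
    ]


def multi_resolution_labels(frame_labels: List[int]) -> Dict[int, List[int]]:
    """Return one label sequence per temporal resolution (in ms)."""
    return {
        res: coarsen_labels(frame_labels, res // BASE_MS)
        for res in RESOLUTIONS_MS
    }


def utterance_label(frame_labels: List[int]) -> int:
    """An utterance is spoofed (1) if it contains any spoofed frame."""
    return int(any(frame_labels))


# Eight 20 ms frames (160 ms of audio) with one spoofed frame.
example = [0, 0, 1, 0, 0, 0, 0, 0]
labels = multi_resolution_labels(example)
```

In this toy example a single spoofed 20 ms frame marks exactly one segment as spoofed at every resolution, which shows why coarser labels say less about where the spoofed region actually sits.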

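Proposed method

A new deep-learning-based countermeasure performs utterance-level and segment-level spoofing detection simultaneously using multi-resolution labels.
Self-supervised pre-trained models (e.g., wav2vec 2.0, including W2V2-Large, and HuBERT) serve as enhanced front-ends to improve representation learning.
A multi-resolution training strategy trains a single model on segment-level labels at 20, 40, 80, 160, 320, and 640 ms, enabling fine-grained localization.
A neural architecture aggregates predictions across resolutions and uses both utterance-level and segment-level supervision during training.
Cross-entropy loss is combined with a contrastive learning objective to improve robustness and generalization.
Data augmentation and class-balancing techniques are applied to handle the imbalanced distribution of spoofed data.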
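The idea of supervising one model with utterance-level and multi-resolution segment-level labels at once can be sketched as a weighted sum of per-branch losses. This is a minimal illustration under stated assumptions, not the authors' training objective: it uses plain binary cross-entropy only, equal branch weights by default, and hypothetical names (`joint_loss`, `segment_probs`, `utt_prob`).

```python
import math
from typing import Dict, List, Optional


def binary_cross_entropy(probs: List[float], labels: List[int]) -> float:
    """Mean binary cross-entropy; probs are predicted P(spoof) in [0, 1]."""
    eps = 1e-12  # keeps log() finite for probabilities of exactly 0 or 1
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def joint_loss(
    segment_probs: Dict[int, List[float]],
    segment_labels: Dict[int, List[int]],
    utt_prob: float,
    utt_label: int,
    weights: Optional[Dict[int, float]] = None,
) -> float:
    """Utterance-level loss plus a weighted loss per resolution branch.

    Dictionary keys are temporal resolutions in ms (e.g. 20, 40, ...).
    Equal weights are an assumption made for this sketch.
    """
    if weights is None:
        weights = {res: 1.0 for res in segment_probs}
    loss = binary_cross_entropy([utt_prob], [utt_label])
    for res, probs in segment_probs.items():
        loss += weights[res] * binary_cross_entropy(probs, segment_labels[res])
    return loss


# Toy case: two resolution branches, all predictions uninformative (0.5).
probs = {20: [0.5, 0.5, 0.5, 0.5], 40: [0.5, 0.5]}
labels = {20: [0, 1, 0, 0], 40: [1, 0]}
loss = joint_loss(probs, labels, utt_prob=0.5, utt_label=1)
```

With every prediction at 0.5, each of the three branches contributes ln 2, so the toy loss is 3 ln 2; sharper correct predictions in any branch lower the total.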

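Experimental results

Research questions

RQ1: Can a countermeasure trained at multiple temporal resolutions detect short spoofed segments more accurately than existing methods?
RQ2: In the Partial Spoof scenario, how does jointly integrating segment-level and utterance-level labels improve detection performance?
RQ3: To what extent do self-supervised pre-trained models improve spoofing detection in low-resource or fine-grained settings?
RQ4: How well does the model generalize across spoofing systems, especially unseen ones?
RQ5: What are the main causes of the performance gap between the development and evaluation sets in segment-level detection?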

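Key findings

The proposed countermeasure achieves an EER of 0.77% on the PartialSpoof evaluation set, a new state of the art for utterance-level detection in the PS scenario.
On the ASVspoof 2019 LA database, the model achieves an EER of 0.90%, showing strong generalization and outperforming prior methods in cross-scenario evaluation.
The gap between the development and evaluation sets in segment-level detection stems mainly from more challenging spoofing systems (such as A15) and from fewer concatenation boundaries in the evaluation segments.
Leave-one-out ablation shows that the EER rises markedly when certain unseen spoofing systems (A15 in particular) are removed, indicating that these attacks are very strong.
Multi-resolution training yields a clear gain, and performance improves further when the temporal resolution of the segment-level labels matches that of the target detection task.
Cross-scenario training shows that using PartialSpoof data improves performance in the LA scenario, indicating that the new database is practically useful as complementary data.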
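Since the findings above are all reported as equal error rates, a minimal sketch of how an EER can be estimated from detection scores may help. It assumes that a higher score means "more likely bona fide" and picks the threshold at which the false acceptance and false rejection rates are closest. The function name `compute_eer` and the toy scores are illustrative only; evaluation toolkits typically interpolate between thresholds instead.

```python
from typing import List


def compute_eer(bonafide_scores: List[float], spoof_scores: List[float]) -> float:
    """Estimate the equal error rate by sweeping over observed scores.

    A trial is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = 1.0
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy scores: one bona fide trial and one spoofed trial are misranked.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = compute_eer(bonafide, spoof)
```

For these toy scores the two error rates meet at a threshold of 0.6, where one of four bona fide trials is rejected and one of four spoofed trials is accepted, giving an EER of 25%.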


This review was generated by AI and reviewed by a human editor.