QUICK REVIEW

[Paper Review] Unified Hypersphere Embedding for Speaker Recognition

Mahdi Hajibabaei, Dengxin Dai | arXiv (Cornell University) | Jul 22, 2018

Speech Recognition and Synthesis | Cited by 51

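One-Sentence Summary

This paper proposes a unified hypersphere embedding framework for text-independent speaker recognition. It combines data augmentation, embedding-dimension tuning, and a novel logistic margin loss to improve identification and verification performance without extra data or deeper models.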

ABSTRACT

Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging dataset and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without use of extra data or deeper and more complex models by augmenting the training and testing data, finding the optimal dimensionality of embedding space and use of more discriminative loss functions. Results of experiments on VoxCeleb dataset suggest that: (i) Simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%. (ii) Lower dimensional embeddings are more suitable for verification. (iii) Use of proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.

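Motivation and Objectives

Improve speaker identification and verification accuracy without extra data or deeper models.
Explore data augmentation techniques that can be applied at both training and test time.
Determine the optimal embedding dimensionality for the verification and identification tasks.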

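Proposed Method

Extract STFT-based features from 3-second crops, and augment the data by extending utterances through repetition or time-reversal.
Use ResNet-20 as the embedding network, producing 512-dimensional embeddings.
Train with several discriminative loss functions, including Softmax, A-Softmax, and AM-Softmax, alongside the proposed logistic margin loss.
Evaluate the embeddings on VoxCeleb using Top-1/Top-5 accuracy for identification and EER/Cdet for verification.
Compare embedding dimensionalities (64–512) to assess the trade-off between identification and verification performance.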
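
The repetition and time-reversal augmentation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the utterance is a plain Python list of waveform samples and that reversal is applied to the waveform before feature extraction. The names `extend_utterance`, `random_crop`, and `reverse_prob` are hypothetical.

```python
import random


def extend_utterance(samples, target_len, reverse_prob=0.5, rng=random):
    """Extend a short utterance to target_len samples by repeating it.

    Assumption: `samples` is a list of waveform samples. With probability
    `reverse_prob` the utterance is time-reversed before being repeated.
    """
    if not samples:
        raise ValueError("utterance must contain at least one sample")
    # Random time-reversal: play the waveform backwards.
    if rng.random() < reverse_prob:
        samples = samples[::-1]
    # Repetition: tile the utterance until it covers target_len, then truncate.
    repeats = -(-target_len // len(samples))  # ceiling division
    return (samples * repeats)[:target_len]


def random_crop(samples, crop_len, rng=random):
    """Take a random fixed-length crop (e.g. 3 seconds' worth of samples)."""
    if len(samples) < crop_len:
        raise ValueError("utterance is shorter than the crop; extend it first")
    start = rng.randint(0, len(samples) - crop_len)
    return samples[start:start + crop_len]
```

In a pipeline like the one summarized here, a short utterance would first be extended with `extend_utterance`, then cropped with `random_crop` before the STFT features are computed. Lists are used so that `samples * repeats` tiles the signal; with NumPy arrays the same expression would multiply element-wise, so a tiling function would be needed instead.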

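Experimental Results

Research Questions

RQ1: Does augmentation through repetition and time-reversal improve identification and verification without extra data?
RQ2: What is the optimal embedding dimensionality for speaker verification and identification?
RQ3: For this architecture, which discriminative loss function yields the best identification and verification performance?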

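Key Findings

Augmentation applied at both training and test time reduces prediction errors by up to 18%.
Lower embedding dimensionalities (e.g., 64–128) favor verification, while 256–512 dimensions are better for identification.
The logistic margin loss with independent class scale and bias terms achieves the best identification accuracy (especially with 512-dimensional embeddings) and remains competitive on verification.
Dropout generally improves verification accuracy across loss functions; in this study, AM-Softmax with dropout stands out on verification.
Compared with other VoxCeleb baselines, the proposed logistic margin approach often matches or exceeds their identification performance while maintaining strong verification results.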
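
To make the margin-based losses in these findings more concrete, here is a minimal sketch of the standard AM-Softmax loss for a single example. It illustrates one of the baseline losses, not the proposed logistic margin loss, whose exact form is not given in this summary. The function names are hypothetical, and the defaults `scale=30.0` and `margin=0.35` are illustrative values rather than settings reported for this paper.

```python
import math


def l2_normalize(vec):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def am_softmax_loss(embedding, class_weights, label, scale=30.0, margin=0.35):
    """AM-Softmax loss for one example.

    Both the embedding and each class weight vector are L2-normalized, so
    every logit is a cosine similarity. The margin is subtracted from the
    target-class cosine only, then all logits are multiplied by `scale`.
    """
    e = l2_normalize(embedding)
    cosines = [
        sum(a * b for a, b in zip(e, l2_normalize(w))) for w in class_weights
    ]
    logits = [
        scale * (c - margin) if j == label else scale * c
        for j, c in enumerate(cosines)
    ]
    # Numerically stable cross-entropy: log-sum-exp minus the target logit.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_z - logits[label]
```

With `margin=0.0` this reduces to a softmax cross-entropy over scaled cosine similarities. A positive margin raises the loss for the same embedding, which pushes the target-class cosine to exceed the others by at least that margin.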


This review was generated by AI and reviewed by human editors.