QUICK REVIEW

[Paper Review] Few Shot Speaker Recognition using Deep Neural Networks

Prashant Anand, Ajeet Kumar Singh|arXiv (Cornell University)|Apr 17, 2019

Speech Recognition and Synthesis|References: 20|Cited by: 34

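ONE-SENTENCE SUMMARY

This paper proposes few-shot speaker recognition using CNNs and CapsuleNet trained with a prototypical loss, plus an autoencoder that maps capsule class vectors into a generalized embedding space, and evaluates the approach on the VoxCeleb1 and VCTK datasets with very short 3-second utterances.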

ABSTRACT

The recent advances in deep learning are mostly driven by availability of large amount of training data. However, availability of such data is not always possible for specific tasks such as speaker recognition where collection of large amount of data is not possible in practical scenarios. Therefore, in this paper, we propose to identify speakers by learning from only a few training examples. To achieve this, we use a deep neural network with prototypical loss where the input to the network is a spectrogram. For output, we project the class feature vectors into a common embedding space, followed by classification. Further, we show the effectiveness of capsule net in a few shot learning setting. To this end, we utilize an auto-encoder to learn generalized feature embeddings from class-specific embeddings obtained from capsule network. We provide exhaustive experiments on publicly available datasets and competitive baselines, demonstrating the superiority and generalization ability of the proposed few shot learning pipelines.

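MOTIVATION AND OBJECTIVES

Make speaker recognition practical when training data is very limited and utterances are extremely short.
Propose a few-shot learning pipeline that takes spectrogram input and trains with a prototypical loss.
Evaluate CNN and Capsule Network approaches, using an autoencoder to generalize to unseen speakers.
Show that the prototypical loss improves few-shot performance across architectures.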

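PROPOSED METHOD

Convert audio to a single-channel, 16 kHz, 16-bit stream and compute a 128x300 spectrogram for each 3-second utterance.
Use CNN baselines (VGG-M, ResNet-34) and a modified Capsule Network (CapsuleNet-M) as feature extractors.
Extend CapsuleNet with an autoencoder to produce generalized embeddings suitable for the prototypical loss.
Apply the prototypical loss to learn class prototypes in the embedding space for few-shot classification.
In the few-shot setting, add a contractive autoencoder that generates embeddings from the capsule class vectors (CapsuleNet-MA).
Train end to end and evaluate under 5-way and 20-way, 1-shot and 5-shot conditions.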
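
The prototypical loss step can be illustrated with a minimal NumPy sketch. It assumes the standard formulation: a class prototype is the mean of that class's support embeddings, and queries are scored with a softmax over negative squared Euclidean distances to the prototypes. The function name, argument names, and the choice of squared Euclidean distance are assumptions made for illustration and are not taken from the paper; the embeddings would come from whichever feature extractor is used (VGG-M, ResNet-34, or CapsuleNet-MA). The sketch computes only the forward loss, so real training would need an autodiff framework.

```python
import numpy as np

def prototypical_loss(support, support_labels, query, query_labels):
    """Forward pass of a prototypical loss (illustrative, hypothetical helper).

    support: (S, D) support embeddings, support_labels: (S,) class ids
    query:   (Q, D) query embeddings,   query_labels:   (Q,) class ids
    Returns (mean negative log-likelihood, predicted class id per query).
    """
    classes = np.unique(support_labels)  # sorted class ids in this episode
    # Prototype of a class = mean of its support embeddings -> (N, D).
    prototypes = np.stack(
        [support[support_labels == c].mean(axis=0) for c in classes]
    )
    # Squared Euclidean distance from every query to every prototype -> (Q, N).
    diffs = query[:, None, :] - prototypes[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    # Numerically stable log-softmax over negative distances.
    logits = -sq_dists
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Position of each query's true class inside `classes`.
    target_idx = np.searchsorted(classes, query_labels)
    loss = -log_probs[np.arange(len(query)), target_idx].mean()
    predictions = classes[log_probs.argmax(axis=1)]
    return float(loss), predictions

# Toy 2-way 1-shot episode with hand-made 2-D embeddings.
support = np.array([[0.0, 0.0], [10.0, 10.0]])
support_labels = np.array([0, 1])
query = np.array([[0.1, 0.0], [9.9, 10.0]])
query_labels = np.array([0, 1])
loss, predictions = prototypical_loss(support, support_labels, query, query_labels)
print(loss, predictions)
```

Because each query is scored only against prototypes built from the current episode, the same function applies unchanged to speakers that never appeared during training, which is what makes this loss a natural fit for the few-shot setting described above.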

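EXPERIMENTAL RESULTS

Research Questions

RQ1: Can few-shot learning achieve accurate speaker recognition with 3-second utterances?
RQ2: How do CNN and Capsule Network approaches compare under few-shot conditions?
RQ3: Does mapping capsule-derived class vectors through an autoencoder help generalization to unseen speakers?
RQ4: Does the prototypical loss improve few-shot speaker recognition across architectures?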

Key Findings

On the standard (non-few-shot) VoxCeleb1 subsets, ResNet-34 clearly outperforms the other networks, with 90.37% Top-1 and 98.13% Top-5 accuracy at 50 classes, and 71.48% Top-1 and 88.45% Top-5 at 200 classes.
In the 5-way few-shot setting on VoxCeleb1, ResNet-34 reaches 79.97% (1-shot) and 91.50% (5-shot), CapsuleNet-MA reaches 53.62% (1-shot) and 82.93% (5-shot), and VGG-M reaches 52.42% (1-shot) and 82.10% (5-shot).
CapsuleNet-MA outperforms VGG-M in several few-shot settings and comes close to ResNet with fewer parameters; the standard CapsuleNet (CapsuleNet-M) trails ResNet but remains competitive with VGG-M.
On the VCTK corpus, the non-few-shot results are 91.95% Top-1 and 98.13% Top-5 for CapsuleNet-M, 95.25% Top-1 and 99.45% Top-5 for VGG-M, and 96.91% Top-1 and 99.91% Top-5 for ResNet-34.
In the few-shot VCTK setting, CapsuleNet-MA reaches 65.26% (5-way 1-shot) and 91.28% (5-way 5-shot), while ResNet-34 reaches 80.96% (5-way 1-shot) and 96.46% (5-way 5-shot).
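
For readers unfamiliar with how N-way K-shot figures like those above are typically obtained, the following sketch shows one common protocol: sample many episodes, classify each query by its nearest class prototype, and average the accuracy. This summary does not state the number of episodes, the number of queries per class, or the distance measure, so `n_query`, `n_episodes`, `seed`, the squared Euclidean distance, and all function names are assumptions for illustration. The embeddings are taken as precomputed by some trained network.

```python
import numpy as np

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Pick n_way classes, then k_shot support and n_query query indices per class."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_idx, query_idx = [], []
    for c in classes:
        # Shuffle this class's sample indices and keep just enough of them.
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + n_query]
        support_idx.extend(idx[:k_shot])
        query_idx.extend(idx[k_shot:])
    return np.array(support_idx), np.array(query_idx)

def episode_accuracy(embeddings, labels, support_idx, query_idx):
    """Classify each query by its nearest class prototype and return the accuracy."""
    s_emb, s_lab = embeddings[support_idx], labels[support_idx]
    q_emb, q_lab = embeddings[query_idx], labels[query_idx]
    classes = np.unique(s_lab)
    # Prototype of a class = mean of its support embeddings.
    prototypes = np.stack([s_emb[s_lab == c].mean(axis=0) for c in classes])
    # Squared Euclidean distances, shape (num_queries, n_way).
    sq_dists = ((q_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    predictions = classes[sq_dists.argmin(axis=1)]
    return float((predictions == q_lab).mean())

def few_shot_accuracy(embeddings, labels, n_way, k_shot,
                      n_query=5, n_episodes=100, seed=0):
    """Average nearest-prototype accuracy over randomly sampled episodes."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_episodes):
        s_idx, q_idx = sample_episode(labels, n_way, k_shot, n_query, rng)
        accuracies.append(episode_accuracy(embeddings, labels, s_idx, q_idx))
    return float(np.mean(accuracies))

# Synthetic, well-separated embeddings: 10 "speakers", 10 utterances each.
demo_rng = np.random.default_rng(0)
demo_labels = np.repeat(np.arange(10), 10)
demo_embeddings = np.eye(10)[demo_labels] * 10 + 0.01 * demo_rng.normal(size=(100, 10))
print(few_shot_accuracy(demo_embeddings, demo_labels, n_way=5, k_shot=1))
```

Each class needs at least `k_shot + n_query` utterances for an episode to be full-sized. With 1-shot episodes the prototype is a single embedding, which is one reason 1-shot accuracy is usually well below 5-shot accuracy, as in the results above.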


This review was generated by AI and checked by a human editor.