QUICK REVIEW

[Paper Review] Speaker Verification using Convolutional Neural Networks

Hossein Salehghaffari|arXiv (Cornell University)|Mar 14, 2018

Speech Recognition and Synthesis|20 references|20 citations

TL;DR

This paper proposes a novel end-to-end speaker verification system using a Siamese Convolutional Neural Network (CNN) architecture trained on MFCC features to jointly learn speaker-specific and within-speaker invariant representations. By fine-tuning a pre-trained background model via Siamese learning with an effective pair selection strategy, the method achieves a 10.5% Equal Error Rate (EER) on the VoxCeleb dataset, outperforming traditional GMM-UBM and i-vector baselines.

ABSTRACT

In this paper, a novel Convolutional Neural Network architecture has been developed for speaker verification in order to simultaneously capture and discard speaker and non-speaker information, respectively. In the training phase, the network is trained to distinguish between different speaker identities for creating the background model. One of the crucial parts is to create the speaker models. Most of the previous approaches create speaker models based on averaging the speaker representations provided by the background model. We overcome this problem by further fine-tuning the trained model using the Siamese framework for generating a discriminative feature space to distinguish between same and different speakers regardless of their identity. This provides a mechanism which simultaneously captures the speaker-related information and creates robustness to within-speaker variations. It is demonstrated that the proposed method outperforms the traditional verification methods which create speaker models directly from the background model.

Motivation & Objective

To improve text-independent speaker verification by learning discriminative speaker representations that capture inter-speaker differences while being robust to intra-speaker variations.
To overcome the limitation of traditional methods that rely on averaging background model outputs for speaker model creation.
To develop an end-to-end trainable system that jointly optimizes for speaker discrimination and robustness using Siamese learning.
To investigate the impact of active pair selection on Siamese network training for improved verification performance.
To demonstrate that fine-tuning a pre-trained CNN via Siamese learning yields better speaker embeddings than standard feature averaging.

Proposed method

A two-stream Siamese CNN architecture is trained to compare pairs of utterances, learning a shared embedding space where same-speaker pairs are close and different-speaker pairs are far apart.
The network is first pre-trained as a speaker-identity classifier (the background model) using cross-entropy loss, then fine-tuned using a contrastive loss function with a margin M.
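The contrastive loss is defined as: $ L_W = \frac{1}{N} \sum_{i=1}^N \left[ Y_i \cdot \frac{1}{2} D_{W,i}^2 + (1-Y_i) \cdot \frac{1}{2} \max\{0, M - D_{W,i}\}^2 \right] + \lambda \|W\|_2 $, where $ N $ is the number of pairs, $ D_{W,i} $ is the L2 distance between the two embeddings of pair $ i $, $ Y_i = 1 $ for a same-speaker pair and $ 0 $ otherwise, $ M $ is the margin, and $ \lambda $ weights the regularization on the network weights $ W $.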
The Siamese model is trained with an initial learning rate of 0.00001 for 20 epochs, without freezing any layers during fine-tuning.
Speaker models are created by averaging the final embeddings of each speaker’s utterances, and cosine similarity is used for scoring during evaluation.
An active pair selection method is employed to improve training efficiency and performance by prioritizing hard negative pairs.
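
The loss, speaker-model averaging, and cosine scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the weight-regularization term is omitted, and the hard-negative rule (keep different-speaker pairs whose distance is still inside the margin) is an assumed stand-in for the paper's pair selection strategy, which this review does not specify.

```python
import numpy as np

def pair_distances(emb_a, emb_b):
    """L2 distance D_W between the two embeddings of each pair."""
    return np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b), axis=1)

def contrastive_loss(emb_a, emb_b, same_speaker, margin=1.0):
    """Mean contrastive loss over a batch of pairs (regularizer omitted)."""
    dist = pair_distances(emb_a, emb_b)
    y = np.asarray(same_speaker, dtype=float)  # 1 = same speaker, 0 = different
    genuine = 0.5 * dist ** 2                                # pulls same-speaker pairs together
    impostor = 0.5 * np.maximum(0.0, margin - dist) ** 2     # pushes different-speaker pairs past the margin
    return float(np.mean(y * genuine + (1.0 - y) * impostor))

def hard_negative_mask(emb_a, emb_b, same_speaker, margin=1.0):
    """Assumed selection rule: different-speaker pairs still inside the margin."""
    dist = pair_distances(emb_a, emb_b)
    same = np.asarray(same_speaker, dtype=bool)
    return (~same) & (dist < margin)

def speaker_model(utterance_embeddings):
    """Speaker model as the mean of that speaker's utterance embeddings."""
    return np.asarray(utterance_embeddings, dtype=float).mean(axis=0)

def cosine_score(model, test_embedding):
    """Cosine similarity between a speaker model and a test embedding."""
    model = np.asarray(model, dtype=float)
    test_embedding = np.asarray(test_embedding, dtype=float)
    denom = np.linalg.norm(model) * np.linalg.norm(test_embedding)
    return float(model @ test_embedding / denom)
```

Pairs flagged by `hard_negative_mask` are the only different-speaker pairs that contribute a non-zero impostor term, which is one reason a selection step of this kind can make each training batch more informative.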

Experimental results

Research questions

RQ1: Can a Siamese CNN architecture trained on MFCCs outperform traditional speaker verification systems like GMM-UBM and i-vector?
RQ2: Does fine-tuning a pre-trained background model via Siamese learning improve speaker representation quality compared to simple averaging of embeddings?
RQ3: How effective is active pair selection in enhancing the discriminative power of the learned embedding space?
RQ4: Can end-to-end training of a CNN for speaker verification achieve better performance than two-stage approaches?
RQ5: What is the impact of using a margin-based contrastive loss on the generalization of speaker embeddings?

Key findings

The proposed method achieved a 10.5% Equal Error Rate (EER) on the VoxCeleb test set, 6.6 percentage points below the GMM-UBM baseline (17.1% EER).
The i-vector system with PLDA achieved 11.5% EER, while the proposed CNN-256 with pair selection achieved 10.5% EER, an improvement of 1.0 percentage point.
The Siamese fine-tuning strategy reduced EER by 0.8 percentage points compared to the CNN-2048 baseline (11.3% EER versus 10.5% EER), showing the benefit of discriminative training.
The method outperformed the i-vector + PLDA system, which is considered a strong baseline in speaker verification.
The use of active pair selection during Siamese training contributed to improved convergence and performance over random sampling.
Fine-tuning the entire network without weight freezing led to better generalization than partial fine-tuning, as confirmed by ablation results.
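
For readers unfamiliar with the metric behind these numbers, the sketch below shows one simple way to estimate EER from verification scores. It is an illustrative approximation, not the evaluation code used in the paper: the function name is hypothetical, thresholds are swept only over the observed scores, and the input is assumed to contain at least one target and one impostor trial.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate EER: the operating point where false accepts and false rejects are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)  # True = same-speaker (target) trial
    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # impostor trials wrongly accepted
        frr = np.mean(~accept[labels])   # target trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

A lower EER means the score distributions of same-speaker and different-speaker trials overlap less, which is why it is used above to compare the CNN variants against GMM-UBM and i-vector systems.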

This review was created by AI and reviewed by human editors.