[Paper Review] Attention-Based Models for Text-Dependent Speaker Verification
The paper injects attention mechanisms into an end-to-end text-dependent speaker verification system, showing improvements in EER over a non-attention LSTM baseline, with best results from divided-layer attention and sliding window pooling.
Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that spans the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies of the attention layer and their variants, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
Motivation & Objective
- Improve text-dependent speaker verification by focusing on phoneme-relevant frames using attention mechanisms.
- Compare multiple attention layer topologies and pooling methods within an end-to-end TD-SV framework.
- Quantify improvements in verification accuracy as measured by Equal Error Rate (EER).
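Since every result in this review is reported as EER, a minimal sketch of how that metric can be computed may help. EER is the operating point where the false accept rate equals the false reject rate. The function name, the threshold sweep, and the toy scores below are illustrative assumptions, not the paper's evaluation code.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    This is a simple sketch; it does not interpolate between thresholds.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for th in thresholds:
        # False reject rate: genuine trials scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False accept rate: impostor trials scored at or above the threshold.
        far = sum(s >= th for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores for genuine and impostor trials.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(f"EER: {equal_error_rate(targets, impostors):.2%}")
```

With only a handful of trials the estimate is coarse; on a real evaluation set the two error curves cross much more smoothly.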
Proposed method
- Use an end-to-end LSTM-based d-vector framework for TD-SV with keyword-based segments.
- Introduce attention layers to compute frame-wise weights and form a weighted d-vector.
- Explore scoring functions: bias-only, linear, shared-parameter linear, non-linear, and shared-parameter non-linear.
- Propose attention layer variants: cross-layer attention and divided-layer attention.
- Apply attention weights pooling methods: no pooling, sliding window maxpooling, and global top-K maxpooling.
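The steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the review does not give the formulas, so the shared-parameter linear score `e_t = w . h_t + b`, the choice of which half of a divided-layer output is scored, the pooling details, and all function names and toy values are assumptions. Frame outputs stand in for the last LSTM layer.

```python
import math


def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_divided_layer(frames):
    """Divided-layer variant (assumed form): split each frame output in half.
    The first half is summed into the d-vector, the second half is scored."""
    half = len(frames[0]) // 2
    return [h[:half] for h in frames], [h[half:] for h in frames]


def shared_linear_scores(frames, w, b):
    """Shared-parameter linear scoring (assumed form): e_t = w . h_t + b,
    with the same w and b used for every frame."""
    return [sum(wi * hi for wi, hi in zip(w, h)) + b for h in frames]


def sliding_window_maxpool(weights, window, step):
    """Keep the largest weight in each full window and zero the rest.
    Frames not covered by a full window are zeroed; no renormalization (assumed)."""
    pooled = [0.0] * len(weights)
    last_start = max(len(weights) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = weights[start:start + window]
        best = start + chunk.index(max(chunk))
        pooled[best] = weights[best]
    return pooled


def top_k_maxpool(weights, k):
    """Keep the k largest weights over the whole sequence and zero the rest."""
    order = sorted(range(len(weights)), key=lambda t: weights[t], reverse=True)
    keep = set(order[:k])
    return [w if t in keep else 0.0 for t, w in enumerate(weights)]


def weighted_d_vector(frames, weights):
    """Weighted sum of frame outputs: d = sum_t alpha_t * h_t."""
    dim = len(frames[0])
    return [sum(a * h[i] for a, h in zip(weights, frames)) for i in range(dim)]


# Toy input: four frames of 4-dimensional outputs (hypothetical values).
frames = [
    [0.2, 0.1, 1.0, 0.0],
    [0.9, 0.4, 0.0, 2.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.1, 0.8, 1.5, 0.0],
]
values, keys = split_divided_layer(frames)
weights = softmax(shared_linear_scores(keys, w=[0.5, 1.0], b=0.0))
pooled = sliding_window_maxpool(weights, window=2, step=2)
d_vector = weighted_d_vector(values, pooled)
print(d_vector)
```

Swapping `sliding_window_maxpool` for `top_k_maxpool`, or skipping pooling entirely, gives the other two pooling options from the list. The non-linear scoring functions would replace `shared_linear_scores` with a small feed-forward scorer.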
Experimental results
Research questions
- RQ1: Does adding attention improve EER over the baseline end-to-end TD-SV model?
- RQ2: Which attention scoring function yields the best performance?
- RQ3: Do attention layer variants (cross-layer, divided-layer) provide advantages over basic attention?
- RQ4: Does pooling attention weights (sliding window or top-K) further improve verification performance?
Key findings
- Attention-based models reduce average EER relative to the baseline: from 1.72% to 1.63% with basic attention, and further to 1.56% with divided-layer attention and 1.48% with sliding window maxpooling.
- Shared-parameter non-linear attention with divided-layer connection yields better average EER than other configurations (1.56% vs 1.63% for basic).
- Divided-layer attention outperforms cross-layer attention across evaluation sets.
- Sliding window maxpooling on attention weights improves EER to 1.48% average, outperforming no-pooling and top-K pooling.
- Best practice combination achieves a 14% relative improvement over the non-attention baseline (1.72% to 1.48%).
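The relative improvements implied by these averages can be checked with a few lines of arithmetic. The EER values come from the findings above; the helper name is illustrative.

```python
def relative_improvement(baseline_eer, new_eer):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_eer - new_eer) / baseline_eer


BASELINE = 1.72  # average EER (%) of the non-attention LSTM baseline

results = [
    ("basic attention", 1.63),
    ("divided-layer attention", 1.56),
    ("sliding window maxpooling", 1.48),
]
for name, eer in results:
    gain = relative_improvement(BASELINE, eer)
    print(f"{name}: {eer:.2f}% EER, {gain:.1%} relative improvement")
```

The last line works out to (1.72 - 1.48) / 1.72, which is about 14%, matching the headline figure.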
This review was created by AI and reviewed by human editors.