QUICK REVIEW

[Paper Review] BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

Hossein Zeinali, Shuai Wang|arXiv (Cornell University)|Oct 16, 2019
Speech Recognition and Synthesis | References: 12 | Cited by: 79
One-Sentence Summary

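Describes how the BUT team fused four CNN-based systems (x-vector and ResNet34 variants) for VoxSRC 2019, with the Fixed and Open condition submissions reaching 1.42% and 1.26% EER, respectively.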

ABSTRACT

In this report, we describe the submission of Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNN and are based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between Fixed and Open systems lies in the used training data and fusion strategy. The best systems for Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set respectively.

Research Motivation and Objectives

  • Showcase Brno University of Technology (BUT) submissions for VoxSRC 2019 in both Fixed and Open tracks.
  • Compare performance of 4 CNN-based embedding systems (x-vector and ResNet34 variants) under different training data regimes.
  • Analyze backends, fusion strategies, and calibration to achieve competitive EER on VoxCeleb test sets.

Proposed Method

  • Use 4 CNN-based embedding networks (two x-vector TDNN variants with PLDA backend, and two ResNet34 variants with cosine backend).
  • Experiment with additive angular margin loss to fine-tune select networks.
  • Train on the VoxCeleb-2 development set with substantial augmentation (RIR, MUSAN) for the Fixed condition; expand the training data for the Open condition to include VoxCeleb-1, LibriSpeech, and DeepMine.
  • Apply Gaussian PLDA and cosine scoring with adaptive score normalization; calibrate and fuse system scores under Fixed (weighted average) and Open (logistic regression) conditions.
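
To make the cosine backend with adaptive score normalization more concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the BUT team's implementation: the function names (`cosine`, `adaptive_snorm`), the toy 3-dimensional embeddings, the cohort, and the `top_n` value are all hypothetical, and the summary does not state which cohort or cohort size the authors used. The idea shown is the common adaptive S-norm recipe: each side of a trial is normalized by the mean and standard deviation of its own highest-scoring cohort comparisons, and the two normalized scores are averaged.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def _top_stats(embedding, cohort, top_n):
    """Mean and std of the top_n highest cohort scores for one embedding."""
    scores = sorted((cosine(embedding, c) for c in cohort), reverse=True)[:top_n]
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    # Floor the std so a degenerate cohort cannot cause division by zero.
    return mean, max(math.sqrt(var), 1e-8)

def adaptive_snorm(enroll, test, cohort, top_n=2):
    """Symmetric score normalization using per-utterance top_n cohort scores.

    top_n=2 is only suitable for this toy cohort (assumption); real systems
    typically select from a much larger cohort.
    """
    raw = cosine(enroll, test)
    mu_e, sd_e = _top_stats(enroll, cohort, top_n)
    mu_t, sd_t = _top_stats(test, cohort, top_n)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)

# Toy data (hypothetical): a small cohort and three embeddings.
cohort = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
enroll = [1.0, 0.1, 0.0]
test_same = [1.0, 0.2, 0.0]   # close to the enrollment embedding
test_diff = [0.0, 0.1, 1.0]   # far from the enrollment embedding

score_same = adaptive_snorm(enroll, test_same, cohort)
score_diff = adaptive_snorm(enroll, test_diff, cohort)
```

In this toy setup the trial with the nearby test embedding receives a higher normalized score than the trial with the distant one. The normalization matters in practice because it puts scores from different enrollment and test utterances on a comparable scale before calibration and fusion.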

Experimental Results

Research Questions

  • RQ1: What is the impact of combining multiple CNN-based embeddings (x-vector and ResNet34) on VoxSRC 2019 performance under Fixed and Open conditions?
  • RQ2: How do backends (PLDA vs cosine) and augmentation strategies affect verification accuracy?
  • RQ3: What fusion and calibration strategies yield the best EER and MinDCF on VoxCeleb test sets?
  • RQ4: How does training data choice (Fixed: VoxCeleb-2 only; Open: VoxCeleb-1/2, LibriSpeech, DeepMine) influence results?
  • RQ5: Do additive angular margin losses improve discriminability for the ResNet and x-vector systems when fine-tuned?

Key Findings

  • Best fixed-condition fusion achieved 1.42% EER on the challenge evaluation set.
  • Best open-condition fusion achieved 1.26% EER on the evaluation set.
  • Open-condition systems trained with broader data (VoxCeleb-1/2, LibriSpeech, DeepMine) outperform fixed-condition counterparts in some metrics due to exposure to more diverse data.
  • ResNet34-based embeddings with cosine scoring and adaptive score normalization perform strongly in open setups.
  • Fusion (weighted averaging for fixed; logistic-regression calibrated fusion for open) provides substantial gains over individual systems.
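
The findings above are reported as equal error rate (EER) after score fusion. As a rough illustration of both ideas, the sketch below fuses per-trial scores from several systems by weighted averaging and then estimates the EER with a simple threshold sweep. The function names, weights, and scores are made up for the example and are not taken from the paper; the logistic-regression fusion used for the Open condition is not shown, and evaluation toolkits usually interpolate the error curves rather than sweeping observed scores.

```python
def weighted_average_fusion(system_scores, weights):
    """Fuse per-trial scores from several systems with a weighted average.

    system_scores: one list of trial scores per system, all the same length.
    weights: one non-negative weight per system (hypothetical values here).
    """
    total = sum(weights)
    n_trials = len(system_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, system_scores)) / total
        for i in range(n_trials)
    ]

def compute_eer(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Made-up scores from two systems on four target and four non-target trials.
targets = weighted_average_fusion(
    [[0.9, 0.8, 0.7, 0.3], [0.9, 0.8, 0.7, 0.3]], weights=[0.6, 0.4]
)
nontargets = weighted_average_fusion(
    [[0.6, 0.4, 0.2, 0.1], [0.6, 0.4, 0.2, 0.1]], weights=[0.6, 0.4]
)
eer = compute_eer(targets, nontargets)
```

The EER is the operating point where the false acceptance and false rejection rates are equal, so a 1.42% EER means roughly 1.42% of target trials are rejected and 1.42% of non-target trials are accepted at that threshold.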


This review was generated by AI and checked by human editors.