QUICK REVIEW

[Paper Review] Unified Hypersphere Embedding for Speaker Recognition

Mahdi Hajibabaei, Dengxin Dai|arXiv (Cornell University)|Jul 22, 2018

Speech Recognition and Synthesis | Cited by 51

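One-Sentence Summary

This paper proposes a unified hypersphere embedding framework for text-independent speaker recognition that improves both identification and verification, without additional data or deeper models, by using data augmentation, tuning the embedding dimensionality, and introducing a new logistic margin loss.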

ABSTRACT

Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging dataset and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without use of extra data or deeper and more complex models by augmenting the training and testing data, finding the optimal dimensionality of embedding space and use of more discriminative loss functions. Results of experiments on VoxCeleb dataset suggest that: (i) Simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%. (ii) Lower dimensional embeddings are more suitable for verification. (iii) Use of proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.

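Motivation and Objectives

Improve speaker identification and verification accuracy without using additional data or deeper models.
Explore data augmentation techniques that can be applied at both training and test time.
Determine the optimal embedding dimensionality for the verification and identification tasks.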

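Proposed Method

Extract STFT-based features from 3-second crops, and augment utterances by repeating them or reversing them in time.
Use ResNet-20 as the embedding network to produce 512-dimensional embeddings.
Train with a range of discriminative loss functions, including Softmax, A-Softmax, AM-Softmax, and the proposed logistic margin loss.
Evaluate the embeddings on VoxCeleb using Top-1/Top-5 accuracy for identification and EER/Cdet for verification.
Compare embedding dimensionalities (64–512) to assess the trade-off between identification and verification performance.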
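The repetition and time-reversal augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `augment_utterance`, the `reverse_prob` parameter, and the 16 kHz sample rate used in the usage note are assumptions introduced here, and the sketch operates on the raw waveform before any STFT features are computed.

```python
import numpy as np

def augment_utterance(wave, target_len, rng, reverse_prob=0.5):
    """Return a fixed-length crop of `wave`, repeating it if too short
    and reversing it in time with probability `reverse_prob`.

    All names and defaults are hypothetical; the review does not give them.
    """
    if wave.ndim != 1 or wave.size == 0:
        raise ValueError("expected a non-empty 1-D waveform")

    # Repetition: tile the utterance so a full-length crop always exists.
    repeats = int(np.ceil(target_len / wave.size))
    tiled = np.tile(wave, repeats)

    # Random crop of exactly `target_len` samples.
    start = rng.integers(0, tiled.size - target_len + 1)
    crop = tiled[start:start + target_len]

    # Random time reversal of the cropped segment.
    if rng.random() < reverse_prob:
        crop = crop[::-1]

    return crop.copy()
```

As a usage example, a 3-second crop at an assumed 16 kHz sample rate would be `augment_utterance(wave, 3 * 16000, np.random.default_rng(0))`. Applying the same function to several crops of a test utterance and averaging the resulting embeddings is one plausible way to use the augmentation at test time, though the review does not spell out that procedure.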

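Experimental Results

Research Questions

RQ1: Does augmentation by repetition and time reversal improve identification and verification without additional data?
RQ2: What is the optimal embedding dimensionality for speaker verification and for identification?
RQ3: Which discriminative loss function gives the best identification and verification performance with this architecture?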

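Key Findings

Augmentation applied at both training and test time reduces identification error by up to about 18%.
Lower embedding dimensionalities (e.g., 64–128) favor verification, while 256–512 dimensions can be best for identification.
The logistic margin loss, with an independent scale and bias per class, achieves the strongest identification accuracy (most notably with 512-dimensional embeddings) while remaining competitive on verification.
Dropout generally improves verification accuracy across several loss functions; in this study, AM-Softmax with dropout performs best on verification.
Compared with other VoxCeleb baselines, the proposed approach with the logistic margin loss often matches or exceeds them on identification while keeping verification results strong.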
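To make the margin-based losses mentioned above more concrete, here is a minimal NumPy sketch of the standard AM-Softmax (additive margin softmax) loss, one of the baselines compared in the paper. It is not the proposed logistic margin loss, whose exact form this review does not give. The function name and the `scale=30.0` and `margin=0.35` defaults are illustrative assumptions, not values reported in this review.

```python
import numpy as np

def am_softmax_loss(embeddings, weights, labels, scale=30.0, margin=0.35):
    """Mean AM-Softmax loss over a batch.

    embeddings: (batch, dim) array, weights: (classes, dim) array,
    labels: (batch,) integer array. Defaults are illustrative only.
    """
    # Project embeddings and class weights onto the unit hypersphere.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)

    # Cosine similarity between every embedding and every class centre.
    cos = e @ w.T
    rows = np.arange(len(labels))

    # Subtract the additive margin from the target-class cosine only.
    logits = scale * cos
    logits[rows, labels] = scale * (cos[rows, labels] - margin)

    # Numerically stable log-softmax followed by cross-entropy.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[rows, labels].mean()
```

Because both the embeddings and the class weights are L2-normalized, the loss depends only on angles, which is what places the embeddings on a hypersphere; the margin forces the target-class cosine to exceed the others by a fixed gap before the loss becomes small.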

This review was generated by AI and checked by a human editor.