QUICK REVIEW

[Paper Review] End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

Yusuke Fujita, Shinji Watanabe | arXiv (Cornell University) | Feb 24, 2020

Speech Recognition and Synthesis | References: 52 | Cited by: 43

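One-Sentence Summary

This paper reformulates speaker diarization as end-to-end, frame-wise multi-label classification and, using permutation-free training, shows that self-attention-based EEND outperforms clustering-based methods and can handle overlapping speech.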

ABSTRACT

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose the End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.

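Motivation and Objectives

Identify and address the limitations of clustering-based diarization methods.
Formulate speaker diarization as an end-to-end multi-label classification problem.
Introduce permutation-free training to resolve the speaker-label permutation ambiguity.
Examine BLSTM and self-attention architectures for end-to-end diarization.
Demonstrate effectiveness on both simulated mixtures and real data.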
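
To make the multi-label formulation concrete, the following minimal sketch turns speaker segments into a frame-wise 0/1 label matrix with one column per speaker. The function name `segments_to_labels`, the 0.1 s frame shift, and the toy segments are illustrative assumptions, not values taken from the paper. Overlap appears as a frame with more than one active column, which a single-label clustering assignment cannot express.

```python
import numpy as np


def segments_to_labels(segments, num_speakers, num_frames, frame_shift=0.1):
    """Convert (speaker_index, start_sec, end_sec) segments to a (T, C) 0/1 matrix.

    frame_shift is an assumed, illustrative value in seconds.
    """
    labels = np.zeros((num_frames, num_speakers), dtype=np.int64)
    for spk, start, end in segments:
        first = int(round(start / frame_shift))
        last = min(int(round(end / frame_shift)), num_frames)
        labels[first:last, spk] = 1  # several columns may be 1 in the same frame
    return labels


# Toy recording: speaker 0 talks from 0.0 s to 0.5 s, speaker 1 from 0.3 s to 0.8 s.
segments = [(0, 0.0, 0.5), (1, 0.3, 0.8)]
labels = segments_to_labels(segments, num_speakers=2, num_frames=10)

active = labels.sum(axis=1)                 # number of active speakers per frame
overlap_frames = np.flatnonzero(active == 2)
silence_frames = np.flatnonzero(active == 0)
print(labels.T)
print("overlap frames:", overlap_frames, "silence frames:", silence_frames)
```

Silence, single-speaker speech, and overlap are all rows of the same matrix, so one classifier output per speaker per frame covers every case.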

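Proposed Method

Formulate the output Y as frame-wise multi-label predictions for C speakers.
Introduce a permutation-free loss that minimizes diarization error regardless of the order in which speaker labels are assigned.
Compare BLSTM-based EEND, trained with a Deep Clustering (DC) objective, against self-attention-based EEND.
Use two architectures: BLSTM-EEND with a DC loss, and SA-EEND built from encoder blocks with multi-head self-attention.
Train on the simulated mixture sets SimBeta2 and SimLarge and on the Real and Comb sets, which draw on real recordings, and perform domain adaptation (CALLHOME, CSJ).
Evaluate with DER, counting overlapping speech and applying a collar tolerance.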
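
The permutation-free loss listed above can be sketched as follows: compute a frame-wise binary cross-entropy for every ordering of the speaker columns and keep the smallest value. This is a NumPy illustration under assumptions, not the authors' implementation; the helper names `bce` and `permutation_free_loss` and the toy posteriors are hypothetical, and the sketch permutes the label columns (permuting the outputs instead would be equivalent).

```python
from itertools import permutations

import numpy as np


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between (T, C) posteriors and 0/1 labels."""
    probs = np.clip(probs, eps, 1.0 - eps)
    loss = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))
    return float(loss.mean())


def permutation_free_loss(probs, labels):
    """Return (minimum BCE, best permutation) over all speaker-column orderings."""
    num_speakers = labels.shape[1]
    best_loss, best_perm = None, None
    for perm in permutations(range(num_speakers)):
        loss = bce(probs, labels[:, list(perm)])
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy case: the network is right about who speaks when, but in swapped columns.
labels = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
probs = 0.1 + 0.8 * labels[:, ::-1]

naive = bce(probs, labels)                        # penalizes the arbitrary ordering
loss, perm = permutation_free_loss(probs, labels)
print(f"naive={naive:.4f} permutation-free={loss:.4f} perm={perm}")
```

The exhaustive search costs C! loss evaluations, which is cheap for the small speaker counts in this kind of sketch but grows quickly with C.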

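Experimental Results

Research Questions

RQ1: Can end-to-end diarization outperform conventional clustering-based methods on both simulated and real data?
RQ2: Does self-attention offer an advantage over BLSTM for end-to-end diarization, particularly when speech overlaps?
RQ3: How effectively does permutation-free training resolve the speaker-label permutation ambiguity across time frames?
RQ4: How does EEND perform across different overlap conditions and after adaptation to real conversation domains?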

Key Findings

DER (%) by model (training set in parentheses) and test set; CH = CALLHOME.
Model	SimBeta2	SimBeta3	SimBeta5	CH	CSJ
i-vector	33.74	30.93	25.96	12.10	27.99
x-vector	28.77	24.46	19.78	11.53	22.96
BLSTM-EEND (SimBeta2)	12.28	14.36	19.69	26.03	39.33
BLSTM-EEND (Real)	36.23	37.78	40.34	23.07	25.37
SA-EEND (SimBeta2)	7.91	8.51	9.51	13.66	22.31
SA-EEND (Real)	32.72	33.84	36.78	10.76	20.50
SA-EEND (SimLarge)	6.81	6.60	6.40	14.03	21.84
SA-EEND (Comb)	6.92	6.54	6.38	11.99	22.26

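Self-attention EEND (SA-EEND) outperforms the clustering-based baselines, most clearly under heavy overlap. On simulated mixtures the margin is wide (7.91% vs. 28.77% for x-vector on SimBeta2, with SA-EEND trained on SimBeta2); on real data it is narrower (SA-EEND trained on Real: 10.76% vs. 11.53% on CH, 20.50% vs. 22.96% on CSJ).
SA-EEND trained on SimLarge reaches 6.40% to 6.81% DER on the simulated test sets, and 14.03% on CALLHOME (CH) and 21.84% on CSJ.
BLSTM-EEND trained on SimBeta2 beats the clustering-based baselines on simulated data (12.28% vs. 28.77% for x-vector on SimBeta2), but on real data it trails SA-EEND (CH: 23.07% vs. 10.76%, both trained on Real).
Domain adaptation (CALLHOME) lowers SA-EEND's DER further; the SA-EEND (Real) row reports 10.76% on CH, against 13.66% for the SimBeta2-trained model.
Multi-condition training (SimLarge, Comb) improves robustness across overlap conditions: the SimLarge model stays between 6.40% and 6.81% on all three simulated sets, whereas the SimBeta2-only model ranges from 7.91% to 9.51%.
When trained on suitable data, SA-EEND achieves lower DER than both x-vector and i-vector clustering on many of the test sets.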
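
For readers unfamiliar with the metric behind these numbers, the sketch below computes a simplified frame-level DER as (missed speech + false alarm + speaker confusion) divided by total reference speech, using the speaker mapping that maximizes agreement. It is an assumption-laden illustration rather than the scoring tool used in the paper: the function name `frame_der` and the toy arrays are hypothetical, no collar is applied, and the reference and hypothesis are assumed to have the same number of speaker columns.

```python
from itertools import permutations

import numpy as np


def frame_der(ref, hyp):
    """Simplified frame-level DER for (T, C) 0/1 integer arrays (no collar)."""
    n_ref = ref.sum(axis=1)  # reference speakers active in each frame
    n_hyp = hyp.sum(axis=1)  # hypothesized speakers active in each frame
    miss = np.maximum(n_ref - n_hyp, 0).sum()
    false_alarm = np.maximum(n_hyp - n_ref, 0).sum()
    # Correctly attributed speaker-frames under the best global speaker mapping.
    best_correct = max(
        int((ref & hyp[:, list(perm)]).sum())
        for perm in permutations(range(hyp.shape[1]))
    )
    confusion = np.minimum(n_ref, n_hyp).sum() - best_correct
    return float((miss + false_alarm + confusion) / n_ref.sum())


# Toy case: columns are swapped, one overlapped speaker is missed (frame 2),
# and one silent frame is labeled as speech (frame 4).
ref = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 0]])
hyp = np.array([[0, 1], [0, 1], [0, 1], [1, 0], [1, 0]])

der = frame_der(ref, hyp)
print(f"DER = {100 * der:.1f}%")
```

Because overlapped frames contribute one unit of reference speech per active speaker, a system that outputs only one speaker per frame accumulates missed speech on every overlapped frame.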

This review was generated by AI and checked by a human editor.