QUICK REVIEW

[Paper Review] End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

Yusuke Fujita, Shinji Watanabe | arXiv (Cornell University) | 2020. 02. 24.

Speech Recognition and Synthesis | References: 52 | Citations: 43

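One-line summary

This paper recasts speaker diarization as end-to-end, frame-level multi-label classification trained with a permutation-free objective, and shows that self-attention-based EEND outperforms clustering-based methods while handling overlapped speech.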

ABSTRACT

The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose the End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.
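The abstract's point that self-attention is conditioned on all frames can be illustrated with a small sketch. This is a toy under stated assumptions, not the SA-EEND encoder: it uses a single head, sets queries, keys, and values equal to the input frames (no learned projections), and runs on made-up 2-D frame embeddings; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention with Q = K = V = frames.

    Assumption for illustration only: a real encoder block would apply
    learned projections, several heads, and a feed-forward layer.
    """
    dim = len(frames[0])
    weights, outputs = [], []
    for query in frames:
        # Each frame is scored against every frame in the recording.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, frames)) for d in range(dim)]
        )
    return outputs, weights


# Toy embeddings: frames 0 and 3 resemble one speaker, frames 1 and 2 another.
frames = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
outputs, weights = self_attention(frames)
```

In this toy input, frame 0 gives the distant frame 3 the same weight as itself and much less weight to its immediate neighbor, frame 1. A recurrent layer would instead have to carry that information step by step through the intermediate hidden states.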

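Motivation and goals

Identifies and addresses the limitations of clustering-based diarization methods.
Formulates speaker diarization as an end-to-end multi-label classification problem.
Introduces permutation-free training to resolve the speaker-label permutation problem.
Explores BLSTM and self-attention architectures for end-to-end diarization.
Demonstrates effectiveness on simulated mixtures and real conversation data.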

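Proposed method

Formulates the output Y as frame-level multi-label targets for C speakers.
Introduces a permutation-free loss that minimizes diarization error over all speaker permutations.
Compares BLSTM-based EEND with a Deep Clustering objective against self-attention-based EEND.
Uses two architectures: BLSTM-EEND with a DC loss, and SA-EEND with encoder blocks and multi-head self-attention.
Trains on simulated mixtures (SimBeta2, SimLarge) and real data (Real, Comb), and performs domain adaptation (CALLHOME, CSJ).
Evaluates with DER, counting overlapped speech and applying a collar tolerance.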
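The multi-label formulation and the permutation-free loss listed above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes binary cross-entropy as the per-frame loss, searches all C! speaker permutations exhaustively, thresholds posteriors at 0.5, and uses made-up values; the names `bce` and `permutation_free_loss` are hypothetical.

```python
import itertools
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy between a probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def permutation_free_loss(probs, labels):
    """Return (loss, best_perm) for T x C posteriors and T x C 0/1 labels.

    The loss is the mean frame-level BCE under the ordering of reference
    speakers that gives the smallest total (assumed loss for this sketch).
    """
    num_frames = len(probs)
    num_speakers = len(probs[0])
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(num_speakers)):
        total = 0.0
        for t in range(num_frames):
            for c in range(num_speakers):
                # Output channel c is compared with reference speaker perm[c].
                total += bce(probs[t][c], labels[t][perm[c]])
        loss = total / (num_frames * num_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: 4 frames, 2 speakers; the third frame is overlapped speech.
labels = [[1, 0], [1, 0], [1, 1], [0, 1]]
# Made-up posteriors that emit the two speakers in swapped channel order.
probs = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.7], [0.9, 0.1]]

loss, perm = permutation_free_loss(probs, labels)
# Assumed 0.5 threshold: each speaker is decided independently per frame.
decisions = [[int(p > 0.5) for p in frame] for frame in probs]
```

Here the minimum is reached at the swapped ordering `(1, 0)`, so the network is not penalized for which output channel it assigns to which speaker. Because each channel is thresholded independently, both speakers can be marked active in the third frame, which is how a multi-label output represents overlap.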

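Experimental results

Research questions

RQ1: Can end-to-end diarization outperform traditional clustering-based methods on simulated and real data?
RQ2: Does self-attention offer an advantage over BLSTM for end-to-end diarization, especially under overlap?
RQ3: How effectively does permutation-free training resolve the speaker-label permutation problem across time frames?
RQ4: How does EEND perform under different overlap conditions and with domain adaptation to real conversations?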

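Main results

DER (%) on each test set; lower is better. For the EEND rows, the training data is given in parentheses.

Model	SimBeta2	SimBeta3	SimBeta5	CH	CSJ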
i-vector	33.74	30.93	25.96	12.10	27.99
x-vector	28.77	24.46	19.78	11.53	22.96
BLSTM-EEND (SimBeta2)	12.28	14.36	19.69	26.03	39.33
BLSTM-EEND (Real)	36.23	37.78	40.34	23.07	25.37
SA-EEND (SimBeta2)	7.91	8.51	9.51	13.66	22.31
SA-EEND (Real)	32.72	33.84	36.78	10.76	20.50
SA-EEND (SimLarge)	6.81	6.60	6.40	14.03	21.84
SA-EEND (Comb)	6.92	6.54	6.38	11.99	22.26

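Self-attention-based SA-EEND substantially reduces DER relative to the clustering-based baselines on simulated mixtures, with the largest gap at high overlap (e.g., 7.91% vs. 28.77% for x-vector on SimBeta2).
SA-EEND trained on SimLarge achieves DERs of 6.40% to 6.81% on the simulated test sets, and 14.03% (CH) and 21.84% (CSJ) on the real test sets.
BLSTM-EEND trained on SimBeta2 improves on the clustering baselines on the simulated test sets, but trails SA-EEND on the real test sets (CH, CSJ).
Domain adaptation (CALLHOME) further reduces SA-EEND's DER (e.g., 10.76% on CH in the SA-EEND (Real) row) and outperforms the non-adapted models.
Multi-condition training (SimLarge, Comb) improves robustness across varying overlap scenarios.
When trained on appropriate data, SA-EEND outperforms the x-vector and i-vector clustering baselines in DER on most test sets.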
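To put the gaps in the table into relative terms, the short calculation below computes the relative DER reduction of SA-EEND over the x-vector baseline. The DER values are copied from the table; the choice of which rows to compare and the helper name `relative_reduction` are assumptions made for illustration.

```python
# DER (%) values copied from the results table.
der = {
    "x-vector": {"SimBeta2": 28.77, "CH": 11.53, "CSJ": 22.96},
    "SA-EEND (SimBeta2)": {"SimBeta2": 7.91, "CH": 13.66, "CSJ": 22.31},
    "SA-EEND (Real)": {"SimBeta2": 32.72, "CH": 10.76, "CSJ": 20.50},
}


def relative_reduction(baseline, system):
    """Relative DER reduction in percent; negative means the system is worse."""
    return 100.0 * (baseline - system) / baseline


baseline = der["x-vector"]
sim_gain = relative_reduction(baseline["SimBeta2"], der["SA-EEND (SimBeta2)"]["SimBeta2"])
ch_gain = relative_reduction(baseline["CH"], der["SA-EEND (Real)"]["CH"])
csj_gain = relative_reduction(baseline["CSJ"], der["SA-EEND (Real)"]["CSJ"])

print(f"SimBeta2: {sim_gain:.1f}%  CH: {ch_gain:.1f}%  CSJ: {csj_gain:.1f}%")
# prints "SimBeta2: 72.5%  CH: 6.7%  CSJ: 10.7%"
```

The reduction is far larger on the simulated set than on the real sets, which matches the observation that the advantage over clustering depends on training with data that resembles the test condition.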


This review was generated by AI and reviewed by a human editor.