QUICK REVIEW

[Paper Review] TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation.

Yi Luo, Nima Mesgarani|arXiv (Cornell University)|Sep 20, 2018

Speech and Audio Processing|References: 47|Citations: 74

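One-Line Summary

TasNet is an end-to-end, time-domain deep learning framework that bypasses time-frequency representations. It combines a convolutional encoder, masks estimated by a temporal convolutional network of dilated convolutions, and a linear decoder. It outperforms ideal time-frequency masks while keeping latency low and the model small, which enables real-time, high-accuracy speaker separation.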

ABSTRACT

Robust speech processing in multitalker acoustic environments requires automatic speech separation. While single-channel, speaker-independent speech separation methods have recently seen great progress, the accuracy, latency, and computational cost of speech separation remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of spectrogram representations for speech separation, and the long latency in calculating the spectrogram. To address these shortcomings, we propose the time-domain audio separation network (TasNet), which is a deep learning autoencoder framework for time-domain speech separation. TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. The masks are found using a temporal convolutional network consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward actualizing speech separation for real-world speech processing technologies.

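Research Motivation and Objectives

Overcome the limitations of time-frequency representations, such as the decoupling of phase and magnitude and the long latency.
Develop a real-time, low-latency speech separation system suitable for practical deployment.
Use an end-to-end deep learning approach to reach separation accuracy that exceeds that of ideal time-frequency masks.
Reduce computational cost and model size compared with conventional time-frequency-based methods.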

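Proposed Method

TasNet uses a convolutional encoder to transform the raw waveform into a representation optimized for speaker separation.
A temporal convolutional network built from dilated convolutions estimates a mask for each speaker; applying the mask to the encoder output extracts that speaker's component.
The masked representation is reconstructed into a waveform by a linear decoder, which makes end-to-end training possible.
Dilated convolutions enlarge the receptive field without a significant increase in the number of layers or parameters, so the network can model long-term dependencies in speech.
The whole system is trained end to end with a loss function that minimizes the difference between the estimated and target waveforms.
Avoiding spectrogram computation removes the decoupling of phase and magnitude and reduces latency.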
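
The encoder, mask, and decoder steps listed above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the sizes WIN, HOP, and N_BASIS, the random basis weights, the ReLU on the encoder output, and the placeholder masks (which stand in for the output of the temporal convolutional network) are all hypothetical.

```python
import random

WIN, HOP, N_BASIS = 4, 2, 6  # hypothetical sizes, chosen only for illustration


def frame(signal, win, hop):
    """Split a waveform into overlapping frames of length `win`."""
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]


def encode(frames, enc_basis):
    """Encoder: project each frame onto the basis, then apply ReLU (assumed)."""
    return [[max(0.0, sum(b * x for b, x in zip(row, f))) for row in enc_basis]
            for f in frames]


def apply_mask(features, mask):
    """Weight the encoder output element-wise to keep one speaker's share."""
    return [[m * w for m, w in zip(m_row, w_row)]
            for m_row, w_row in zip(mask, features)]


def decode(features, dec_basis, win, hop, length):
    """Linear decoder: map features back to frames and overlap-add them."""
    out = [0.0] * length
    for t, feat in enumerate(features):
        for n in range(win):
            out[t * hop + n] += sum(w * row[n] for w, row in zip(feat, dec_basis))
    return out


rng = random.Random(0)
mixture = [0.5, -0.2, 0.1, 0.4, -0.3, 0.2, 0.0, -0.1]
# Random weights stand in for the learned encoder and decoder bases.
enc_basis = [[rng.uniform(-1, 1) for _ in range(WIN)] for _ in range(N_BASIS)]
dec_basis = [[rng.uniform(-1, 1) for _ in range(WIN)] for _ in range(N_BASIS)]

features = encode(frame(mixture, WIN, HOP), enc_basis)

# Placeholder masks; in TasNet these would come from the separation network.
mask_a = [[rng.random() for _ in row] for row in features]
mask_b = [[1.0 - m for m in row] for row in mask_a]

est_a = decode(apply_mask(features, mask_a), dec_basis, WIN, HOP, len(mixture))
est_b = decode(apply_mask(features, mask_b), dec_basis, WIN, HOP, len(mixture))
unmasked = decode(features, dec_basis, WIN, HOP, len(mixture))
```

Because the placeholder masks sum to one and the decoder is linear, est_a and est_b add up to the unmasked reconstruction. In this sketch that is a property of the toy setup only; no claim is made that the original model constrains its masks this way.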

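Experimental Results

Research Questions

RQ1: Can an end-to-end time-domain approach outperform time-frequency-based speech separation methods?
RQ2: Can a model trained in the time domain exceed the performance of ideal time-frequency masks?
RQ3: Can a time-domain system achieve lower latency and a smaller model size than conventional methods?
RQ4: How effectively can dilated convolutions model long-term speech dependencies in speaker separation?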
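
Some back-of-the-envelope arithmetic helps frame RQ3 and RQ4. The sketch below computes the receptive field of a stack of 1-D convolutions and the lower bound on latency imposed by the analysis window. Every number in it is an assumption made for illustration, since this review does not state the configuration: kernel size 3, dilations 1 to 128 repeated three times, an 8 kHz sampling rate, a 16-sample encoder window with an 8-sample hop, and a 256-sample STFT window for comparison.

```python
def receptive_field(kernel_size, dilations):
    """Frames seen by one output of a stack of stride-1 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def span_seconds(n_frames, win, hop, sample_rate):
    """Duration of waveform covered by `n_frames` consecutive frames."""
    return ((n_frames - 1) * hop + win) / sample_rate


def window_latency_ms(win, sample_rate):
    """A frame-based system must buffer at least one full window."""
    return 1000.0 * win / sample_rate


SAMPLE_RATE = 8000             # assumed sampling rate in Hz
ENC_WIN, ENC_HOP = 16, 8       # assumed encoder window and hop in samples
STFT_WIN = 256                 # assumed spectrogram window in samples

dilated = [2 ** i for i in range(8)] * 3   # 1, 2, ..., 128, three times
plain = [1] * len(dilated)                 # same depth without dilation

rf_dilated = receptive_field(3, dilated)
rf_plain = receptive_field(3, plain)
context_s = span_seconds(rf_dilated, ENC_WIN, ENC_HOP, SAMPLE_RATE)

enc_latency = window_latency_ms(ENC_WIN, SAMPLE_RATE)
stft_latency = window_latency_ms(STFT_WIN, SAMPLE_RATE)
```

Under these assumed settings, the 24 dilated layers see 1531 frames (about 1.5 seconds of audio), whereas the same 24 layers without dilation see only 49 frames. The window-length bound is 2 ms for the short encoder window and 32 ms for the spectrogram window.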

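Key Findings

TasNet achieves better speech separation performance than state-of-the-art time-frequency methods and, in some cases, exceeds even ideal time-frequency masks.
Processing the raw waveform yields markedly lower latency than spectrogram-based methods.
TasNet's small model size makes it suitable for real-time and resource-constrained applications.
Dilated convolutions enable effective modeling of long-term speech dependencies in the time domain.
End-to-end training in the time domain removes the need for phase reconstruction and avoids suboptimal spectrogram representations.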

This review was generated by AI and checked by a human editor.