QUICK REVIEW

[Paper Review] ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers

Kaizhi Qian, Shuicheng Yan|arXiv (Cornell University)|Apr 20, 2022

Speech Recognition and Synthesis | Citations: 24

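One-line summary

ContentVec adds three speaker-disentanglement mechanisms to HuBERT, acting on the teacher, on the student, and through speaker conditioning, so that speaker variation is removed while content is preserved, which improves content-related downstream tasks.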

ABSTRACT

Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus, and then applying the learned representations to downstream tasks. Since the majority of the downstream tasks of SSL learning in speech largely focus on the content information in speech, the most desirable speech representations should be able to disentangle unwanted variations, such as speaker variations, from the content. However, disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well, and the damage of the latter usually far outweighs the benefit of the former. In this paper, we propose a new SSL method that can achieve speaker disentanglement without severe loss of content. Our approach is adapted from the HuBERT framework, and incorporates disentangling mechanisms to regularize both the teacher labels and the learned representations. We evaluate the benefit of speaker disentanglement on a set of content-related downstream tasks, and observe a consistent and notable performance advantage of our speaker-disentangled representations.

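Motivation and objectives

Motivates the need to disentangle speaker variation in self-supervised speech representations without a severe loss of content.
Proposes the ContentVec framework, which combines HuBERT-style teacher-student training with three disentanglement modules.
Shows that the speaker-disentangled representations hold an advantage on content-related tasks across zero-shot probes and supervised benchmarks.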

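Proposed method

Adapts HuBERT's masked-prediction framework to include three disentanglement modules: (1) disentanglement in the teacher, using voice conversion to remove speaker information; (2) disentanglement in the student, using a SimCLR-style contrastive objective with transformations that alter the speaker during training; (3) speaker conditioning, which injects a speaker embedding into the predictor so that the representation is under less pressure to carry speaker information.
Imposes the contrastive loss on an intermediate layer to actively suppress the flow of speaker information, and applies it symmetrically to two speaker-augmented views of the input.
Gives the predictor access to the speaker embedding so that the student can focus on content, while the teacher labels remain speaker-degraded (that is, with speaker information largely removed).
Trains with the joint loss L = L_pred + lambda * L_contr, where L_pred is the masked-prediction loss conditioned on the speaker embedding and L_contr is the SimCLR-style cross-view contrastive loss.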
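
The joint objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the linear predictor, the use of all other frames as negatives, the temperature, and the random inputs are all assumptions chosen for illustration. Only the overall form L = L_pred + lambda * L_contr and the idea of conditioning the predictor on a speaker embedding come from the review.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def predictor_logits(student_repr, speaker_emb, W):
    # Hypothetical speaker-conditioned predictor: the utterance-level speaker
    # embedding is repeated over time and concatenated with each frame before
    # a single linear layer. student_repr: (T, D), speaker_emb: (S,), W: (D+S, K).
    T = student_repr.shape[0]
    spk = np.broadcast_to(speaker_emb, (T, speaker_emb.shape[0]))
    return np.concatenate([student_repr, spk], axis=1) @ W

def masked_prediction_loss(logits, teacher_labels, mask):
    # L_pred: cross-entropy against the teacher's discrete labels,
    # averaged over the masked frames only.
    logp = log_softmax(logits, axis=-1)
    picked = logp[np.arange(len(teacher_labels)), teacher_labels]
    return -picked[mask].mean()

def contrastive_loss(view_a, view_b, temperature=0.1):
    # L_contr: SimCLR-style loss between two speaker-perturbed views.
    # Assumption: frame t in one view is the positive for frame t in the
    # other view, and every other frame acts as a negative.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    sim = a @ b.T / temperature                      # (T, T) cosine similarities
    idx = np.arange(sim.shape[0])
    loss_ab = -log_softmax(sim, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(sim, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)                 # symmetric over both views

def joint_loss(logits, teacher_labels, mask, view_a, view_b, lam=1.0):
    # L = L_pred + lambda * L_contr
    return (masked_prediction_loss(logits, teacher_labels, mask)
            + lam * contrastive_loss(view_a, view_b))

# Toy usage with random data (all shapes and values are illustrative).
rng = np.random.default_rng(0)
T, D, S, K = 8, 32, 4, 10          # frames, feature dim, speaker dim, label classes
student = rng.normal(size=(T, D))  # stand-in for final-layer student features
speaker = rng.normal(size=S)       # stand-in for a speaker embedding
W = rng.normal(size=(D + S, K)) * 0.1
labels = rng.integers(0, K, size=T)         # stand-in for teacher labels
mask = np.array([True, False] * (T // 2))   # which frames were masked
view_a = student + 0.05 * rng.normal(size=(T, D))  # stand-ins for intermediate-layer
view_b = student + 0.05 * rng.normal(size=(T, D))  # features of the two views
logits = predictor_logits(student, speaker, W)
loss = joint_loss(logits, labels, mask, view_a, view_b, lam=1.0)
print(f"joint loss: {loss:.4f}")
```

In this sketch the two terms pull in complementary directions: the contrastive term rewards features that stay the same when only the speaker changes, while the conditioned prediction term lets the predictor obtain speaker information from the embedding rather than from the features.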

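Experimental results

Research questions

RQ1: Can speaker variation be disentangled during SSL training without causing a substantial loss of content?
RQ2: How do speaker-disentangled SSL features affect content-related downstream tasks?
RQ3: How does each of the three disentanglement mechanisms (teacher, student, conditioning) contribute to performance?
RQ4: Does language-model quality improve when discrete representations derived from the SSL features are used?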

Key findings

Model	ABX(w) ↓	ABX(a) ↓	Lexical ↓	Syntactic ↓	PPX ↓	VERT ↓	AUC ↓
ContentVec	5.13	6.32	33.27	43.95	650.04	46.05	45.01
HuBERT-iter	6.01	7.20	34.00	44.36	739.12	47.55	53.28
HuBERT	6.06	7.37	36.19	46.48	790.17	54.35	75.23
Wav2Vec 2.0	8.70	10.34	35.93	46.40	840.34	58.59	88.83

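ContentVec yields consistent improvements over baselines such as HuBERT and Wav2Vec 2.0 on content-related downstream tasks.
On the zero-shot content probes, ContentVec achieves the best results on the ABX(w), ABX(a), Lexical, and Syntactic metrics, with the largest gains on the phoneme-level tasks.
On the SUPERB content/semantic tasks, ContentVec outperforms HuBERT and HuBERT-iter when the representations are frozen for the downstream task.
ContentVec lowers speaker-identification and dialect-classification accuracy, indicating effective speaker disentanglement and partial dialect disentanglement.
In voice conversion, ContentVec-based representations show higher target-speaker similarity than the baselines.
Ablation studies confirm that all three disentanglement modules (teacher, student, conditioning) are essential for the best performance.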
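
To make the comparison in the table above easier to read, the short sketch below computes, for each metric, which model scores best and how large ContentVec's relative reduction is against a chosen baseline. The numbers are copied from the table. The percentage reductions are derived arithmetic, not figures reported in the review, and the helper names are hypothetical. All metrics are treated as lower-is-better, following the arrows in the table header.

```python
METRICS = ["ABX(w)", "ABX(a)", "Lexical", "Syntactic", "PPX", "VERT", "AUC"]

# Values copied from the results table (lower is better for every column).
RESULTS = {
    "ContentVec":  [5.13, 6.32, 33.27, 43.95, 650.04, 46.05, 45.01],
    "HuBERT-iter": [6.01, 7.20, 34.00, 44.36, 739.12, 47.55, 53.28],
    "HuBERT":      [6.06, 7.37, 36.19, 46.48, 790.17, 54.35, 75.23],
    "Wav2Vec 2.0": [8.70, 10.34, 35.93, 46.40, 840.34, 58.59, 88.83],
}

def relative_reduction(model, baseline):
    # Percentage by which `model` lowers each metric relative to `baseline`.
    return {
        m: 100.0 * (b - x) / b
        for m, x, b in zip(METRICS, RESULTS[model], RESULTS[baseline])
    }

def best_model_per_metric():
    # Name of the model with the lowest (best) value in each column.
    return {
        m: min(RESULTS, key=lambda name: RESULTS[name][i])
        for i, m in enumerate(METRICS)
    }

if __name__ == "__main__":
    best = best_model_per_metric()
    reduction = relative_reduction("ContentVec", "HuBERT")
    for m in METRICS:
        print(f"{m:10s} best: {best[m]:12s} reduction vs HuBERT: {reduction[m]:.2f}%")
```

Swapping the baseline argument to "HuBERT-iter" gives the comparison against the stronger iterated baseline, where the margins are smaller.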


This review was generated by AI and checked by a human editor.