QUICK REVIEW

[Paper Review] SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts

Haibin Wu, Kai-Wei Chang|arXiv (Cornell University)|Jun 3, 2023

Topic Modeling | Citations: 10

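One-Sentence Summary

SpeechGen introduces a unified, textless prompting framework that tunes only the prompt vectors (≈10M parameters), leaving the backbone model untouched, to steer speech language models toward a variety of speech generation tasks.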

ABSTRACT

Large language models (LLMs) have gained considerable attention for Artificial Intelligence Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct adaptation of continuous speech to LLMs that process discrete tokens remains an unsolved challenge, hindering the application of LLMs for speech generation. The advanced speech LMs are in the corner, as that speech signals encapsulate a wealth of information, including speaker and emotion, beyond textual data alone. Prompt tuning has demonstrated notable gains in parameter efficiency and competitive performance on some speech classification tasks. However, the extent to which prompts can effectively elicit generation tasks from speech LMs remains an open question. In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen, with around 10M trainable parameters. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs, which will significantly enhance the capabilities of the framework. The code and demos of SpeechGen will be available on the project website: https://ga642381.github.io/SpeechPrompt/speechgen

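Research Motivation and Objectives

Motivate the use of prompt tuning to elicit generation capabilities from speech LMs.
Develop a unified, textless framework (SpeechGen) that handles multiple speech generation tasks with a small set of trainable prompts.
Present case studies on speech translation, speech inpainting, and speech continuation with Unit mBART as the backbone LM.
Demonstrate the efficiency, transferability, and cost-effectiveness of the prompt-tuning approach for future advanced speech LMs.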

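Proposed Method

Leave the speech LM's parameters unchanged and guide generation by inserting soft prompts into the input.
In the sequence-to-sequence LM, prepend encoder prompts p^E to the encoder input x and decoder prompts p^D to the decoder input y, forming z = [p^E, x; p^D, y].
Apply deep prompt tuning by replacing the initial prompt keys and values (K/V) with trainable prompt vectors, so that the prompts influence attention across multiple layers.
Train only the prompt vectors, using cross-entropy against target discrete units produced by a self-supervised model (e.g., HuBERT), so that the LM generates output units for a unit-based vocoder.
Keep the pipeline textless: convert the input waveform into discrete units, process them with the prompted LM, and decode the output back to a waveform through the vocoder.
Demonstrate the framework on three tasks (speech translation, speech inpainting, and speech continuation) with Unit mBART as the backbone LM.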
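
The deep prompt tuning step above can be illustrated with a minimal, dependency-free sketch of one attention layer. It is an assumption-laden illustration, not the authors' implementation: it uses a single head, random weights, and a prefix-style reading in which the trainable prompt keys and values occupy the first P slots of the key/value sequence. All names (`prompted_attention`, `prompt_k`, `prompt_v`, the sizes T, P, d) are hypothetical.

```python
import math
import random

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[dot(row, col) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def prompted_attention(hidden, w_q, w_k, w_v, prompt_k, prompt_v):
    """Single-head attention with prompt keys/values placed before the
    keys/values computed from the (frozen) backbone projections."""
    q = matmul(hidden, w_q)
    k = prompt_k + matmul(hidden, w_k)  # prompt rows first, then token rows
    v = prompt_v + matmul(hidden, w_v)
    scale = math.sqrt(len(q[0]))
    weights = [softmax([dot(q_row, k_row) / scale for k_row in k]) for q_row in q]
    return matmul(weights, v), weights

def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix standing in for real weights or embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, P, d = 6, 3, 8                      # tokens, prompt length, model width (all assumed)
hidden = rand_matrix(T, d, rng)        # embeddings of discrete speech units
w_q, w_k, w_v = (rand_matrix(d, d, rng) for _ in range(3))  # frozen backbone weights
prompt_k = rand_matrix(P, d, rng)      # trainable in prompt tuning
prompt_v = rand_matrix(P, d, rng)      # trainable in prompt tuning

out, weights = prompted_attention(hidden, w_q, w_k, w_v, prompt_k, prompt_v)
print(len(out), len(out[0]), len(weights[0]))
```

In this sketch every token can attend to the P prompt slots in addition to the T token positions, which is how a small set of trained vectors can redirect a frozen layer. In actual training, only `prompt_k` and `prompt_v` would receive gradient updates.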

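Experimental Results

Research Questions

RQ1: Are prompts effective at eliciting generation tasks from a textless speech LM?
RQ2: How well does a unified, parameter-efficient prompting framework perform across multiple speech generation tasks (translation, inpainting, and continuation)?
RQ3: What trade-offs arise when deep prompt tuning is used to steer a frozen speech LM toward speech generation?
RQ4: How well can SpeechGen transfer to future advanced speech LMs beyond Unit mBART?
RQ5: How feasible and efficient is textless speech generation under a small trainable-parameter budget (≈10M)?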

Key Findings

BLEU-1	BLEU-2	BLEU-3	BLEU-4
43.8	30.4	21.8	15.9
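
For readers unfamiliar with the BLEU-1 to BLEU-4 columns, the following is a minimal sketch of how such scores are commonly computed for a single sentence pair. It assumes that BLEU-n here means cumulative BLEU (geometric mean of modified n-gram precisions up to order n, times a brevity penalty), which the review does not state. It also does not reproduce the paper's evaluation pipeline, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of the hypothesis against one reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

def cumulative_bleu(hyp, ref, max_n):
    """Sentence-level cumulative BLEU up to order max_n, without smoothing."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
for n in (1, 2):
    print(f"BLEU-{n}: {cumulative_bleu(hyp, ref, n):.4f}")
```

Under this definition the score cannot increase as the order grows, because each additional precision is at most 1 and usually below the lower-order ones, which matches the decreasing pattern in the table.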

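Speech translation: SpeechGen with Unit mBART achieves BLEU-1 43.8, BLEU-2 30.4, BLEU-3 21.8, and BLEU-4 15.9.
Speech inpainting: SpeechGen reaches WER 25.42% and CER 13.85%, against a corrupted baseline of WER 27.96% and CER 13.47%. WER improves while CER is slightly higher than the baseline, so restoration is feasible but leaves room for improvement.
Speech continuation: results are reported as perplexity across conditioning ratios and as auto-BLEU; the continuations remain diverse and grammatically coherent relative to the seed segment.
The framework uses about 10M trainable parameters (the prompt vectors) and never updates the backbone LM, which underlines its efficiency and cost-effectiveness.
SpeechGen offers a textless, versatile, and transferable approach that can be applied to future speech LMs and tasks beyond the examples studied.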
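
The WER and CER figures above are error rates based on edit distance. A minimal sketch of the standard definitions follows, assuming reference and hypothesis transcripts are already available as plain strings. How the review's transcripts were obtained is not stated, and the helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(ref_text, hyp_text):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = ref_text.split()
    return edit_distance(ref_words, hyp_text.split()) / len(ref_words)

def cer(ref_text, hyp_text):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)

reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
print(f"WER: {wer(reference, hypothesis):.2%}  CER: {cer(reference, hypothesis):.2%}")
```

Both functions assume a non-empty reference. Because WER counts whole words and CER counts characters, the two can move in opposite directions, as they do in the inpainting result reported here.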

This review was generated by AI and checked by a human editor.