QUICK REVIEW

[Paper Review] SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts

Haibin Wu, Kai-Wei Chang|arXiv (Cornell University)|Jun 3, 2023

Topic Modeling | Cited by 10

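One-Sentence Summary

SpeechGen introduces a unified, textless prompting framework that tunes only the prompt vectors (≈10M parameters) to steer a speech language model toward a variety of speech generation tasks, enabling efficient speech-to-speech generation without updating the backbone model.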

ABSTRACT

Large language models (LLMs) have gained considerable attention for Artificial Intelligence Generated Content (AIGC), particularly with the emergence of ChatGPT. However, the direct adaptation of continuous speech to LLMs that process discrete tokens remains an unsolved challenge, hindering the application of LLMs for speech generation. The advanced speech LMs are in the corner, as that speech signals encapsulate a wealth of information, including speaker and emotion, beyond textual data alone. Prompt tuning has demonstrated notable gains in parameter efficiency and competitive performance on some speech classification tasks. However, the extent to which prompts can effectively elicit generation tasks from speech LMs remains an open question. In this paper, we present pioneering research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen, with around 10M trainable parameters. The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs, which will significantly enhance the capabilities of the framework. The code and demos of SpeechGen will be available on the project website: https://ga642381.github.io/SpeechPrompt/speechgen

Motivation and Objectives

Motivate the use of prompt tuning to elicit generation capabilities from speech language models (speech LMs).
Develop a unified, textless framework (SpeechGen) that handles multiple speech generation tasks with a small set of trainable prompts.
Showcase a case study using Unit mBART as the backbone LM for speech translation, inpainting, and continuation.
Demonstrate efficiency, transferability, and affordability of the prompt-tuning approach for future advanced speech LMs.

Proposed Method

Use soft prompts inserted at the input of a speech LM to steer generation without changing LM parameters.
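In a sequence-to-sequence LM, prepend encoder and decoder prompts to form the input z = [p^E, x; p^D, y], where p^E and p^D are the encoder and decoder prompts, x is the source unit sequence, and y is the target unit sequence.
Apply deep prompt tuning: in each Transformer layer's attention, replace the keys and values at the prompt positions with trainable prompt vectors, so the prompts influence attention at every layer rather than only at the input.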
Train only the prompt vectors using cross-entropy against target discrete units generated by a self-supervised model (e.g., HuBERT) to produce output units for a unit-based vocoder.
Maintain a textless pipeline where input waveform is converted to discrete units, processed by the prompting LM, then decoded back to waveform via a vocoder.
Demonstrate the framework on three tasks: speech translation, speech inpainting, and speech continuation, using Unit mBART as the backbone LM.
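
The deep prompt tuning step above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it uses single-head attention on toy 2-dimensional vectors, and it places the prompt keys and values in front of the frozen ones, in the style of prefix-tuning. The function names (attend, attend_with_prompts, count_prompt_params) are hypothetical, and the sizes passed to count_prompt_params (24 layers, prompt length 200, width 1024) are illustrative assumptions that do not come from this summary; they only show how a budget on the order of 10M can arise.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_prompts(query, keys, values, prompt_keys, prompt_values):
    """Attention where prompt keys/values sit in front of the frozen ones.

    Only prompt_keys and prompt_values would be trained; keys and values
    come from the frozen backbone and are never modified.
    """
    return attend(query, prompt_keys + keys, prompt_values + values)


def count_prompt_params(num_layers, prompt_len, d_model):
    """Trainable parameters when every layer has its own prompt keys and values."""
    return num_layers * 2 * prompt_len * d_model


# Toy frozen keys/values, standing in for the backbone's projections of x.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical prompt vectors (these would be the only trainable tensors).
prompt_keys = [[2.0, 2.0]]
prompt_values = [[5.0, 5.0]]

plain = attend(query, keys, values)
prompted = attend_with_prompts(query, keys, values, prompt_keys, prompt_values)
print("without prompts:", plain)
print("with prompts:   ", prompted)

# Illustrative sizes only (assumed, not reported in this summary).
print("prompt parameters:", count_prompt_params(24, 200, 1024))  # 9830400
```

The output for the same query changes once the prompt vectors are present, even though the frozen keys and values are untouched. That is the mechanism by which a small set of prompt parameters can steer a fixed backbone.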

Experimental Results

Research Questions

RQ1: Can prompts effectively elicit generation tasks from textless speech LMs?
RQ2: How well does a unified, parameter-efficient prompting framework perform across multiple speech generation tasks (translation, inpainting, continuation)?
RQ3: What are the trade-offs in using deep prompt tuning to steer a fixed speech LM for speech generation?
RQ4: How transferable is SpeechGen to future advanced speech LMs besides Unit mBART?
RQ5: What is the feasibility and efficiency of textless speech generation with a small trainable parameter budget (~10M)?

Key Findings

BLEU-1	BLEU-2	BLEU-3	BLEU-4
43.8	30.4	21.8	15.9

Speech translation performance achieved BLEU-1 43.8, BLEU-2 30.4, BLEU-3 21.8, BLEU-4 15.9 on Spanish→English using SpeechGen with Unit mBART.
Speech inpainting: SpeechGen reaches WER 25.42% and CER 13.85%, versus WER 27.96% and CER 13.47% for the corrupted baseline. WER improves while CER is marginally worse, so restoration is feasible but leaves room for improvement.
Speech continuation: results are reported as perplexity and auto-BLEU across varying conditioned ratios, indicating that the continuations stay diverse and grammatically coherent with the seed segments.
The framework uses approximately 10M trainable parameters (prompt vectors) and does not update the backbone LM, highlighting efficiency and affordability.
SpeechGen demonstrates a textless, versatile, and transferable approach applicable to future speech LMs and tasks beyond the studied examples.
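
For readers unfamiliar with the inpainting metrics, the following minimal sketch shows how WER and CER are conventionally computed from an edit distance. It assumes that reference and hypothesis transcripts are already available as plain text; how the paper obtains transcripts from generated speech is not described in this summary. The function names and the example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn ref into hyp, keeping only one row of the table at a time.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical transcripts, for illustration only.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"

print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because WER counts whole-word errors and CER counts character errors, the two can move in opposite directions: a system may fix some words outright while introducing small character-level slips elsewhere, which is consistent with the mixed WER/CER pattern reported above.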

This review was generated by AI and reviewed by human editors.