QUICK REVIEW

[Paper Review] Measuring the Redundancy of Decoder Layers in SpeechLLMs

Adel Moumen, Guangzhi Sun|arXiv (Cornell University)|Mar 5, 2026

Speech Recognition and Synthesis | Citations: 0

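One-sentence summary

Summary: This paper shows that most decoder redundancy in SpeechLLMs is inherited from the pretrained LLM, and that with joint healing, which adapts the decoder and the projector together, 7-8B models maintain good ASR performance while keeping only about 60% of their decoder layers (roughly 40% pruned); the same redundant blocks carry over to AST.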

ABSTRACT

Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned and multi-tasks SpeechLLM backbone to be deployed.

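Motivation and objectives

Determine whether the SpeechLLM decoder actually uses its full capacity for speech tasks (ASR/AST).
Characterize how decoder redundancy scales with model size and architecture.
Investigate post-pruning healing behavior and the role of projector adaptation.
Evaluate cross-task generalization of pruning patterns from ASR to AST.
Examine the feasibility of a global multi-task SpeechLLM backbone with a reduced decoder.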

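Proposed method

Apply the SLAM framework (speech encoder, projector, frozen LLM decoder) across two LLM families (Qwen2.5 and Llama 3.x) at three scales (1-1.5B, 3-4B, 7-8B).
Measure decoder redundancy using the angular distance between hidden states, identifying contiguous blocks that can be removed without extensive retraining.
Follow a pruning path that removes the block with the smallest angular distance, then heal the pruned model with LoRA adapters and, optionally, by unfreezing the projector.
To assess robustness and quantify pruning tolerance, train the projector, together with LoRA adapters on the decoder when healing is applied.
Evaluate ASR with WER on LibriSpeech and Loquacious, and AST with BLEU on CoVoST2 (Fr→En, En→De).
Compare text-only pruning paths with speech-derived pruning paths to test for cross-modal redundancy patterns.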
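
The block-selection step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the `hidden_states` layout (one entry per layer boundary, each a list of token vectors), the averaging over tokens, and the normalization of the angle by pi are all assumptions made for the example.

```python
import math


def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1] (assumed normalization)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.acos(cos) / math.pi


def most_redundant_block(hidden_states, n):
    """Find the block of n consecutive layers that changes the hidden state least.

    hidden_states[l] holds the token vectors entering layer l, so a model
    with L layers provides L + 1 entries (hypothetical layout).
    Returns (start_layer, mean_distance).
    """
    num_layers = len(hidden_states) - 1
    if not 1 <= n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    best_start, best_dist = None, float("inf")
    for start in range(num_layers - n + 1):
        before, after = hidden_states[start], hidden_states[start + n]
        dist = sum(angular_distance(u, v) for u, v in zip(before, after)) / len(before)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist


def drop_block(layers, start, n):
    """Return the layer list with layers start .. start + n - 1 removed."""
    return layers[:start] + layers[start + n:]
```

Given hidden states collected on a small calibration set, `most_redundant_block(hidden_states, n)` returns the starting layer of the candidate block, and `drop_block` removes it from a list of layers before healing.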

Figure 1: Angular distance between decoder layers $\ell$ and $\ell+n$, averaged over LibriSpeech dev-clean and dev-other. (a) Text-only input; (b) SLAM with frozen decoder; (c) SLAM with LoRA-adapted decoder; (d) SLAM Llama3.1-8B.

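Experimental results

Research questions

RQ1: Is decoder redundancy in SpeechLLMs inherited from the pretrained LLM backbone for both text and speech inputs?
RQ2: How many decoder layers can be pruned for ASR without exceeding a target degradation, and how does this scale with model size?
RQ3: How does post-pruning healing (decoder + projector) affect robustness to pruning?
RQ4: Do ASR-derived pruning patterns transfer to ASR+AST, indicating a global redundancy structure across tasks and languages?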

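Key findings

Decoder redundancy comes mainly from the pretrained LLM: the removable blocks are similar for text and speech inputs.
Larger models tolerate more pruning: 7-8B models maintain good ASR while retaining about 60% of their decoder layers (63.8% for Qwen2.5-7B, 65.05% at 3-4B, 86.5% at 1-1.5B).
Post-pruning healing is most effective when both the decoder and the projector are adapted (LoRA on the decoder plus projector healing).
The pruning blocks optimized for ASR and for AST closely match, and the same blocks are redundant across languages and speech encoders, suggesting a global redundancy structure.
Text-only pruning paths are nearly as effective as speech-derived paths, so pruning decisions can be made without fine-tuning a SpeechLLM.
A single pruned multi-task SpeechLLM backbone can serve both ASR and AST, since the two tasks share similar redundancy patterns.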
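
One simple way to check how closely two pruning blocks agree, for example a text-derived block against a speech-derived one, is to compare the sets of layer indices they remove. The sketch below uses the Jaccard index for this; that choice of metric and the `(start, n)` block representation are assumptions for illustration and are not stated in the review.

```python
def block_layers(start, n):
    """Set of layer indices covered by a block of n layers beginning at start."""
    if n < 1:
        raise ValueError("a block must contain at least one layer")
    return set(range(start, start + n))


def block_overlap(block_a, block_b):
    """Jaccard overlap in [0, 1] between two (start, n) pruning blocks."""
    a, b = block_layers(*block_a), block_layers(*block_b)
    return len(a & b) / len(a | b)
```

A value of 1.0 means both paths select exactly the same layers, and 0.0 means the blocks are disjoint.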

Figure 2: Relative WER degradation as a function of the fraction of decoder layers removed, for each evaluation set (columns) and model family (rows). The y-axis shows relative WER ($\Delta_{\mathrm{WER}}$) with respect to the unpruned baselines (Table 1); a value of $2.0$ means twice the baseline…
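
The relative WER described in the caption, and the pruning tolerance asked about in RQ2, can be computed as in the sketch below. It assumes relative WER is the ratio of pruned to baseline WER (consistent with a value of 2.0 meaning twice the baseline) and that tolerance is the largest removed fraction reached before the curve first exceeds the target. The function names and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer(pruned_wer, baseline_wer):
    """Ratio of pruned-model WER to unpruned baseline WER (2.0 = twice the baseline)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return pruned_wer / baseline_wer


def pruning_tolerance(wer_by_fraction, baseline_wer, max_relative_wer):
    """Largest removed fraction reached before relative WER first exceeds the target.

    wer_by_fraction maps the fraction of decoder layers removed to the WER
    measured after pruning (and healing, if applied).
    """
    tolerated = 0.0
    for fraction in sorted(wer_by_fraction):
        if relative_wer(wer_by_fraction[fraction], baseline_wer) > max_relative_wer:
            break
        tolerated = fraction
    return tolerated


# Made-up curve for illustration only.
example_curve = {0.1: 2.1, 0.2: 2.2, 0.3: 2.6, 0.4: 3.0, 0.5: 8.0}
example_tolerance = pruning_tolerance(example_curve, baseline_wer=2.0, max_relative_wer=1.5)
```

Stopping at the first violation, rather than taking the largest passing fraction overall, keeps a non-monotonic curve from reporting a tolerance that skips over a failing pruning level.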

This review was generated by AI and checked by a human editor.