QUICK REVIEW

[Paper Review] TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition

Cheng-Yeh Yang, Chien-Chun Wang | arXiv (Cornell University) | Feb 25, 2026

Speech Recognition and Synthesis | Citations: 0

ONE-LINE SUMMARY

TG-ASR improves low-resource Taiwanese Hokkien ASR through translation-guided learning and PGCA, and releases the YT-THDC corpus with Mandarin subtitles.

ABSTRACT

Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.

MOTIVATION AND OBJECTIVES

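Address the shortage of transcribed data for low-resource languages in ASR, with Taiwanese Hokkien as the target language.
Propose translation-guided learning that integrates multilingual translation embeddings through PGCA.
Evaluate auxiliary languages and quantify their effect on ASR performance.
Release a new 30-hour Taiwanese Hokkien drama corpus aligned with Mandarin subtitles to serve as a benchmark.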

PROPOSED METHOD

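Two-stage training on Whisper Small: stage 1 fine-tunes the encoder and decoder; stage 2 freezes the encoder and fine-tunes the PGCA layers.
Multilingual translation embeddings are extracted with a frozen mBERT from the translated auxiliary transcriptions (produced by SeamlessM4T) and integrated into the Whisper decoder through PGCA.
PGCA computes $Y' = Y + \sum_{l} \tanh(\alpha_{\mathrm{attn}}^{(l)}) \, \mathrm{Attn}(Y, E_l, E_l)$ followed by $Z = Y' + \tanh(\alpha_{\mathrm{FNN}}) \, \mathrm{FNN}(Y')$, where $Y$ is the decoder-side input, $E_l$ is the translation embedding of auxiliary language $l$, and the $\alpha$ parameters are learnable and initialized to 0.
The PGCA module is placed at the head of the decoder block, so multilingual context is injected early in decoding.
For $L$ auxiliary languages, parallel cross-attention modules are used, with an independent attention branch and gate for each language.
Evaluation uses character error rate (CER) under teacher-forced decoding, with ablations that analyze the individual PGCA components.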
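
The gated update described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head with identity projections, the feed-forward network is a stand-in elementwise ReLU with no learned weights, and the names `pgca`, `attention`, `ffn`, `E_mandarin`, and `E_spanish` are hypothetical. What it does show is the role of the zero-initialized gates: because tanh(0) = 0, the block starts out as an identity map and only gradually lets auxiliary-language context in as the gates are trained.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Y, E):
    """Attn(Y, E, E): single-head scaled dot-product attention.
    Assumption: identity query/key/value projections, for brevity."""
    d = len(Y[0])
    out = []
    for q in Y:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in E]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, E)) for j in range(d)])
    return out

def ffn(Y):
    """Stand-in feed-forward network (assumption): elementwise ReLU."""
    return [[max(0.0, x) for x in row] for row in Y]

def pgca(Y, aux_embeddings, alpha_attn, alpha_ffn):
    """Parallel gated cross-attention over several auxiliary languages."""
    # Y' = Y + sum_l tanh(alpha_attn[l]) * Attn(Y, E_l, E_l)
    # Every branch attends from the same Y, so the branches are parallel.
    Y_prime = [row[:] for row in Y]
    for E, alpha in zip(aux_embeddings, alpha_attn):
        gate = math.tanh(alpha)
        branch = attention(Y, E)
        for i, row in enumerate(branch):
            for j, value in enumerate(row):
                Y_prime[i][j] += gate * value
    # Z = Y' + tanh(alpha_ffn) * FNN(Y')
    gate = math.tanh(alpha_ffn)
    hidden = ffn(Y_prime)
    return [[y + gate * h for y, h in zip(y_row, h_row)]
            for y_row, h_row in zip(Y_prime, hidden)]

# Toy inputs (hypothetical): two decoder positions, two auxiliary languages.
Y = [[1.0, 0.0], [0.0, 1.0]]
E_mandarin = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
E_spanish = [[-1.0, 0.0]]

# With all gates at their initial value of 0, the block is an identity map.
Z_identity = pgca(Y, [E_mandarin, E_spanish], [0.0, 0.0], 0.0)
```

Keeping one gate per language means a branch that carries little useful signal can stay near zero without affecting the others, which is one way to read the review's point about limiting interference between languages.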

Figure 1: Illustration of the Taiwanese Hokkien drama subtitles. (a) A scene with spoken Taiwanese Hokkien and existing Mandarin subtitles enclosed in a blue box (the subtitle means “How could he possibly get involved in such a thing?” in English). Images were adapted from publicly

EXPERIMENTAL RESULTS

Research Questions

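RQ1: Do translation-guided learning and PGCA improve ASR for a low-resource language (Taiwanese Hokkien) when auxiliary multilingual translations are available?
RQ2: Which auxiliary languages, or combinations of them, improve Taiwanese Hokkien ASR most effectively?
RQ3: How does PGCA compare with other strategies for fusing multilingual embeddings (addition, concatenation, sequential/shared attention)?
RQ4: How does the number of auxiliary languages affect ASR performance, and is there an optimal subset?
RQ5: Does the quality of the translation model (SeamlessM4T vs. NLLB) affect the ASR improvement?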

Key Findings

Auxiliary language(s)	CER (%)	Relative CER reduction (%)
None (baseline)	13.40	-
Mandarin (GT)	11.87	11.42
Hindi	13.17	1.72
English	13.10	2.24
French	12.98	3.13
Spanish	12.84	4.18
Mandarin (GT) + Spanish	11.42	14.77
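
The two numeric columns of the table can be checked with a short script. The sketch below assumes the standard definition of character error rate (character-level edit distance divided by reference length) and computes relative reduction as (baseline − system) / baseline; the review does not state the paper's exact text normalization, and the function names are hypothetical. From the rounded CERs, the single-language rows reproduce the reported reductions exactly, while the Mandarin (GT) + Spanish row computes to about 14.78 rather than the reported 14.77, a gap that would be consistent with the paper working from unrounded CERs (an assumption).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, system):
    """Relative CER reduction in percent."""
    return 100.0 * (baseline - system) / baseline

BASELINE = 13.40
# (CER %, reported relative reduction %) copied from the table above.
REPORTED = {
    "Mandarin (GT)": (11.87, 11.42),
    "Hindi": (13.17, 1.72),
    "English": (13.10, 2.24),
    "French": (12.98, 3.13),
    "Spanish": (12.84, 4.18),
    "Mandarin (GT) + Spanish": (11.42, 14.77),
}

for name, (system_cer, reported_rel) in REPORTED.items():
    computed = relative_reduction(BASELINE, system_cer)
    print(f"{name}: computed {computed:.2f}, reported {reported_rel:.2f}")
```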

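The full PGCA model, with auxiliary languages drawn from the five candidates (Mandarin, Hindi, English, French, Spanish), reaches its best CER of 11.42% on YT-THDC with Mandarin (GT) + Spanish, a 14.77% relative reduction over the 13.40% baseline.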
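Among single-language cues, Mandarin (GT) yields a CER of 11.87%, making it the strongest single-language source of supervision.
Machine-translated languages (Hindi, English, French, Spanish) also beat the baseline; among them, Spanish performs best (CER 12.84%).
The best two-language combination (Mandarin + Spanish) produces the largest CER reduction; adding further languages yields diminishing gains, though results remain better than single-language supervision.
Ablations show that tanh gating, parallel multi-branch attention, and language-specific independent attention are all beneficial, whereas simple addition or concatenation degrades performance.
Auxiliary translations from SeamlessM4T give a better CER than those from NLLB (11.42% in configuration A6), suggesting that translation quality affects the guidance.
Visualizations of the multilingual cross-attention show token-level alignment between Mandarin and Taiwanese Hokkien, supporting the effectiveness of translation-guided supervision.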

Figure 2: The architecture of the proposed TG-ASR, which leverages our novel parallel gated cross-attention (PGCA) mechanism to integrate multilingual translated transcription inputs for improved knowledge transfer in ASR.

This review was generated by AI and checked by a human editor.