[Paper Walkthrough] TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
TG-ASR introduces translation-guided learning with parallel gated cross-attention (PGCA) to improve Taiwanese Hokkien ASR in low-resource settings and releases the YT-THDC 30-hour corpus, achieving a 14.77% relative CER reduction.
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
Research Motivation and Objectives
- Address the scarcity of transcribed data for low-resource languages (Taiwanese Hokkien) in ASR.
- Propose translation-guided learning with multilingual translation embeddings integrated via PGCA.
- Evaluate auxiliary languages and quantify their impact on ASR performance.
- Release a new 30-hour Taiwanese Hokkien drama corpus aligned with Mandarin subtitles for benchmarking.
Proposed Method
- Two-stage training on Whisper Small; first stage fine-tunes encoder and decoder, second stage freezes encoder and fine-tunes PGCA layers.
- Extract multilingual translation embeddings using frozen mBERT from translated auxiliary transcriptions (SeamlessM4T translations) and integrate them into the Whisper decoder via PGCA.
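- The PGCA mechanism computes $Y' = Y + \sum_{l=1}^{L} \tanh\left(\alpha_{\mathrm{attn}}^{(l)}\right) \cdot \mathrm{Attn}(Y, E_l, E_l)$ and then $Z = Y' + \tanh\left(\alpha_{\mathrm{FNN}}\right) \cdot \mathrm{FNN}(Y')$, where $Y$ is the decoder hidden state, $E_l$ is the translation embedding of the $l$-th auxiliary language, and the $\alpha$ parameters are learnable and initialized to zero. Because $\tanh(0) = 0$, every gate starts closed and the module initially passes $Y$ through unchanged.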
- Place PGCA modules at the start of each Whisper decoder block to inject multilingual context early in decoding.
- Use parallel cross-attention modules for L auxiliary languages, allowing independent attention branches and gating for each language.
- Evaluate using character error rate (CER) under teacher-forcing decoding and perform ablations to analyze PGCA components.
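
The gating equations above can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `PGCA`, the 4x feed-forward width, the GELU activation, and the absence of layer normalization are all choices made here for brevity, and the auxiliary embeddings are assumed to arrive as precomputed tensors (for example, from a frozen mBERT).

```python
import torch
import torch.nn as nn


class PGCA(nn.Module):
    """Sketch of parallel gated cross-attention over several auxiliary languages.

    Hypothetical layout: only the two gating equations come from the summary;
    layer sizes and activation are assumptions for illustration.
    """

    def __init__(self, d_model: int, n_heads: int, n_langs: int, d_aux: int):
        super().__init__()
        # One independent cross-attention branch per auxiliary language.
        # kdim/vdim let the auxiliary embedding width differ from d_model.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(
                d_model, n_heads, kdim=d_aux, vdim=d_aux, batch_first=True
            )
            for _ in range(n_langs)
        )
        # Gates are initialized to zero, so tanh(alpha) = 0 at the start.
        self.alpha_attn = nn.Parameter(torch.zeros(n_langs))
        self.alpha_ffn = nn.Parameter(torch.zeros(1))
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, y: torch.Tensor, aux: list) -> torch.Tensor:
        # y:   (batch, T, d_model) decoder hidden states
        # aux: list of (batch, S_l, d_aux) tensors, one per auxiliary language
        y_prime = y
        for l, e in enumerate(aux):
            # Every branch attends from the same input Y (parallel, not sequential).
            out, _ = self.attn[l](y, e, e, need_weights=False)
            y_prime = y_prime + torch.tanh(self.alpha_attn[l]) * out
        return y_prime + torch.tanh(self.alpha_ffn) * self.ffn(y_prime)


# Usage with random stand-in tensors: two auxiliary languages of different lengths.
layer = PGCA(d_model=16, n_heads=4, n_langs=2, d_aux=8)
hidden = torch.randn(2, 5, 16)
embeddings = [torch.randn(2, 7, 8), torch.randn(2, 3, 8)]
output = layer(hidden, embeddings)
```

With zero-initialized gates the module returns its input unchanged, so inserting it into an already fine-tuned decoder does not disturb the first-stage model; the gates open only as training moves the `alpha` values away from zero.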

Experimental Results
Research Questions
- RQ1: Does translation-guided learning with PGCA improve ASR for a low-resource language (Taiwanese Hokkien) when auxiliary multilingual translations are available?
- RQ2: Which auxiliary languages (and combinations) most effectively improve Taiwanese Hokkien ASR?
- RQ3: How does PGCA compare to other fusion strategies (addition, concatenation, sequential/shared attention) for multilingual embeddings?
- RQ4: How does the number of auxiliary languages affect ASR performance, and is there an optimal subset?
- RQ5: Does the translation model quality (SeamlessM4T vs. NLLB) influence ASR gains?
Key Findings
| Auxiliary language | CER (%) | Relative CER reduction (%) |
|---|---|---|
| None (baseline) | 13.40 | - |
| Mandarin (GT) | 11.87 | 11.42 |
| Hindi | 13.17 | 1.72 |
| English | 13.10 | 2.24 |
| French | 12.98 | 3.13 |
| Spanish | 12.84 | 4.18 |
| Mandarin (GT) + Spanish | 11.42 | 14.77 |
- The best PGCA configuration achieves a CER of 11.42% on YT-THDC, a 14.77% relative reduction over the 13.40% baseline; in the table above this is the Mandarin (GT) + Spanish row.
- Among single-language cues, the ground-truth (GT) Mandarin subtitles yield a CER of 11.87% and are the strongest single-language supervisor.
- Using translated languages (Hindi, English, French, Spanish) improves over baseline, with Spanish performing best among translated ones (CER 12.84%).
- The best two-language combination (Mandarin + Spanish) produces the strongest CER reduction; adding more languages yields diminishing returns but remains better than single-language supervision.
- Ablation shows tanh gating, parallel multi-branch attention, and independent language-specific attention are beneficial; simple addition or concatenation degrades performance.
- SeamlessM4T-derived auxiliary translations yield better CER (11.42% with A6) than NLLB-derived ones (11.52%), indicating translation quality affects guidance.
- Cross-lingual attention visualizations reveal token-level alignment between Mandarin and Taiwanese Hokkien, supporting effective translation-guided supervision.
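
To make the two metrics in the table concrete, here is a short pure-Python sketch of character error rate and relative reduction. The function names are illustrative, and the CER definition (character-level Levenshtein distance divided by reference length) is the standard one rather than something stated in the summary. Recomputing the relative reductions from the rounded CER values reproduces the table to within about 0.01 percentage points; the best row comes out near 14.78 instead of the reported 14.77, presumably because the paper worked from unrounded CERs.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete a reference character
                    curr[j - 1] + 1,         # insert a hypothesis character
                    prev[j - 1] + (r != h),  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative CER reduction in percent with respect to the baseline CER."""
    return 100.0 * (baseline - system) / baseline


# Recompute the last column of the table from the rounded CER values.
BASELINE_CER = 13.40
table_cer = {
    "Mandarin (GT)": 11.87,
    "Hindi": 13.17,
    "English": 13.10,
    "French": 12.98,
    "Spanish": 12.84,
    "Mandarin (GT) + Spanish": 11.42,
}
recomputed = {
    name: relative_reduction(BASELINE_CER, value) for name, value in table_cer.items()
}
```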

This walkthrough was generated by AI and reviewed by a human editor.