QUICK REVIEW

[Paper Review] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR

Zilai Wang, Natarajan Balaji Shankar|arXiv (Cornell University)|Jan 28, 2026
Speech Recognition and Synthesis|Cited by 0
One-Sentence Summary

Fusing fine-tuned embeddings from one SSL model with delta embeddings (fine-tuned minus pre-trained) from another improves child ASR, setting a new state of the art among SSL models on the MyST corpus (WER 9.64) and yielding notable gains in ultra-low-resource settings.

ABSTRACT

Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a finetuned model and those from its pretrained counterpart, encode task-specific information that complements finetuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to finetuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.

Research Motivation and Objectives

  • Investigate whether delta SSL embeddings (differences between fine-tuned and pre-trained embeddings) can capture task-specific information for child ASR.
  • Examine whether fusing delta embeddings with fine-tuned embeddings from different SSL encoders yields complementary representations.
  • Identify fusion strategies that maximize performance in low-resource and few-shot regimes for child speech recognition.
  • Provide analytical insight into why delta embeddings improve fusion via representational similarity analyses.

Proposed Method

  • Define delta embeddings as E_delta = E_ft - E_pt for each SSL model.
  • Fuse fine-tuned embeddings from one model with delta embeddings from another using Concat, Weighted, and Cross-Attention strategies.
  • Remove the upper CTC layer from the fine-tuned model and train a new linear CTC head on frozen fused features.
  • Evaluate on the MyST corpus with full and 1h, 5h, 10h low-resource subsets.
  • Use Canonical Correlation Analysis (CCA), including its projection-weighted variant (PWCCA), to assess representational similarity between fine-tuned, pre-trained, and delta embeddings.
  • Analyze Mixture-of-Experts (MoE) gating to interpret frame-level contributions of each embedding type.
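
The method steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, the random stand-in embeddings, and the scalar `alpha` in the weighted variant are all hypothetical, and the review does not specify how the Weighted or Cross-Attention strategies are parameterized. The sketch also assumes both encoders emit frame-aligned embeddings of equal length.

```python
import numpy as np


def delta_embedding(e_ft, e_pt):
    """E_delta = E_ft - E_pt, computed frame by frame for one SSL model."""
    if e_ft.shape != e_pt.shape:
        raise ValueError("fine-tuned and pre-trained embeddings must align")
    return e_ft - e_pt


def fuse_concat(e_main, e_delta):
    """Concat fusion: stack the two embeddings along the feature axis."""
    return np.concatenate([e_main, e_delta], axis=-1)


def fuse_weighted(e_main, e_delta, alpha=0.5):
    """Weighted fusion, assumed here to be a convex combination.

    alpha is a hypothetical scalar; in practice it would be learned.
    """
    return alpha * e_main + (1.0 - alpha) * e_delta


def linear_ctc_head(features, weight, bias):
    """Linear layer followed by log-softmax, as fed to a CTC loss."""
    logits = features @ weight + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


# Toy stand-ins for frozen encoder outputs: T frames, D dims, V output tokens.
rng = np.random.default_rng(0)
T, D, V = 50, 16, 32
wavlm_ft = rng.normal(size=(T, D))                  # fine-tuned model A
w2v2_pt = rng.normal(size=(T, D))                   # pre-trained model B
w2v2_ft = w2v2_pt + 0.1 * rng.normal(size=(T, D))   # fine-tuned model B

delta_w2v2 = delta_embedding(w2v2_ft, w2v2_pt)
fused = fuse_concat(wavlm_ft, delta_w2v2)           # shape (T, 2 * D)

# Only this head would be trained; the fused features stay frozen.
W = rng.normal(scale=0.01, size=(fused.shape[-1], V))
b = np.zeros(V)
log_probs = linear_ctc_head(fused, W, b)            # shape (T, V)
```

Concatenation doubles the feature dimension seen by the CTC head, whereas the weighted variant keeps it at `D` but requires both embeddings to share a dimensionality.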
Fig. 1: CCA similarity between pre-trained and fine-tuned models.

Experimental Results

Research Questions

  • RQ1: Can delta embeddings capture task-specific shifts that complement fine-tuned SSL representations for child ASR?
  • RQ2: Which fusion strategy (Concat, Weighted, Cross-Attn) best leverages delta embeddings in child ASR?
  • RQ3: Do delta embeddings provide greater gains in ultra-low-resource scenarios (e.g., 1 hour of training data)?
  • RQ4: How do delta embeddings affect inter-model complementarity, as measured by CCA/PWCCA and MoE gating?
  • RQ5: Do delta embeddings obtained by fine-tuning on non-child data carry information that transfers to child ASR?

Key Findings

  • Concatenation consistently outperforms Weighted and Cross-Attention fusion when combining WavLM with delta embeddings, across data regimes.
  • Delta W2V2 fusion with WavLM achieves the best results, including a WER of 9.64 on full MyST data (state-of-the-art among SSL models on MyST).
  • Delta HuBERT fusion also yields significant gains, especially in low-resource settings (e.g., 1h), with up to a 10% relative WER reduction versus fusing fine-tuned HuBERT embeddings.
  • Delta embeddings provide additional gains in the 1h, 5h, 10h, and full-data settings, with a notable 4.4% relative WER reduction for Delta W2V2 in 1h.
  • Cross-domain deltas (LibriSpeech-tuned) improve over baselines, confirming task-specific information in deltas and some transferability to child ASR.
  • MoE analysis shows both fine-tuned and delta embeddings contribute substantially, with W2V2 offering greater complementarity to WavLM than HuBERT.
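
The findings above are stated as WER and relative WER reduction. As a reminder of how those two quantities relate, here is a small sketch; the function names are hypothetical, the example sentences and numbers are toy values chosen for illustration, and none of them come from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative reduction in percent: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy example: one substitution and one deletion against six reference words.
toy_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Toy example: a drop from 20.0 to 18.0 WER is a 10% relative reduction.
toy_reduction = relative_wer_reduction(20.0, 18.0)
```

A relative reduction is always measured against a specific baseline, so the 10% and 4.4% figures should be read against the fine-tuned embedding fusion baselines named in the abstract rather than as absolute WER differences.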
Fig. 2: CCA similarity between fine-tuned and $\Delta$ embeddings.
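
For readers unfamiliar with the similarity measure in the figures, the following is a minimal sketch of plain mean CCA similarity between two sets of frame-level embeddings. It is not PWCCA, which additionally weights each canonical direction by how much of the original representation it accounts for, and it is not taken from the paper. The function name and the random toy data are hypothetical, and the sketch assumes more frames than dimensions with full-column-rank inputs.

```python
import numpy as np


def mean_cca_similarity(x, y):
    """Mean canonical correlation between two (frames, dims) matrices."""
    if x.shape[0] != y.shape[0]:
        raise ValueError("both inputs must have the same number of frames")
    # Center each feature so correlations are not driven by the means.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Orthonormal bases for the two column spaces.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # Singular values of Qx^T Qy are the canonical correlations.
    corrs = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())


# Toy data: 500 frames, 8 dimensions.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 8))
remixed = pretrained @ mixing            # same subspace, different basis
unrelated = rng.normal(size=(500, 8))    # independent features

sim_same = mean_cca_similarity(pretrained, remixed)
sim_diff = mean_cca_similarity(pretrained, unrelated)
```

Because CCA is invariant to invertible linear maps, `sim_same` comes out close to 1 even though the two matrices differ entry by entry, while `sim_diff` stays low. That invariance is what makes the measure usable for comparing embeddings from different encoders.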

This review was generated by AI and reviewed by human editors.