QUICK REVIEW

[Paper Review] WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

Aurchi Chowdhury, Rubaiyat -E-Zaman|arXiv (Cornell University)|Mar 5, 2026

Speech Recognition and Synthesis | Cited by 0

One-Sentence Summary

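The paper proposes a two-stage pipeline for long-form Bengali ASR and speaker diarization. It combines word-boundary-aware chunking with domain-adapted exclusive speaker diarization, intersected with WhisperX VAD, to reach strong WER and DER in a low-resource setting.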

ABSTRACT

This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational dynamics and accurately resolve complex, overlapping speaker boundaries. Our methodology demonstrates that applying intelligent timestamped chunking to ASR and targeted segmentation fine-tuning to diarization significantly drives down Word Error Rate (WER) and Diarization Error Rate (DER) in low-resource settings.

Research Motivation and Goals

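Address the challenges of long-form Bengali speech recognition and speaker diarization in a low-resource setting.
Develop a self-contained, word-boundary-aware chunking pipeline that feeds precise segments into a Bengali Whisper model.
Fine-tune the Bengali Whisper checkpoint on competition-relevant segments to improve WER.
Adapt Pyannote segmentation to Bengali prosody and implement exclusive overlap handling in a fast inference pipeline.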

Proposed Method

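Silero VAD identifies speech regions so that chunk boundaries do not truncate speech.
Whisper-timestamped derives per-word timestamps from cross-attention heads, enabling word-boundary alignment.
Difflib-based alignment transfers the ground-truth timestamps onto the Whisper transcript and interpolates missing anchors.
Audio is chunked into 28-second segments that respect word boundaries, and chunks of 20–28 s are retained for fine-tuning.
bengaliAI/tugstugi_bengaliai-asr_whisper-medium is fine-tuned end to end with teacher forcing for 5 epochs.
Inference uses VAD-guided parallel processing, and post-processing filters remove repetitions and English boilerplate phrases.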
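
The chunking step listed above can be illustrated with a greedy pass over word-level timestamps. This is a minimal sketch rather than the authors' implementation: the `Word` tuple, the function names `chunk_words` and `keep_for_finetuning`, and the greedy packing rule are assumptions made for illustration. Only the 28 s ceiling and the 20–28 s retention window come from the summary.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


def chunk_words(words: List[Word], max_len: float = 28.0) -> List[List[Word]]:
    """Greedily pack words into chunks no longer than max_len seconds.

    A chunk is only ever closed between two words, so no word is cut.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current chunk if adding this word would exceed max_len.
        if current and word.end - current[0].start > max_len:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


def keep_for_finetuning(chunks, min_len=20.0, max_len=28.0):
    """Keep only chunks whose duration falls in [min_len, max_len] seconds."""
    kept = []
    for chunk in chunks:
        duration = chunk[-1].end - chunk[0].start
        if min_len <= duration <= max_len:
            kept.append(chunk)
    return kept
```

In this sketch a short trailing chunk is produced but then filtered out by `keep_for_finetuning`, which mirrors the idea of training only on chunks close to the model's input window.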

Figure 1: End-to-end training data pipeline: from raw long-form audio to aligned, boundary-respecting chunks for fine-tuning.
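
The alignment stage of this data pipeline relies on Difflib to carry timestamps from one word sequence to another. The sketch below shows one plausible way to do that with `difflib.SequenceMatcher`: matched words act as anchors and copy their counterpart's time, and unmatched words are linearly interpolated between neighbouring anchors. The function name `transfer_timestamps`, the input format, and the interpolation rule are assumptions; the summary does not give these details.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def transfer_timestamps(
    timed_words: List[Tuple[str, float]],
    target_words: List[str],
) -> List[float]:
    """Assign a start time to each target word.

    Matching words (anchors) copy the time of their counterpart; unmatched
    words are linearly interpolated between the surrounding anchors.
    """
    source_tokens = [w for w, _ in timed_words]
    times: List[Optional[float]] = [None] * len(target_words)

    matcher = SequenceMatcher(a=source_tokens, b=target_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            times[block.b + k] = timed_words[block.a + k][1]

    anchors = [i for i, t in enumerate(times) if t is not None]
    if not anchors:
        raise ValueError("no matching words; cannot anchor timestamps")

    # Words before the first or after the last anchor reuse that anchor's time.
    for i in range(anchors[0]):
        times[i] = times[anchors[0]]
    for i in range(anchors[-1] + 1, len(times)):
        times[i] = times[anchors[-1]]

    # Linear interpolation between consecutive anchors.
    for left, right in zip(anchors, anchors[1:]):
        t_left, t_right = times[left], times[right]
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            times[i] = t_left + frac * (t_right - t_left)
    return times
```

The direction of transfer is left generic here: `timed_words` is whichever sequence carries timestamps and `target_words` is the one that needs them.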

Experimental Results

Research Questions

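RQ1: Do word-boundary-aware chunking and frame-aligned timestamps reduce hallucinations and preserve context in long-form Bengali ASR?
RQ2: Does domain-adaptive fine-tuning on boundary-aware segments improve the WER of Bengali Whisper?
RQ3: Can Bengali-adapted Pyannote diarization with exclusive overlap handling keep DER low without violating the competition's non-overlap requirement?
RQ4: Does intersecting WhisperX VAD with Pyannote output reduce temporal drift and ambient hallucinations in diarization?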

Key Findings

System	Public WER	Private WER
tugstugi — raw, no processing	0.675	0.702
+ VAD + post-processing	0.419	0.440
+ Unicode normalization	0.348	0.375
+ Fine-tuned (our chunking strategy)	0.265	0.296
+ Manual data cleaning (final)	0.252	0.278
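
The contribution of each stage can be read off the table by differencing consecutive rows. The short script below does only that arithmetic. The WER values are copied from the table; the variable and function names are illustrative.

```python
# (stage, public WER, private WER), copied from the results table.
rows = [
    ("tugstugi raw", 0.675, 0.702),
    ("+ VAD + post-processing", 0.419, 0.440),
    ("+ Unicode normalization", 0.348, 0.375),
    ("+ Fine-tuned", 0.265, 0.296),
    ("+ Manual data cleaning", 0.252, 0.278),
]


def stage_reductions(rows, column):
    """Return (stage, absolute drop, relative drop) for each added stage."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        absolute = prev[column] - cur[column]
        relative = absolute / prev[column]
        out.append((cur[0], round(absolute, 3), round(relative, 3)))
    return out


# column=1 selects the public split, column=2 the private split.
for stage, absolute, relative in stage_reductions(rows, column=1):
    print(f"{stage}: -{absolute:.3f} WER ({relative:.1%} relative)")
```

On the public split the absolute drops are 0.256, 0.071, 0.083, and 0.013, so the VAD and post-processing stage accounts for the largest single reduction and fine-tuning for the second largest.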

Across the pipeline stages, WER falls from 0.675 to 0.252 on the public leaderboard and from 0.702 to 0.278 on the private leaderboard.
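Fine-tuning on chunk-aligned data lowers WER from 0.348 to 0.265 (public) and from 0.375 to 0.296 (private), the largest single-step reduction after the VAD and post-processing stage.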
Combining VAD with post-processing yields a large early gain: WER drops from 0.675 to 0.419 (public) and from 0.702 to 0.440 (private).
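Exclusive speaker diarization with the community-1 base model substantially improves diarization performance over the baseline Pyannote 3.1.
Intersecting WhisperX VAD with Pyannote output eliminates boundary drift and reduces hallucinations in diarization.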
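
The VAD intersection in the last finding amounts to clipping each diarization segment to the regions that a VAD marks as speech. The sketch below shows that interval logic in plain Python. It does not call WhisperX or pyannote.audio; the tuple formats and the function name `intersect_with_vad` are assumptions for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]   # (start, end, speaker)
Region = Tuple[float, float]         # (start, end) speech region from a VAD


def intersect_with_vad(segments: List[Segment], speech: List[Region]) -> List[Segment]:
    """Clip speaker segments to VAD speech regions, dropping what falls outside."""
    clipped: List[Segment] = []
    for seg_start, seg_end, speaker in segments:
        for vad_start, vad_end in speech:
            start = max(seg_start, vad_start)
            end = min(seg_end, vad_end)
            if end > start:  # keep only non-empty overlaps
                clipped.append((start, end, speaker))
    return sorted(clipped)
```

Any part of a speaker segment that lies in a region the VAD considers non-speech is discarded, which is one way spurious speaker activity over silence or noise could be suppressed.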

Figure 2: Proposed parallel diarization architecture
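
Exclusive overlap handling means the final timeline has at most one active speaker at any moment. The sketch below illustrates the idea on plain tuples. The tie-breaking rule (the most recently started segment wins an overlap) and the function name `make_exclusive` are assumptions; the summary does not say how the pipeline resolves overlaps.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start, end, speaker)


def make_exclusive(segments: List[Segment]) -> List[Segment]:
    """Return a timeline in which at most one speaker is active at a time.

    Assumed rule: in an overlap, the segment that started most recently wins.
    """
    # Every segment boundary splits the timeline into elementary intervals.
    bounds = sorted({t for s, e, _ in segments for t in (s, e)})
    result: List[Segment] = []
    for left, right in zip(bounds, bounds[1:]):
        active = [seg for seg in segments if seg[0] <= left and seg[1] >= right]
        if not active:
            continue
        speaker = max(active, key=lambda seg: seg[0])[2]
        # Merge with the previous piece when the same speaker continues.
        if result and result[-1][2] == speaker and result[-1][1] == left:
            result[-1] = (result[-1][0], right, speaker)
        else:
            result.append((left, right, speaker))
    return result
```

For example, a long turn by one speaker with a brief interjection by another is split into three non-overlapping pieces, with the interjection taking the middle one.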

This review was generated by AI and checked by human editors.