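[Paper Digest] Lost in Transcription: How Speech-to-Text Errors Derail Code Understanding
The paper proposes a multilingual, speech-based framework that transcribes code-related queries spoken in Indic languages, refines the transcripts with a large language model (LLM), and evaluates them on code understanding tasks. LLM-guided refinement yields significant improvements in both ASR output and downstream code tasks.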
Code understanding is a foundational capability in software engineering tools and developer workflows. However, most existing systems are designed for English-speaking users interacting via keyboards, which limits accessibility in multilingual and voice-first settings, particularly in regions like India. Voice-based interfaces offer a more inclusive modality, but spoken queries involving code present unique challenges due to the presence of non-standard English usage, domain-specific vocabulary, and custom identifiers such as variable and function names, often combined with code-mixed expressions. In this work, we develop a multilingual speech-driven framework for code understanding that accepts spoken queries in a user's native language, transcribes them using Automatic Speech Recognition (ASR), applies code-aware ASR output refinement using Large Language Models (LLMs), and interfaces with code models to perform tasks such as code question answering and code retrieval through benchmarks such as CodeSearchNet, CoRNStack, and CodeQA. Focusing on four widely spoken Indic languages and English, we systematically characterize how transcription errors impact downstream task performance. We also identify key failure modes in ASR for code and demonstrate that LLM-guided refinement significantly improves performance across both transcription and code understanding stages. Our findings underscore the need for code-sensitive adaptations in speech interfaces and offer a practical solution for building robust, multilingual voice-driven programming tools.
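Research Motivation and Goals
- Make multilingual, speech-based code understanding more accessible to learners in regions with limited English proficiency.
- Build an end-to-end pipeline that starts from speech in the user's native language and ends with code understanding results.
- Characterize how transcription errors affect downstream code understanding tasks, and identify ASR failure modes in code-related speech.
- Show that LLM-guided refinement of transcripts improves both transcription fidelity and downstream code understanding.
- Evaluate comprehensively across languages, datasets, and downstream tasks to guide the design of code-aware speech interfaces.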
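Proposed Method
- A multilingual speech-driven framework that accepts queries in the user's native language, transcribes them with ASR, refines the transcript with code-oriented LLM prompts, and queries code models for question answering and retrieval.
- Code-aware transcript refinement that uses a prompt-engineered GPT-4o-mini to restore misrecognized code terms, correct pronunciation-induced distortions, and distinguish natural-language usage from programming-language usage.
- Per-language ASR model selection (Whisper for English, indic-conformer for Indic languages) to better handle code-mixed, code-related speech.
- Translation-aware preprocessing and TTS formatting to produce natural-language output and audio feedback in multilingual settings.
- Evaluation on CodeSearchNet, CoRNStack, and CodeQA across Hindi, Gujarati, Tamil, Bengali, and English, covering Python, Java, and PHP.
- Downstream evaluation uses Recall@k and MRR for code retrieval and model-based evaluation for question answering.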
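The retrieval metrics named in the last item can be sketched as follows. This is a minimal illustration that assumes one relevant snippet per query, as is common for CodeSearchNet-style retrieval; the digest does not describe the paper's evaluation code, and the snippet IDs, the `evaluate` helper, and the sample rankings are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], relevant: str, k: int) -> float:
    """Return 1.0 if the relevant item is among the top-k results, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: str) -> float:
    """Return 1/rank of the relevant item (1-indexed), or 0.0 if it is absent."""
    for position, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], str]], k: int) -> Dict[str, float]:
    """Average Recall@k and reciprocal rank (MRR) over all queries."""
    if not runs:
        raise ValueError("runs must contain at least one query")
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Hypothetical rankings: (retrieved snippet IDs in rank order, gold snippet ID).
runs = [
    (["f3", "f1", "f7"], "f1"),  # gold snippet at rank 2
    (["f2", "f9", "f4"], "f2"),  # gold snippet at rank 1
    (["f5", "f6", "f8"], "f0"),  # gold snippet not retrieved
]

print(evaluate(runs, k=1))
print(evaluate(runs, k=3))
```

Running the same evaluation on rankings produced from raw and from refined transcripts would show how much retrieval quality a transcription error costs.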
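Experimental Results
Research Questions
- RQ1: How do ASR transcription errors affect downstream code understanding tasks such as code question answering and retrieval?
- RQ2: What are the common failure modes of ASR systems when transcribing code-related speech?
- RQ3: How do the ASR and LLM components perform on lower-resource Indic languages (Gujarati, Tamil) compared with higher-resource languages?
- RQ4: To what extent can large language models improve ASR transcripts of code-related queries and boost downstream task performance?
Key Findings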
| Lang | Dataset | Model | WER | PER | WFED |
|---|---|---|---|---|---|
| Hindi | CodeSearchNet (CSN) | python-ASR | 44% | 33.4% | 34.5% |
| Hindi | CodeSearchNet (CSN) | python-ASR-R | 30.7% | 15.4% | 7.8% |
| Hindi | CodeSearchNet (CSN) | java-ASR | 44.7% | 24.6% | 19.3% |
| Hindi | CodeSearchNet (CSN) | java-ASR-R | 23.8% | 14.7% | 7.9% |
| Hindi | CodeSearchNet (CSN) | php-ASR | 57.3% | 40.1% | 28.0% |
| Hindi | CodeSearchNet (CSN) | php-ASR-R | 37.7% | 28.3% | 20.0% |
| Hindi | CoRNStack (CSk) | python-ASR | 46.7% | 51.0% | 25.3% |
| Hindi | CoRNStack (CSk) | python-ASR-R | 31.7% | 22.9% | 12.3% |
| Hindi | CoRNStack (CSk) | java-ASR | 48.4% | 41.9% | 37.0% |
| Hindi | CoRNStack (CSk) | java-ASR-R | 39.0% | 37.0% | 26.8% |
| Hindi | CoRNStack (CSk) | php-ASR | 38.8% | 25.8% | 14.9% |
| Hindi | CoRNStack (CSk) | php-ASR-R | 37.7% | 28.3% | 20.0% |
| Hindi | CodeQA (QA) | python-ASR | 61.4% | 57.0% | 34.5% |
| Hindi | CodeQA (QA) | python-ASR-R | 13.5% | 3.6% | 2.1% |
| Hindi | CodeQA (QA) | java-ASR | 46.7% | 51.0% | 25.3% |
| Hindi | CodeQA (QA) | java-ASR-R | 24.2% | 19.6% | 12.2% |
| Hindi | CodeQA (QA) | php-ASR | 38.8% | 25.8% | 14.9% |
| Hindi | CodeQA (QA) | php-ASR-R | 34.5% | 25.3% | 12.6% |
| Gujarati | CodeSearchNet (CSN) | python-ASR | 43% | 33.3% | 16.7% |
| Gujarati | CodeSearchNet (CSN) | python-ASR-R | 38.6% | 21.3% | 11.7% |
| Gujarati | CodeSearchNet (CSN) | java-ASR | 47.6% | 25.0% | 20.1% |
| Gujarati | CodeSearchNet (CSN) | java-ASR-R | 37.7% | 28.4% | 14.2% |
| Gujarati | CodeSearchNet (CSN) | php-ASR | 55.5% | 42.7% | 26.7% |
| Gujarati | CodeSearchNet (CSN) | php-ASR-R | 48.0% | 32.8% | 13.7% |
| Gujarati | CoRNStack (CSk) | python-ASR | 44.6% | 47.2% | 25.2% |
| Gujarati | CoRNStack (CSk) | python-ASR-R | 36.6% | 31.4% | 14.3% |
| Gujarati | CoRNStack (CSk) | java-ASR | 56.8% | 47.0% | 30.6% |
| Gujarati | CoRNStack (CSk) | java-ASR-R | 38.9% | 38.9% | 14.3% |
| Gujarati | CoRNStack (CSk) | php-ASR | 34.5% | 25.3% | 12.6% |
| Gujarati | CoRNStack (CSk) | php-ASR-R | 34.5% | 25.3% | 12.6% |
| Gujarati | CodeQA (QA) | python-ASR | 55.8% | 46.7% | 18.5% |
| Gujarati | CodeQA (QA) | python-ASR-R | 19.4% | 6.8% | 5.3% |
| Gujarati | CodeQA (QA) | java-ASR | 40.8% | 49.0% | 25.9% |
| Gujarati | CodeQA (QA) | java-ASR-R | 31.8% | 24.8% | 13.9% |
| Gujarati | CodeQA (QA) | php-ASR | 34.5% | 25.3% | 12.6% |
| Gujarati | CodeQA (QA) | php-ASR-R | 34.5% | 25.3% | 12.6% |
| Tamil | CodeSearchNet (CSN) | python-ASR | 64.8% | 39.7% | 20.8% |
| Tamil | CodeSearchNet (CSN) | python-ASR-R | 56.6% | 27.2% | 17.0% |
| Tamil | CodeSearchNet (CSN) | java-ASR | 65.6% | 27.4% | 19.3% |
| Tamil | CodeSearchNet (CSN) | java-ASR-R | 52.3% | 25.3% | 14.9% |
| Tamil | CodeSearchNet (CSN) | php-ASR | 73.0% | 42.5% | 26.9% |
| Tamil | CodeSearchNet (CSN) | php-ASR-R | 68.1% | 36.4% | 23.2% |
| Tamil | CoRNStack (CSk) | python-ASR | 47.2% | 50.6% | 23.3% |
| Tamil | CoRNStack (CSk) | python-ASR-R | 52.0% | 45.1% | 22.4% |
| Tamil | CoRNStack (CSk) | java-ASR | 49.5% | 40.5% | 34.1% |
| Tamil | CoRNStack (CSk) | java-ASR-R | 61.8% | 42.3% | 36.6% |
| Tamil | CoRNStack (CSk) | php-ASR | 39.3% | 25.0% | 14.2% |
| Tamil | CoRNStack (CSk) | php-ASR-R | 57.4% | 28.2% | 17.6% |
| Tamil | CodeQA (QA) | python-ASR | 68.4% | 45.5% | 23.3% |
| Tamil | CodeQA (QA) | python-ASR-R | 49.1% | 39.0% | 14.4% |
| Tamil | CodeQA (QA) | java-ASR | 43.8% | 38.1% | 21.2% |
| Tamil | CodeQA (QA) | java-ASR-R | 39.4% | 37.7% | 16.5% |
| Tamil | CodeQA (QA) | php-ASR | 39.3% | 25.0% | 14.2% |
| Tamil | CodeQA (QA) | php-ASR-R | 57.4% | 28.2% | 17.6% |
| Bengali | CodeSearchNet (CSN) | python-ASR | 64.0% | 39.7% | 20.8% |
| Bengali | CodeSearchNet (CSN) | python-ASR-R | 41.9% | 27.1% | 14.4% |
| Bengali | CodeSearchNet (CSN) | java-ASR | 69.0% | 50.7% | 42.8% |
| Bengali | CodeSearchNet (CSN) | java-ASR-R | 47.8% | 37.1% | 27.8% |
| Bengali | CodeSearchNet (CSN) | php-ASR | 61.2% | 34.0% | 20.0% |
| Bengali | CodeSearchNet (CSN) | php-ASR-R | 43.1% | 23.7% | 12.9% |
| Bengali | CoRNStack (CSk) | python-ASR | 54.3% | 53.5% | 22.6% |
| Bengali | CoRNStack (CSk) | python-ASR-R | 42.2% | 41.7% | 19.0% |
| Bengali | CoRNStack (CSk) | java-ASR | 69.0% | 50.7% | 42.8% |
| Bengali | CoRNStack (CSk) | java-ASR-R | 39.4% | 37.7% | 16.5% |
| Bengali | CoRNStack (CSk) | php-ASR | 61.2% | 34.0% | 20.0% |
| Bengali | CoRNStack (CSk) | php-ASR-R | 43.1% | 23.7% | 12.9% |
| Bengali | CodeQA (QA) | python-ASR | 65.4% | 44.6% | 27.6% |
| Bengali | CodeQA (QA) | python-ASR-R | 49.1% | 39.0% | 14.4% |
| Bengali | CodeQA (QA) | java-ASR | 56.6% | 44.0% | 23.0% |
| Bengali | CodeQA (QA) | java-ASR-R | 39.4% | 37.7% | 16.5% |
| Bengali | CodeQA (QA) | php-ASR | 61.2% | 34.0% | 20.0% |
| Bengali | CodeQA (QA) | php-ASR-R | 43.1% | 23.7% | 12.9% |
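The table reports WER without defining it in this digest. Assuming the standard definition (word-level edit distance divided by the number of reference words), a minimal sketch is shown below. The example queries are hypothetical and not taken from the paper, and PER and WFED are not covered because the digest does not define them.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical query: two ordinary words are misheard.
print(word_error_rate("sort the list using merge sort",
                      "sort the least using march sort"))

# Hypothetical query: a single identifier is split into three words.
print(word_error_rate("call the function get_user_id",
                      "call the function get user id"))
```

The second example illustrates why identifiers are costly: one misrendered identifier counts as three word errors against a four-word reference, so WER can be high even when the rest of the query is transcribed correctly.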
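- State-of-the-art ASR often produces high error rates on multilingual, code-mixed queries; WER frequently exceeds 50% for low-resource languages.
- Refining transcripts with GPT-4o-mini significantly improves transcription fidelity and downstream task performance across languages and datasets.
- Average gains from refinement across the evaluated settings: WER drops by about 21%, PER by about 29%, and WFED by about 33%.
- Code-aware refinement improves downstream code tasks (QA and retrieval) by preserving code terms and structure more accurately.
- The approach is robust across models: replacing GPT-4o-mini as the refiner with Claude Sonnet 4.5 or Gemini-2.5 Pro preserves the performance trends, suggesting the method is broadly applicable.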
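The digest does not say whether the average reductions are absolute (percentage points) or relative to the unrefined error rate. The sketch below computes both for three WER pairs copied from the table, assuming that rows labeled `ASR` are raw transcripts and rows labeled `ASR-R` are refined ones; the function names are illustrative.

```python
def absolute_drop(before: float, after: float) -> float:
    """Reduction in percentage points (positive means the error rate fell)."""
    return before - after


def relative_drop(before: float, after: float) -> float:
    """Reduction as a percentage of the unrefined error rate."""
    if before == 0:
        raise ValueError("unrefined error rate must be non-zero")
    return (before - after) / before * 100.0


# (label, WER before refinement, WER after refinement), values from the table.
rows = [
    ("Hindi / CodeSearchNet / python", 44.0, 30.7),
    ("Hindi / CodeQA / python", 61.4, 13.5),
    ("Tamil / CoRNStack / php", 39.3, 57.4),
]

for label, before, after in rows:
    print(f"{label}: "
          f"{absolute_drop(before, after):+.1f} points, "
          f"{relative_drop(before, after):+.1f}% relative")
```

A negative value means WER rose after refinement, which is what the table shows for the Tamil CoRNStack PHP row, so the averages conceal some settings where refinement did not help.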
This digest was generated by AI and reviewed by human editors.