QUICK REVIEW

[Paper Review] Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation

Alexandre Bérard, Olivier Pietquin|arXiv (Cornell University)|Dec 6, 2016

Natural Language Processing Techniques | References: 13 | Cited by: 208

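One-Sentence Summary

The paper proposes an end-to-end speech-to-text translation system built on an attention-based encoder-decoder network, compares speech translation with text translation, and evaluates both on a small synthetic French-English corpus.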

ABSTRACT

This paper proposes a first attempt to build an end-to-end speech-to-text translation system, which does not use source language transcription during learning or decoding. We propose a model for direct speech-to-text translation, which gives promising results on a small French-English synthetic corpus. Relaxing the need for source language transcription would drastically change the data collection methodology in speech translation, especially in under-resourced scenarios. For instance, in the former project DARPA TRANSTAC (speech translation from spoken Arabic dialects), a large effort was devoted to the collection of speech transcripts (and a prerequisite to obtain transcripts was often a detailed transcription guide for languages with little standardized spelling). Now, if end-to-end approaches for speech-to-text translation are successful, one might consider collecting data by asking bilingual speakers to directly utter speech in the source language from target language text utterances. Such an approach has the advantage to be applicable to any unwritten (source) language.

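Research Motivation and Goals

Examine why end-to-end speech-to-text translation without source-language transcription is needed.
Propose and compare two end-to-end models, one for text translation and one for speech translation, both using attention.
Assess the feasibility of end-to-end translation on a small, domain-specific corpus.
Show, using synthetic speech data, the potential for robustness to cross-speaker variation.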

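Proposed Method

Use attention-based encoder-decoder neural networks for both text translation and speech translation.
Generate the target sequence with a bidirectional LSTM encoder and a two-layer LSTM decoder with attention.
For text input, apply Bahdanau-style attention; for speech input, use a convolutional attention model with memory, whose convolutional filters keep track of the previous attention weights.
Train with the Adam optimizer and apply dropout between the encoder and the decoder.
Give the speech model a hierarchical encoder that shortens the input sequence, and represent the speech input with 40 MFCC features.
Evaluate with greedy decoding and beam-search decoding, and compare against a conventional statistical machine translation (SMT) baseline.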
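
To make the Bahdanau-style (additive) attention mentioned above concrete, here is a minimal pure-Python sketch of a single decoder step. It is an illustration under stated assumptions, not the paper's implementation: the function name `additive_attention`, the parameter names `W_dec`, `W_enc`, and `v`, and the toy values are all hypothetical, and the convolutional memory used for speech input is not modeled.

```python
import math


def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """One step of additive attention (illustrative sketch).

    score_i = v . tanh(W_dec s + W_enc h_i); weights = softmax(scores);
    context = sum_i weights_i * h_i.
    All parameter names are assumptions made for this example.
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    proj_s = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        proj_h = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the encoder positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy usage with made-up 2-dimensional states and identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    decoder_state=[0.5, -0.5],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    W_dec=identity,
    W_enc=identity,
    v=[1.0, 1.0],
)
```

The weights form a probability distribution over input positions, which is what an attention alignment plot visualizes for each output token.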

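Experimental Results

Research Questions

RQ1: Can an end-to-end speech-to-text translation model be trained without relying on source-language transcription?
RQ2: How does end-to-end speech translation perform relative to text translation and a baseline SMT system on a small synthetic French-English corpus?
RQ3: Can the end-to-end approach generalize to new speakers without explicit speaker adaptation?
RQ4: How do decoding strategies (greedy decoding, beam search with or without a language model) affect translation quality?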
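
The difference between the decoding strategies in RQ4 can be sketched on a toy problem. The code below is a hedged illustration, not the paper's decoder: `toy_model`, the `TOY` probability table, and the function names are hypothetical stand-ins for a trained network's next-token distribution, and language-model rescoring is left out.

```python
import math

EOS = "</s>"

# Hypothetical next-token probabilities, keyed by the prefix decoded so far.
TOY = {
    (): {"a": math.log(0.6), "the": math.log(0.4)},
    ("a",): {"cat": math.log(0.5), "dog": math.log(0.5)},
    ("the",): {"cat": math.log(0.9), "dog": math.log(0.1)},
}


def toy_model(prefix):
    """Return log-probabilities of the next token; unknown prefixes end."""
    return TOY.get(tuple(prefix), {EOS: 0.0})


def greedy_decode(next_log_probs, max_len=10):
    """Pick the single most probable token at every step."""
    prefix, score = [], 0.0
    for _ in range(max_len):
        token, lp = max(next_log_probs(prefix).items(), key=lambda kv: kv[1])
        score += lp
        if token == EOS:
            break
        prefix.append(token)
    return prefix, score


def beam_search(next_log_probs, beam_size=3, max_len=10):
    """Keep the beam_size best partial hypotheses at every step."""
    beams = [([], 0.0)]  # unfinished hypotheses as (tokens, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_log_probs(prefix).items():
                if token == EOS:
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + [token], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    # Fall back to unfinished hypotheses if none reached EOS.
    return max(finished or beams, key=lambda c: c[1])


greedy_tokens, greedy_score = greedy_decode(toy_model)
beam_tokens, beam_score = beam_search(toy_model, beam_size=2)
```

In this toy table, greedy decoding commits to "a" because it is the most probable first token, while a beam of size 2 keeps "the" alive and ends with the higher-probability sequence "the cat". With `beam_size=1` the search reduces to greedy decoding.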

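Key Findings

On a small synthetic French-English corpus, end-to-end speech translation achieves BLEU scores competitive with the baseline SMT system across several decoding settings.
An ensemble of five models combined with a language model comes close to the SMT baseline's BLEU scores on both the development and test sets.
The speech translation model stays fairly robust to speakers not seen in training, which suggests it can generalize without speaker adaptation.
Training is short (about 2 hours for the text model and about 8 hours for the speech model on a GTX 1070), which makes rapid experimentation feasible.
The attention alignment plots indicate that the end-to-end models learn alignment and translation jointly.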
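
One common way to combine several models at decoding time is to average their next-token probabilities. The sketch below assumes that scheme purely for illustration; this summary does not say how the five-model ensemble was actually combined, and the function name `ensemble_log_probs` and the toy distributions are hypothetical.

```python
import math


def ensemble_log_probs(distributions, weights=None):
    """Average per-model next-token probabilities and return log-probs.

    distributions: list of dicts mapping token -> probability, one per model.
    weights: optional per-model weights; uniform if omitted.
    Averaging probabilities is an assumption made for this example.
    """
    n = len(distributions)
    weights = weights or [1.0 / n] * n
    vocab = set().union(*distributions)
    averaged = {
        tok: sum(w * d.get(tok, 0.0) for w, d in zip(weights, distributions))
        for tok in vocab
    }
    # Drop zero-probability tokens so that math.log is always defined.
    return {tok: math.log(p) for tok, p in averaged.items() if p > 0.0}


# Toy usage: two made-up models that disagree on the next token.
combined = ensemble_log_probs([
    {"cat": 0.8, "dog": 0.2},
    {"cat": 0.4, "dog": 0.6},
])
```

The resulting log-probabilities can be fed to a greedy or beam-search decoder in place of a single model's output.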

This review was generated by AI and reviewed by a human editor.