QUICK REVIEW

[Paper Review] Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation

Alexandre Bérard, Olivier Pietquin|arXiv (Cornell University)|Dec 6, 2016

Natural Language Processing Techniques | References: 13 | Citations: 208

One-line summary

This paper presents an end-to-end speech-to-text translation system based on an attention-based encoder-decoder network, compares speech translation with text translation, and evaluates both on a small synthetic French–English corpus.

ABSTRACT

This paper proposes a first attempt to build an end-to-end speech-to-text translation system, which does not use source language transcription during learning or decoding. We propose a model for direct speech-to-text translation, which gives promising results on a small French-English synthetic corpus. Relaxing the need for source language transcription would drastically change the data collection methodology in speech translation, especially in under-resourced scenarios. For instance, in the former project DARPA TRANSTAC (speech translation from spoken Arabic dialects), a large effort was devoted to the collection of speech transcripts (and a prerequisite to obtain transcripts was often a detailed transcription guide for languages with little standardized spelling). Now, if end-to-end approaches for speech-to-text translation are successful, one might consider collecting data by asking bilingual speakers to directly utter speech in the source language from target language text utterances. Such an approach has the advantage to be applicable to any unwritten (source) language.

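Motivation and objectives

Motivate research on end-to-end speech-to-text translation that does not depend on source-language transcripts.
Propose and compare two attention-based end-to-end models, one for text translation and one for speech translation.
Assess whether end-to-end translation can be trained on a small, specialized corpus.
Use synthetic speech data to demonstrate potential robustness to inter-speaker variability.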

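Proposed method

Use attention-based encoder-decoder neural networks for both text translation and speech translation.
Generate the target sequence with a bidirectional LSTM encoder and a two-layer LSTM decoder with attention.
Apply a Bahdanau-style attention mechanism to text input; for speech input, use a convolutional attention model that keeps a memory of the previous attention weights through a convolutional filter.
Train with the Adam optimizer and apply dropout between encoder and decoder layers.
For the speech model, implement a hierarchical encoder that reduces the input sequence length, and represent the speech input with 40 MFCC features.
Evaluate with greedy decoding and beam-search decoding, and compare against a traditional SMT baseline.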
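To make the Bahdanau-style attention step concrete, here is a minimal sketch of additive attention for a single decoding step, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, and the weight values W, U, and v are all hypothetical, and the convolutional memory of previous attention weights used for speech input is not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_states, W, U, v):
    """Return (weights, context) for one decoding step.

    score_i = v . tanh(W @ decoder_state + U @ encoder_states[i])
    weights = softmax(scores)
    context = sum_i weights[i] * encoder_states[i]
    """
    projected_state = matvec(W, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy inputs (hypothetical values chosen only for illustration).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.5, -0.5]
W = [[1.0, 0.0], [0.0, 1.0]]  # projects the decoder state
U = [[1.0, 0.0], [0.0, 1.0]]  # projects each encoder state
v = [1.0, 1.0]                # scoring vector

weights, context = additive_attention(decoder_state, encoder_states, W, U, v)
print("attention weights:", weights)
print("context vector:", context)
```

The weights form a probability distribution over encoder positions, and the context vector is the corresponding weighted average of encoder states. In a trained model W, U, and v would be learned parameters rather than fixed values.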

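Experimental results

Research questions

RQ1: Can an end-to-end speech-to-text translation model be trained without relying on source-language transcripts?
RQ2: On a small synthetic French–English corpus, how does end-to-end speech translation perform compared with text translation and a pipeline SMT baseline?
RQ3: Does the end-to-end approach generalize to new speakers without explicit speaker adaptation?
RQ4: How does the decoding strategy (greedy vs. beam search, with or without a language model) affect translation quality?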
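The difference between greedy and beam-search decoding raised in RQ4 can be seen on a toy example. The sketch below is only illustrative: TOY_MODEL is a hypothetical hand-written table of next-token probabilities standing in for a trained decoder, the function names are made up for this example, and no language model rescoring is included.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix decoded so far.
# A prefix that is missing from the table is treated as a finished hypothesis.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_token_log_probs(prefix):
    """Return {token: log-probability} for the prefix, or None if finished."""
    dist = TOY_MODEL.get(tuple(prefix))
    if dist is None:
        return None
    return {token: math.log(p) for token, p in dist.items()}


def greedy_decode():
    """Pick the single most likely token at every step."""
    prefix, score = [], 0.0
    while True:
        log_probs = next_token_log_probs(prefix)
        if log_probs is None:
            return prefix, score
        token = max(log_probs, key=log_probs.get)
        prefix.append(token)
        score += log_probs[token]


def beam_search_decode(beam_width):
    """Keep the beam_width best partial hypotheses at every step."""
    beams = [([], 0.0)]
    while True:
        candidates = []
        expanded = False
        for prefix, score in beams:
            log_probs = next_token_log_probs(prefix)
            if log_probs is None:
                # Finished hypotheses are carried over unchanged.
                candidates.append((prefix, score))
                continue
            expanded = True
            for token, lp in log_probs.items():
                candidates.append((prefix + [token], score + lp))
        if not expanded:
            return max(beams, key=lambda beam: beam[1])
        candidates.sort(key=lambda beam: beam[1], reverse=True)
        beams = candidates[:beam_width]


greedy_seq, greedy_score = greedy_decode()
beam_seq, beam_score = beam_search_decode(beam_width=2)
print("greedy:", greedy_seq, math.exp(greedy_score))
print("beam:  ", beam_seq, math.exp(beam_score))
```

In this toy table, greedy decoding commits to "a" because it is the most likely first token, while a beam of width 2 keeps "b" alive and finds a sequence with higher overall probability. With a beam width of 1 the two procedures coincide.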

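Key findings

On a small synthetic French–English corpus, end-to-end speech translation reaches BLEU scores competitive with the baseline SMT system under some decoding settings.
An ensemble of five models decoded with a language model achieves BLEU scores close to the SMT baseline on the dev and test sets.
For new speakers not seen during training, the speech translation model remains reasonably robust, suggesting it can generalize without speaker adaptation.
Training is fast (about 2 hours for the text model and about 8 hours for the speech model on a GTX 1070), which makes rapid experimentation feasible.
Attention alignment plots confirm that the end-to-end model learns alignment and translation jointly.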

This review was generated by AI and checked by a human editor.