QUICK REVIEW

[Paper Review] Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation

Ali Can Kocabiyikoglu, Laurent Besacier | arXiv (Cornell University) | Feb 9, 2018
Natural Language Processing Techniques | 10 references | 73 citations
TL;DR

The paper augments LibriSpeech with French translations by aligning English LibriSpeech audio to French text, creating a 236-hour bilingual speech-text corpus for direct end-to-end speech translation evaluation and providing a human-validated quality assessment.

ABSTRACT

Recent works in spoken language translation (SLT) have attempted to build end-to-end speech-to-text translation without using source language transcription during learning or decoding. However, while large quantities of parallel texts (such as Europarl, OpenSubtitles) are available for training machine translation systems, there are no large (100h) and open source parallel corpora that include speech in a source language aligned to text in a target language. This paper tries to fill this gap by augmenting an existing (monolingual) corpus: LibriSpeech. This corpus, used for automatic speech recognition, is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned. After gathering French e-books corresponding to the English audio-books from LibriSpeech, we align speech segments at the sentence level with their respective translations and obtain 236h of usable parallel data. This paper presents the details of the processing as well as a manual evaluation conducted on a small subset of the corpus. This evaluation shows that the automatic alignment scores are reasonably correlated with the human judgments of the bilingual alignment quality. We believe that this corpus (which is made available online) is useful for replicable experiments in direct speech translation or more general spoken language translation experiments.

Motivation & Objective

  • Address the lack of large (>100h) open-source parallel corpora that pair source-language speech with target-language text.
  • Leverage LibriSpeech English audio and French e-book translations to create sentence-aligned bilingual data.
  • Evaluate alignment quality with human judgments and correlate those judgments with automatic alignment scores.
  • Provide a public dataset to enable replicable end-to-end speech translation experiments.

Proposed method

  • Collect French e-books corresponding to LibriSpeech English books via title translations and public-domain sources.
  • Extract French chapters to match English LibriSpeech chapters to form parallel chapters (1423 chapters from 247 books).
  • Align English-French sentences within chapters using hunAlign with an augmented dictionary (128,000 entries) and preprocessing (tokenization, stemming, reverse stemming).
  • Realign LibriSpeech audio to English sentences using mweralign and Gentle, a Kaldi-based forced aligner, so that each speech segment is linked to its French translation.
  • Provide two French translations per sentence (one obtained through the automatic alignment and one from machine translation) and release a data split for speech translation experiments.
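
The chaining of the two alignment layers described above can be pictured as a simple join: speech segments are linked to English sentences, English sentences are linked to French sentences, and composing the two yields speech paired with French text. The sketch below is only an illustration under stated assumptions. The names `FrenchMatch` and `compose_alignments`, the segment identifiers, and the toy sentences are all hypothetical and do not come from the paper or from the hunAlign, mweralign, or Gentle tools.

```python
from typing import Dict, NamedTuple


class FrenchMatch(NamedTuple):
    """A French sentence matched to an English sentence (hypothetical type)."""
    french: str
    score: float  # bilingual alignment confidence; higher means more reliable


def compose_alignments(
    speech_to_english: Dict[str, int],
    english_to_french: Dict[int, FrenchMatch],
) -> Dict[str, FrenchMatch]:
    """Join speech->English and English->French links on the sentence index."""
    linked = {}
    for segment_id, sentence_idx in speech_to_english.items():
        match = english_to_french.get(sentence_idx)
        if match is not None:  # drop segments whose sentence has no French match
            linked[segment_id] = match
    return linked


# Toy data, invented for illustration only.
speech_to_english = {"seg-0001": 0, "seg-0002": 1, "seg-0003": 2}
english_to_french = {
    0: FrenchMatch("C'était une nuit sombre.", 1.2),
    2: FrenchMatch("Elle sourit.", 0.8),
}

speech_to_french = compose_alignments(speech_to_english, english_to_french)
for segment_id, match in speech_to_french.items():
    print(segment_id, match.french, match.score)
```

In this toy run the second segment is dropped because its English sentence has no French counterpart, which mirrors how unaligned material would fall out of the usable parallel data.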

Experimental results

Research questions

  • RQ1: Can a large-scale, open-source corpus be created by aligning LibriSpeech audio with French translations at the sentence level?
  • RQ2: How well do automatic alignment scores (hunAlign) correlate with human judgments of bilingual alignment quality?
  • RQ3: Is it feasible to train end-to-end direct speech translation models on this augmented LibriSpeech corpus?
  • RQ4: What is the quality and usefulness of the resulting multimodal corpus for direct speech translation evaluation?

Key findings

  • The authors produced ~236 hours of English speech aligned with French translations across 1408 chapters from 247 books.
  • Human evaluation on selected chapters shows an average speech alignment score of 2.89/3 and an average bilingual alignment score of 3.84/5, with a Cohen's kappa of 0.76 for annotator agreement.
  • The correlation between human judgments and hunAlign scores is 0.41, which the authors describe as a reasonable correlation between automatic scores and human judgments of quality.
  • Automatic cross-language textual similarity methods yield similar correlation with human judgments, supporting use of automatic scores to filter high-quality alignments.
  • The dataset is publicly available and enables end-to-end speech translation experiments, with BLEU around 15 reported in related results.
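
For readers who want to see how statistics of this kind are computed, here is a minimal sketch of a correlation coefficient, an agreement coefficient, and a score-based filter. It makes several assumptions: Pearson correlation and unweighted Cohen's kappa are used, although this review does not state which variants the paper applied; the function names are hypothetical; and every number in the example is toy data, not a result from the paper.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    if n != len(labels_b) or n == 0:
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    labels = set(counts_a) | set(counts_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def keep_above(scored_segments, threshold):
    """Return ids of (segment_id, score) pairs whose score meets the threshold."""
    return [seg_id for seg_id, score in scored_segments if score >= threshold]


# Toy values, invented for illustration only.
human_scores = [1, 2, 3, 4, 5]            # e.g. 1-5 bilingual alignment ratings
auto_scores = [0.1, 0.4, 0.3, 0.9, 1.3]   # e.g. automatic alignment scores
r = pearson(human_scores, auto_scores)

annotator_a = [3, 3, 2, 3, 1, 3]
annotator_b = [3, 3, 2, 2, 1, 3]
kappa = cohen_kappa(annotator_a, annotator_b)

kept = keep_above([("seg-0001", 1.2), ("seg-0002", -0.3)], threshold=0.5)
print(f"pearson={r:.2f} kappa={kappa:.2f} kept={kept}")
```

A threshold such as the `0.5` used here is arbitrary. In practice it would be chosen by inspecting how human ratings vary across the automatic score range.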

This review was created by AI and reviewed by human editors.