QUICK REVIEW

[Paper Review] Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

G. Ramesh, Sumanth Doddapaneni|arXiv (Cornell University)|Apr 12, 2021

Natural Language Processing Techniques|Cited by 52

ONE-SENTENCE SUMMARY

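Samanantar combines existing data with newly mined parallel data to build about 49.7 million English–Indic sentence pairs across 11 languages, enabling state-of-the-art multilingual neural machine translation (IndicTrans) and broad cross-lingual evaluation.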

ABSTRACT

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly-available parallel corpora, and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://ai4bharat.iitm.ac.in/samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.

MOTIVATION AND OBJECTIVES

Create a large-scale, publicly available English–Indic parallel corpus by compiling existing data and mining new data from diverse sources.

PROPOSED METHOD

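Compile existing English–Indic parallel data from multiple sources (e.g., OPUS, JW300, Wikipedia, subtitles).
Mine additional parallel sentences from machine-readable sources (news websites, education platforms) using LaBSE-based sentence alignment and filtering on a LaBSE Alignment Score (LAS) threshold.
Extract text from non-machine-readable sources via OCR (Google Vision) and align it with its English counterpart using LAS.
Index LaBSE vectors with FAISS to mine parallel sentences at scale from IndicCorp: retrieve nearest neighbors, then filter them with LAS.
Pivot through English to create 83.4M Indic–Indic sentence pairs.
Train multilingual NMT models (IndicTrans) on Samanantar, with careful removal of overlapping data and a unified Devanagari script representation to support transfer learning.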
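The retrieve-then-filter mining step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy 3-dimensional vectors stand in for LaBSE embeddings, an exhaustive search stands in for FAISS approximate nearest neighbor retrieval, cosine similarity plays the role of LAS, and the 0.8 threshold, the romanized example sentences, and the names `cosine` and `mine_pairs` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; used here as a stand-in for LAS."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(
    en: Dict[str, Vector],
    indic: Dict[str, Vector],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """For each English sentence, retrieve its most similar Indic sentence
    and keep the pair only if the similarity clears the threshold."""
    mined = []
    for en_sent, en_vec in en.items():
        best_sent, best_score = None, -1.0
        # Exhaustive search (assumption): a real system would query an ANN index.
        for in_sent, in_vec in indic.items():
            score = cosine(en_vec, in_vec)
            if score > best_score:
                best_sent, best_score = in_sent, score
        if best_sent is not None and best_score >= threshold:
            mined.append((en_sent, best_sent, best_score))
    return mined


# Toy embeddings (hypothetical); real sentence embeddings are much higher-dimensional.
EN = {
    "hello world": [1.0, 0.0, 0.0],
    "good night": [0.0, 1.0, 0.0],
    "unrelated text": [0.0, 0.0, 1.0],
}
HI = {
    "namaste duniya": [0.9, 0.1, 0.0],
    "shubh ratri": [0.1, 0.95, 0.0],
    "kuch aur": [0.5, 0.5, 0.5],
}

# Illustrative threshold, not a value taken from the paper.
pairs = mine_pairs(EN, HI, threshold=0.8)
for en_sent, hi_sent, score in pairs:
    print(f"{en_sent!r} <-> {hi_sent!r} (score={score:.3f})")
```

In this toy run the third English sentence has no sufficiently similar neighbor, so it is dropped. That is the purpose of the threshold: a nearest neighbor always exists, but it is not always a translation.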

EXPERIMENTAL RESULTS

Research Questions

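RQ1: How large, and of what quality, is the publicly available parallel corpus for 11 Indic languages when existing data is combined with web-mined data?
RQ2: Do multilingual NMT models trained on Samanantar outperform existing baselines and commercial systems on Indic-language benchmarks?
RQ3: What is the effect of pivoting through English to extract high-quality Indic–Indic sentence pairs from a large multilingual corpus?
RQ4: How do LaBSE-based alignment and LAS threshold filtering affect the quality of the mined parallel data?
RQ5: Which datasets and evaluation protocols best demonstrate Samanantar's utility for Indic NLP and machine translation?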

Key Findings

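Samanantar contains about 49.7 million English–Indic sentence pairs (12.4 million from existing sources, 37.4 million newly mined).
Mining from IndicCorp contributes 67% of the newly mined data.
Human annotation of 9,566 sentence pairs shows high semantic similarity for the All Accept and Definite Accept categories (mean STS 4.27; Definite Accept 4.63).
The LaBSE-based LAS correlates moderately with human STS scores (Spearman 0.37) and supports effective LAS threshold filtering for high-quality parallel data.
Inter-Indic mining by pivoting through English yields 83.4M Indic–Indic sentence pairs covering 55 language pairs (11 choose 2).
IndicTrans, trained on Samanantar, outperforms existing open models, and even commercial systems on many benchmarks, across 10 Indic languages.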
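English pivoting and the 55-pair count can be illustrated with a short sketch. It assumes that two Indic sentences are paired whenever they are aligned to the exact same English sentence. That matching rule, the function name `pivot_pairs`, the language codes, and the romanized toy sentences are all assumptions made for illustration, not details confirmed by this summary.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Tuple

# 11 Indic languages give C(11, 2) unordered language pairs.
NUM_LANGUAGE_PAIRS = comb(11, 2)


def pivot_pairs(
    en_centric: Dict[str, List[Tuple[str, str]]],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Join English-centric corpora on identical English sentences.

    en_centric maps a language code to (english, indic) sentence pairs.
    Returns, for every unordered language pair, the Indic-Indic sentence
    pairs that share an English pivot sentence.
    """
    # language -> {english sentence -> indic sentence}; if an English
    # sentence occurs more than once, the last translation wins.
    by_english = {lang: dict(pairs) for lang, pairs in en_centric.items()}
    result = {}
    for lang_a, lang_b in combinations(sorted(by_english), 2):
        shared = by_english[lang_a].keys() & by_english[lang_b].keys()
        result[(lang_a, lang_b)] = [
            (by_english[lang_a][en], by_english[lang_b][en])
            for en in sorted(shared)
        ]
    return result


# Toy English-centric data (hypothetical romanized sentences).
corpus = {
    "hi": [("good morning", "suprabhat"), ("thank you", "dhanyavaad")],
    "ta": [("good morning", "kaalai vanakkam")],
    "bn": [("thank you", "dhonnobad"), ("good morning", "suprobhat")],
}

indic_indic = pivot_pairs(corpus)
print(NUM_LANGUAGE_PAIRS)
for langs, sentence_pairs in indic_indic.items():
    print(langs, sentence_pairs)
```

With three toy languages the join produces three language pairs. The same construction over 11 languages gives the 55 pairs reported above.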

This review was generated by AI and reviewed by human editors.