[Paper Explained] IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
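This paper introduces IndicTrans2 and releases the Bharat Parallel Corpus Collection (BPCC), which contains 230M bitext pairs (126M of them newly added, including 644K manually translated pairs). It also provides the first n-way parallel benchmark covering all 22 scheduled Indian languages, along with openly accessible multilingual MT models.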
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India and referred to as scheduled languages, are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/AI4Bharat/IndicTrans2.
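Motivation and Goals
- Address the lack of large-scale parallel data spanning all 22 scheduled Indian languages.
- Create diverse, high-quality benchmarks that cover content relevant to India.
- Develop a single multilingual MT model that supports all 22 languages.
- Release the data, models, and benchmarks under permissive open-access licenses to encourage broad use.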
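Proposed Approach
- Release the Bharat Parallel Corpus Collection (BPCC): 230M bitext pairs, of which 126M are newly added, including 644K manually translated sentence pairs.
- Create the first n-way parallel benchmark covering all 22 Indian languages, spanning diverse domains and including source-original test sets.
- Develop IndicTrans2, a multilingual MT model that supports all 22 scheduled languages and surpasses existing models on multiple benchmarks.
- Release the models and data with open access under permissive licenses to support research and deployment.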
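To make the term "n-way parallel" concrete: when every benchmark sentence is aligned across all languages, any language can serve as source or target. The sketch below counts the translation directions this implies. It assumes only the 22 scheduled languages are counted (including a pivot language such as English would increase the count), and the language labels are hypothetical placeholders, not the paper's language codes.

```python
from itertools import permutations

def directed_pairs(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))

# Hypothetical placeholder labels standing in for the 22 scheduled languages.
languages = [f"lang_{i:02d}" for i in range(1, 23)]

pairs = directed_pairs(languages)
print(len(languages), "languages ->", len(pairs), "translation directions")
```

With 22 languages this gives 22 × 21 = 462 directions, which is why a single n-way test set can evaluate Indic-to-Indic translation without collecting a separate test set per language pair.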
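Experimental Results
Research Questions
- RQ1: Can a single multilingual model effectively cover all 22 scheduled Indian languages?
- RQ2: How does IndicTrans2 perform on the newly created 22-language benchmark compared with previous models?
- RQ3: How does the expanded BPCC data affect translation quality across these 22 languages?
- RQ4: How do openly available data and models affect the accessibility of, and collaboration in, Indic MT research?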
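Key Findings
- BPCC contains 230M bitext pairs, of which 126M are newly added, including 644K manually translated pairs.
- The first n-way parallel benchmark covers all 22 Indian languages, spanning diverse domains and Indian-origin content.
- IndicTrans2 is the first model to support all 22 languages, and it surpasses existing models on multiple existing benchmarks as well as the new benchmarks created in this work.
- The authors release their models and data under permissive licenses to promote accessibility and collaboration.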
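The BPCC figures above can be put in proportion with a little arithmetic. The sketch below uses only the rounded counts quoted in this summary (230M, 126M, 644K), so the derived values are approximate; the "existing" count is simply the total minus the newly added pairs and is an inference, not a number reported here.

```python
# Rounded counts as quoted in the summary (not exact corpus sizes).
TOTAL_PAIRS = 230_000_000
NEW_PAIRS = 126_000_000
MANUAL_PAIRS = 644_000

# Derived quantities (inferred from the rounded figures above).
existing_pairs = TOTAL_PAIRS - NEW_PAIRS
new_share = NEW_PAIRS / TOTAL_PAIRS
manual_share_of_new = MANUAL_PAIRS / NEW_PAIRS
manual_share_of_total = MANUAL_PAIRS / TOTAL_PAIRS

print(f"Pairs from existing sources: {existing_pairs:,}")
print(f"Newly added share of BPCC:   {new_share:.1%}")
print(f"Manual share of new pairs:   {manual_share_of_new:.2%}")
print(f"Manual share of all pairs:   {manual_share_of_total:.2%}")
```

On these rounded numbers, a little over half of BPCC is newly added, while the manually translated portion is well under 1% of the corpus.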
This explainer was generated by AI and reviewed by a human editor.