[Paper Review] IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
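This paper introduces IndicTrans2, releases the Bharat Parallel Corpus Collection (BPCC) of 230M bitext pairs (126M of them newly added, including 644K manually translated pairs), provides the first n-way parallel benchmark covering all 22 scheduled Indian languages, and releases open multilingual MT models.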
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. The 22 languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/AI4Bharat/IndicTrans2.
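Research Motivation and Goals
- Address the lack of large-scale parallel data covering all 22 scheduled Indian languages.
- Create diverse, high-quality benchmarks that include content relevant to India.
- Develop a multilingual MT model that supports all 22 languages.
- Release the data, models, and benchmarks under permissive open-access licenses to enable broad use.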
Proposed Method
- Release the Bharat Parallel Corpus Collection (BPCC): 230M bitext pairs, 126M of them newly added, including 644K manually translated sentence pairs.
- Create the first n-way parallel benchmark spanning all 22 Indian languages with diverse domains and source-original test sets.
- Develop IndicTrans2, a multilingual MT model that supports all 22 scheduled languages and surpasses existing models on multiple benchmarks.
- Provide open-access releases of models and data under permissive licenses to facilitate research and deployment.
Experimental Results
Research Questions
- RQ1: Can a single multilingual model effectively cover all 22 scheduled Indian languages?
- RQ2: How does IndicTrans2 perform on newly created 22-language benchmarks compared with prior models?
- RQ3: What impact does the expanded BPCC data have on translation quality across the 22 languages?
- RQ4: How do open-access data and models influence accessibility and collaboration in Indic MT research?
Key Results
- BPCC comprises 230M bitext pairs, with 126M newly added, including 644K manually translated pairs.
- First n-way benchmark covers all 22 Indian languages with diverse domains and India-origin content.
- IndicTrans2 is the first model to support all 22 languages and surpasses existing models on multiple existing benchmarks as well as the new benchmarks created for this work.
- The authors release their models and data with permissive licenses to promote accessibility and collaboration.
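The corpus figures in the first item can be put in proportion with a little arithmetic. The sketch below uses only the rounded counts quoted in the text (230M, 126M, 644K), so the derived values are approximate. It also assumes that the pairs not counted as newly added come from previously available sources.

```python
# Rounded figures as quoted in the review; exact counts are not given here.
total_pairs = 230_000_000   # all bitext pairs in BPCC
new_pairs = 126_000_000     # pairs newly added in this work
manual_pairs = 644_000      # manually translated pairs (part of the new pairs)

# Assumption: everything that is not newly added was previously available.
existing_pairs = total_pairs - new_pairs

new_share = new_pairs / total_pairs
manual_share_of_new = manual_pairs / new_pairs
manual_share_of_total = manual_pairs / total_pairs

print(f"previously available pairs: {existing_pairs:,}")
print(f"newly added share of BPCC:  {new_share:.1%}")
print(f"manual share of new pairs:  {manual_share_of_new:.2%}")
print(f"manual share of all pairs:  {manual_share_of_total:.2%}")
```

Under these rounded figures, a little over half of BPCC is new, and the manually translated pairs make up well under 1% of the collection.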
This review was generated by AI and reviewed by a human editor.