QUICK REVIEW

[Paper Review] AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Anoop Kunchukuttan, Divyanshu Kakwani|arXiv (Cornell University)|Apr 30, 2020

Natural Language Processing Techniques | References: 23 | Cited by: 43

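One-Sentence Summary

This paper introduces the IndicNLP corpus, which covers 2.7 billion words across 10 Indian languages, releases pre-trained FastText embeddings, and reports improvements over publicly available baseline embeddings on news category classification, word similarity and analogy, and bilingual lexicon induction.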

ABSTRACT

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embedding on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.

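Motivation and Objectives

Build large-scale monolingual corpora for 10 Indic languages that reflect contemporary usage.
Release pre-trained word embeddings trained on the IndicNLP corpus.
Develop a downstream evaluation dataset (news category classification) and unsupervised morphological analyzers.
Show that IndicNLP embeddings outperform publicly available embeddings on multiple NLP tasks.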

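Proposed Method

Collect general-domain monolingual data from news sources and Wikipedia, then preprocess it.
Normalize the Indic text, split it into sentences, and tokenize it with the Indic NLP Library.
Train 300-dimensional FastText skip-gram embeddings with subword information for each language (10 epochs, window size 5, minimum count 5, 10 negative samples).
Evaluate the embeddings on word similarity, word analogy, sentiment/text classification, and bilingual lexicon induction (BLI).
Build the IndicNLP News Category Dataset for 9 languages and classify articles with k-NN (k=4) over averaged word vectors.
Train unsupervised morphological analyzers (Morfessor 2.0) and evaluate the effect of the resulting morphological segmentation on SMT for Indian languages.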
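
The embedding hyperparameters listed above map roughly onto the fastText command-line tool as sketched below. This is a minimal illustration, not the authors' training script: the file names `hi.txt` and `indicnlp.hi` are hypothetical placeholders, and the function only assembles the command without running it.

```python
# Sketch: assemble a fastText CLI call from the hyperparameters in the summary.
# The corpus path and output prefix are hypothetical, not taken from the paper.

def fasttext_skipgram_command(corpus_path, output_prefix):
    """Return the argv list for training skip-gram embeddings with fastText."""
    params = {
        "dim": 300,     # embedding dimensionality
        "epoch": 10,    # training epochs
        "ws": 5,        # context window size
        "minCount": 5,  # ignore words seen fewer than 5 times
        "neg": 10,      # negative samples per positive example
    }
    argv = ["fasttext", "skipgram", "-input", corpus_path, "-output", output_prefix]
    for flag, value in params.items():
        argv += ["-" + flag, str(value)]
    return argv


command = fasttext_skipgram_command("hi.txt", "indicnlp.hi")
print(" ".join(command))
```

If the `fasttext` binary is installed, the list can be handed to `subprocess.run(command, check=True)`. fastText's skip-gram mode uses character n-grams (subword information) by default, so the sketch sets no extra flag for it.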

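Experimental Results

Research Questions

RQ1: Do IndicNLP embeddings outperform the public baselines (FT-W, FT-WC) on intrinsic and extrinsic tasks?
RQ2: How do the monolingual IndicNLP corpora affect word similarity, analogy, sentiment, text classification, and bilingual lexicon induction?
RQ3: Do the IndicNLP resources support unsupervised morphological analysis and improve cross-lingual SMT?
RQ4: How useful and valuable is the corpus for building multilingual representations and downstream NLP benchmarks?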

Key Findings

IndicNLP News Category Dataset: classification accuracy (%) by language
Lang	FT-W	FT-WC	INLP
pa	94.23	94.87	96.79
bn	97.00	97.07	97.86
or	94.00	95.93	98.07
gu	97.05	97.54	99.02
mr	96.44	97.07	99.37
kn	96.13	96.50	97.20
te	98.46	98.17	98.79
ml	90.00	89.33	92.50
ta	95.98	95.81	97.01
Average	95.47	95.81	97.40
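
The method section describes classifying news articles with k-NN (k=4) over averaged word vectors, which is the setup behind the per-language accuracies in the table above. A minimal sketch of that idea follows. The toy embeddings, the training documents, and the choice of cosine similarity as the distance measure are all assumptions made for illustration; the summary does not specify the metric.

```python
import math
from collections import Counter


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (None if none do)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(doc_tokens, train_docs, embeddings, k=4):
    """Label a document by majority vote among its k most similar training docs."""
    query = mean_vector(doc_tokens, embeddings)
    if query is None:
        return None
    scored = []
    for tokens, label in train_docs:
        vec = mean_vector(tokens, embeddings)
        if vec is not None:
            scored.append((cosine(query, vec), label))
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional embeddings and labelled documents (assumed data).
toy_embeddings = {
    "match": [1.0, 0.0], "goal": [0.9, 0.1], "team": [0.8, 0.2],
    "film": [0.0, 1.0], "actor": [0.1, 0.9], "song": [0.2, 0.8],
}
toy_train = [
    (["match", "goal"], "sports"),
    (["team", "goal"], "sports"),
    (["team", "match"], "sports"),
    (["film", "actor"], "entertainment"),
    (["song", "actor"], "entertainment"),
    (["film", "song"], "entertainment"),
]

sports_guess = knn_predict(["goal", "team", "match"], toy_train, toy_embeddings)
film_guess = knn_predict(["film", "song"], toy_train, toy_embeddings)
print(sports_guess, film_guess)
```

Because k=4 is even, a 2-2 vote is possible. This sketch leaves tie-breaking to `Counter.most_common`, which is a simplification rather than anything stated in the summary.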

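IndicNLP embeddings outperform both public baselines on several tasks; the average word-similarity Pearson correlation across languages rises to 0.519 (INLP), compared with 0.507 (FT-W) and 0.497 (FT-WC).
On word analogy (Hindi subset), IndicNLP reaches 33.48% accuracy, compared with 19.76% for FT-W and 32.93% for FT-WC.
On text classification across a range of public datasets, IndicNLP embeddings achieve higher accuracy (74.73% on average), compared with 69.25% for FT-W and 68.32% for FT-WC.
On the IndicNLP News Category Dataset, INLP embeddings score highest in every language (e.g., pa: 96.79, bn: 97.86, or: 98.07, gu: 99.02, mr: 99.37, te: 98.79, ta: 97.01; overall average 97.40).
On bilingual lexicon induction (BLI) with GeoMM, INLP reaches the highest average accuracy for en→Indic (36.55) and 44.94 for Indic→en, compared with 25.98/33.20 for FT-W and 32.88/44.94 for FT-WC.
Unsupervised morphological analyzers trained on IndicNLP beat the word-level baseline on SMT BLEU and are competitive with the earlier morphological analyzers of K&B (2016) (average BLEU: word 22.84, morph 24.21, morph (K&B, 2016) 24.57).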
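
The word-similarity figures above are Pearson correlations between embedding-based similarity and human judgments. The sketch below shows how such a score can be computed, under stated assumptions: the toy vectors and human ratings are invented for illustration, cosine similarity is assumed as the model score, and pairs with a missing word are simply skipped, which may differ from the paper's handling.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def word_similarity_score(pairs, embeddings):
    """Correlate model cosine similarities with human ratings.

    pairs: iterable of (word1, word2, human_rating).
    Pairs containing a word without an embedding are skipped (an assumption).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(rating)
    return pearson(model_scores, human_scores)


# Hypothetical embeddings and human ratings (assumed data, not from the paper).
toy_embeddings = {
    "king": [1.0, 0.1], "queen": [0.9, 0.2],
    "apple": [0.1, 1.0], "fruit": [0.2, 0.9],
}
toy_pairs = [
    ("king", "queen", 9.0),
    ("apple", "fruit", 8.5),
    ("king", "apple", 1.5),
    ("queen", "fruit", 2.0),
]

score = word_similarity_score(toy_pairs, toy_embeddings)
print(round(score, 3))
```

The same routine can be run once per embedding set (FT-W, FT-WC, INLP) on a shared list of word pairs to obtain comparable correlations.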

This review was generated by AI and checked by a human editor.