QUICK REVIEW

[Paper Review] Indic-Transformers: An Analysis of Transformer Language Models for Indian Languages

Kushal Jain, Adwait Deshpande|arXiv (Cornell University)|Nov 4, 2020

Topic Modeling|Cited by 24

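One-Sentence Summary

This paper trains and evaluates monolingual Transformer language models (BERT, DistilBERT, RoBERTa, and XLM-RoBERTa) for Hindi, Bengali, and Telugu, and reports state-of-the-art (SOTA) text classification results for Hindi and Bengali. It compares fine-tuning the full model against using it as a feature extractor with a downstream classifier, argues that competitive performance is achievable even with limited data, and releases model checkpoints and a merged question-answering (QA) dataset to the community.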

ABSTRACT

Language models based on the Transformer architecture have achieved state-of-the-art performance on a wide range of NLP tasks such as text classification, question-answering, and token classification. However, this performance is usually tested and reported on high-resource languages, like English, French, Spanish, and German. Indian languages, on the other hand, are underrepresented in such benchmarks. Despite some Indian languages being included in training multilingual Transformer models, they have not been the primary focus of such work. In order to evaluate the performance on Indian languages specifically, we analyze these language models through extensive experiments on multiple downstream tasks in Hindi, Bengali, and Telugu language. Here, we compare the efficacy of fine-tuning model parameters of pre-trained models against that of training a language model from scratch. Moreover, we empirically argue against the strict dependency between the dataset size and model performance, but rather encourage task-specific model and method selection. We achieve state-of-the-art performance on Hindi and Bengali languages for text classification task. Finally, we present effective strategies for handling the modeling of Indian languages and we release our model checkpoints for the community : https://huggingface.co/neuralspace-reverie.

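Motivation and Objectives

Address the underrepresentation of Indian languages in NLP research by training and evaluating monolingual Transformer models for Hindi, Bengali, and Telugu.
Compare fine-tuning full pre-trained models against using them as feature extractors beneath task-specific heads.
Examine whether dataset size strictly determines model performance in low-resource Indian-language settings.
Release trained model checkpoints and a merged QA dataset (mergedQuAD) to support future research on Indian-language NLP.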

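Proposed Method

Train four monolingual Transformer variants (BERT, DistilBERT, RoBERTa, and XLM-RoBERTa) on large monolingual corpora for Hindi, Bengali, and Telugu.
Evaluate the models on three downstream tasks (part-of-speech tagging, text classification, and question answering) under three experimental setups that differ in data volume and fine-tuning strategy.
Compare the monolingual models with multilingual models such as mBERT and XLM-RoBERTa to gauge the relative size of any performance gain.
Place different neural heads (LSTM, BiLSTM, feed-forward network, and Transformer) on top of the contextual embeddings to assess how well feature extraction works.
Use byte-level BPE tokenization in RoBERTa and analyze its effect on performance, particularly on question answering.
Release model checkpoints on Hugging Face and open-source mergedQuAD, a merged version of the XQuAD and MLQA datasets for Hindi.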
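
The dataset-merging step behind mergedQuAD can be sketched roughly as follows. This is a minimal illustration, not the authors' script. It assumes both sources follow the SQuAD-style JSON layout (`data`, then `paragraphs`, then `qas`). The helper names `make_qa`, `merge_squad_style`, and `count_questions`, the toy Hindi records, and the question ids are all hypothetical, and skipping repeated question ids is a choice made for this sketch only.

```python
from typing import Dict, Iterable, List


def make_qa(qid: str, question: str, context: str, answer: str) -> Dict:
    """Build one SQuAD-style QA record; answer_start is a character offset."""
    return {
        "id": qid,
        "question": question,
        "answers": [{"text": answer, "answer_start": context.index(answer)}],
    }


def merge_squad_style(datasets: Iterable[Dict], version: str = "merged") -> Dict:
    """Concatenate SQuAD-style datasets, skipping question ids already seen."""
    seen_ids = set()
    merged_articles: List[Dict] = []
    for dataset in datasets:
        for article in dataset.get("data", []):
            kept_paragraphs = []
            for paragraph in article.get("paragraphs", []):
                kept_qas = []
                for qa in paragraph.get("qas", []):
                    if qa["id"] in seen_ids:
                        continue  # assumption of this sketch: first copy wins
                    seen_ids.add(qa["id"])
                    kept_qas.append(qa)
                if kept_qas:
                    kept_paragraphs.append(
                        {"context": paragraph["context"], "qas": kept_qas}
                    )
            if kept_paragraphs:
                merged_articles.append(
                    {"title": article.get("title", ""), "paragraphs": kept_paragraphs}
                )
    return {"version": version, "data": merged_articles}


def count_questions(dataset: Dict) -> int:
    """Count QA pairs across all articles and paragraphs."""
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )


# Toy stand-ins for the two sources (hypothetical records, not real data).
ctx_a = "दिल्ली भारत की राजधानी है।"
ctx_b = "गंगा भारत की एक नदी है।"
xquad_like = {
    "version": "toy",
    "data": [{"title": "toy-a", "paragraphs": [
        {"context": ctx_a,
         "qas": [make_qa("q1", "भारत की राजधानी क्या है?", ctx_a, "दिल्ली")]},
    ]}],
}
mlqa_like = {
    "version": "toy",
    "data": [{"title": "toy-b", "paragraphs": [
        {"context": ctx_b,
         "qas": [make_qa("q1", "भारत की नदी कौन सी है?", ctx_b, "गंगा"),  # repeated id
                 make_qa("q2", "गंगा क्या है?", ctx_b, "नदी")]},
    ]}],
}

merged = merge_squad_style([xquad_like, mlqa_like])
print(count_questions(merged))  # 2: the repeated id "q1" is kept only once
```

In practice the two dictionaries would come from `json.load` on the downloaded files. Whether the released mergedQuAD deduplicates or simply concatenates is not stated in this summary.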

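Experimental Results

Research Questions

RQ1: Does training monolingual Transformer models from scratch outperform multilingual models on Indian languages?
RQ2: How strongly does dataset size correlate with downstream task performance in low-resource Indian-language settings?
RQ3: Can a pre-trained Transformer used as a feature extractor with a lightweight head (such as an LSTM) match the performance of full fine-tuning?
RQ4: How does the choice of tokenizer (such as byte-level BPE) affect model performance on Indian languages, particularly on question answering?
RQ5: How does merging multiple datasets (such as XQuAD and MLQA) affect the training and evaluation of Hindi question-answering models?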

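Key Findings

In Setup C, the authors' monolingual models achieve state-of-the-art text classification performance for Hindi and Bengali, outperforming existing baselines.
On question answering, the models do not surpass the TyDiQA gold-passage baseline, which was trained on the full multilingual dataset; this points to a substantial benefit from cross-lingual transfer.
Monolingual models show only marginal gains over multilingual models, suggesting that multilingual models may suffice for some tasks.
Using a Transformer as a feature extractor with an LSTM head gives competitive results, which is especially useful when full fine-tuning is infeasible because of resource constraints.
The choice of tokenizer, in particular byte-level BPE in RoBERTa, has a measurable effect on performance, most visibly on question answering.
Despite a smaller monolingual training corpus, the Telugu models perform well on question answering, suggesting that task-specific dataset size may matter more than monolingual corpus size.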
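
To make the tokenizer finding more concrete, here is a toy illustration of what byte-level BPE starts from on Devanagari text. It is a sketch under simplifying assumptions, not RoBERTa's tokenizer: real implementations learn merges from pair counts over a whole corpus after pre-tokenization, whereas this version counts pairs inside a single string. The function names are hypothetical. The point is only that each Devanagari code point occupies three UTF-8 bytes, so a byte-level vocabulary needs several merges before it recovers even a single character.

```python
from collections import Counter
from typing import List, Optional, Tuple

Token = Tuple[int, ...]  # a token is a tuple of raw byte values


def to_byte_tokens(text: str) -> List[Token]:
    """Split text into one token per UTF-8 byte."""
    return [(b,) for b in text.encode("utf-8")]


def most_frequent_pair(tokens: List[Token]) -> Optional[Tuple[Token, Token]]:
    """Return the most frequent adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens: List[Token], pair: Tuple[Token, Token]) -> List[Token]:
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(text: str, num_merges: int):
    """Greedily apply up to `num_merges` BPE merges to a single string."""
    tokens = to_byte_tokens(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


def decode(tokens: List[Token]) -> str:
    """Concatenate the token bytes and decode them back to text."""
    return b"".join(bytes(t) for t in tokens).decode("utf-8")


# The Hindi word "namaste", written with escapes so the code points are explicit.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(word))                  # 6 code points
print(len(to_byte_tokens(word)))  # 18 bytes, three per code point

tokens, merges = learn_merges(word, 1)
print(len(tokens))                # 14 tokens after merging the pair (0xE0, 0xA4)
print(decode(tokens) == word)     # True: merging never loses information
```

The first merge joins the shared two-byte prefix `0xE0 0xA4`, which says nothing about which character follows. That is one plausible reason, offered here as an illustration rather than as the paper's analysis, why byte-level vocabularies can fragment Indic text.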


This review was generated by AI and reviewed by a human editor.