QUICK REVIEW

[Paper Review] L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Ravindra Nayak, Raviraj Joshi|arXiv (Cornell University)|Apr 18, 2022

Natural Language Processing Techniques | Cited by 21

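One-Sentence Summary

This paper introduces L3Cube-HingCorpus, a large corpus of real Hindi-English code-mixed text, and the HingBERT family of models pre-trained on it, which show improvements on GLUECoS tasks; it also releases the HingLID and HingGPT resources.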

ABSTRACT

Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on codemixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. The HingGPT is a GPT2 based generative transformer model capable of generating full tweets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp .

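Research Motivation and Goals

Motivate and build a large-scale corpus of real Hindi-English code-mixed text written in Roman script, addressing the scarcity of real code-mixed data.
Pre-train BERT-based models on HingCorpus to create HingBERT, HingMBERT, and HingRoBERTa for code-mixed NLP.
Evaluate these models on downstream code-mixed tasks from the GLUECoS benchmark (LID, POS tagging, NER, sentiment analysis).
Release companion resources (HingLID, HingGPT, HingFT) to support code-mixed Hindi-English NLP research.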

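Proposed Method

Scrape Twitter using a targeted Hindi-English vocabulary to build HingCorpus in Roman script.
Tag each word with a word-level LID model and keep only sentences that contain at least 2 Hindi words and at least 2 English words.
Pre-train BERT variants (BERT-base, m-BERT, XLM-RoBERTa) on HingCorpus for 2 epochs with the MLM objective (15% masking), yielding HingBERT, HingMBERT, and HingRoBERTa.
Fine-tune the pre-trained models on the GLUECoS downstream tasks; for classification tasks, feed the [CLS] vector into a feed-forward head.
Create Roman-script and mixed-script (Roman + Devanagari) versions of the models, and evaluate on mixed-script (Devanagari) tasks.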
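
The sentence-level filter described above (keep a sentence only if it has at least 2 Hindi and at least 2 English words) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the two word lists and the `tag_word` lookup are hypothetical stand-ins for the trained word-level LID model, and whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical toy lexicons standing in for a trained word-level LID model.
HINDI_WORDS = {"mujhe", "nahi", "hai", "kya", "bahut", "accha", "yaar", "kal"}
ENGLISH_WORDS = {"movie", "was", "really", "good", "the", "i", "love", "this", "meeting"}


def tag_word(word):
    """Return 'HI', 'EN', or 'OTHER' for one token (toy lookup, not a real model)."""
    w = word.lower()
    if w in HINDI_WORDS:
        return "HI"
    if w in ENGLISH_WORDS:
        return "EN"
    return "OTHER"


def is_code_mixed(sentence, min_hi=2, min_en=2):
    """Apply the threshold rule: at least min_hi Hindi and min_en English words."""
    counts = Counter(tag_word(tok) for tok in sentence.split())
    return counts["HI"] >= min_hi and counts["EN"] >= min_en


def filter_corpus(sentences):
    """Keep only the sentences that pass the code-mixing threshold."""
    return [s for s in sentences if is_code_mixed(s)]


sample = [
    "yaar this movie bahut accha hai",  # 4 Hindi, 2 English words
    "i love this movie",                # English only
    "mujhe nahi pata kya hai",          # Hindi only ('pata' is unknown to the toy lexicon)
    "kal meeting hai",                  # 2 Hindi, 1 English word
]
kept = filter_corpus(sample)
print(kept)
```

With these toy lexicons only the first sentence survives; in practice the per-word tags would come from a learned classifier rather than a dictionary lookup.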

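Experimental Results

Research Questions

RQ1: Does a large-scale real Hinglish corpus yield better code-mixed language understanding than models trained on monolingual or synthetic data?
RQ2: Do the HingBERT family models outperform the baseline BERT variants on code-mixed LID, POS, NER, and sentiment tasks?
RQ3: How does mixed-script training (Roman + Devanagari) affect downstream code-mixed tasks?
RQ4: How much do the released resources (the LID corpus, HingBERT-LID, HingGPT) help scale up Hinglish NLP research?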

Key Findings

Model	LID	POS-UD	POS-FG	NER	Sentiment	HingLID
BERT	78.69	83.70	70.75	79.27	59.16	96.04
m-BERT	82.56	83.68	69.58	76.64	58.42	95.59
XLMRoBERTa	85.93	87.24	70.95	77.01	61.57	95.42
HingBERT	84.44	88.42	71.04	81.80	63.72	96.21
HingMBERT	84.90	89.47	71.55	80.09	63.51	96.27
HingRoBERTa	86.69	90.17	71.69	81.13	66.43	96.15
HingMBERT-mixed	83.26	90.06	70.34	81.12	63.51	96.29
HingRoBERTa-mixed	86.13	89.87	70.73	80.68	66.73	95.96
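
To make the table easier to read, the scores can be compared programmatically. The sketch below copies the numbers from the table, computes each Roman-script Hing model's gain over its base model, and finds the top scorer per task. The base-model pairing (HingBERT with BERT, HingMBERT with m-BERT, HingRoBERTa with XLMRoBERTa) follows the pre-training description in this review; the helper names are illustrative only.

```python
TASKS = ["LID", "POS-UD", "POS-FG", "NER", "Sentiment", "HingLID"]

# Scores copied from the results table, in TASKS order.
SCORES = {
    "BERT": [78.69, 83.70, 70.75, 79.27, 59.16, 96.04],
    "m-BERT": [82.56, 83.68, 69.58, 76.64, 58.42, 95.59],
    "XLMRoBERTa": [85.93, 87.24, 70.95, 77.01, 61.57, 95.42],
    "HingBERT": [84.44, 88.42, 71.04, 81.80, 63.72, 96.21],
    "HingMBERT": [84.90, 89.47, 71.55, 80.09, 63.51, 96.27],
    "HingRoBERTa": [86.69, 90.17, 71.69, 81.13, 66.43, 96.15],
    "HingMBERT-mixed": [83.26, 90.06, 70.34, 81.12, 63.51, 96.29],
    "HingRoBERTa-mixed": [86.13, 89.87, 70.73, 80.68, 66.73, 95.96],
}

# Each Hing model paired with the base model it was further pre-trained from.
BASE_OF = {"HingBERT": "BERT", "HingMBERT": "m-BERT", "HingRoBERTa": "XLMRoBERTa"}


def gains_over_base(scores, base_of):
    """Per-task score difference between each Hing model and its base model."""
    return {
        model: {
            task: round(hing - base, 2)
            for task, hing, base in zip(TASKS, scores[model], scores[base_name])
        }
        for model, base_name in base_of.items()
    }


def best_per_task(scores):
    """Name of the highest-scoring model for each task."""
    return {task: max(scores, key=lambda m: scores[m][i]) for i, task in enumerate(TASKS)}


gains = gains_over_base(SCORES, BASE_OF)
best = best_per_task(SCORES)
for model, per_task in gains.items():
    print(model, per_task)
print(best)
```

Every gain computed this way is positive, which is the sense in which the Hing models improve on their base models in this table.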

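On Roman-script code-mixed tasks, each HingBERT family model scores higher (F1 and accuracy) than its corresponding baseline BERT variant.
HingRoBERTa, built on XLM-RoBERTa, is generally the strongest model on Roman-script tasks, reaching state-of-the-art-level results on several metrics: in the table it has the top LID (86.69), POS-UD (90.17), and POS-FG (71.69) scores, while HingBERT leads on NER (81.80).
The mixed-script models, HingMBERT-mixed and HingRoBERTa-mixed, improve performance on tasks that mix Devanagari and Roman script, though not uniformly on every task.
On purely Roman-script tasks, the Roman-script models generally outperform the mixed-script models, whereas the mixed-script models do well in mixed-script evaluation.
HingBERT-LID (publicly released) reaches 98.77 on the HingLID test set, and HingLID-based models can be used to collect additional Hinglish data.
HingGPT is a GPT-2 style generator trained on HingCorpus that can generate full tweets, supporting the creation of synthetic code-mixed data.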


This review was generated by AI and reviewed by human editors.