QUICK REVIEW

[Paper Review] L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Ravindra Nayak, Raviraj Joshi|arXiv (Cornell University)|Apr 18, 2022

Natural Language Processing Techniques | Citations: 21

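One-Line Summary

This paper proposes L3Cube-HingCorpus, a large-scale corpus of real Hindi-English code-mixed text, and introduces the HingBERT family of models. It reports improvements on GLUECoS tasks and also releases the HingLID and HingGPT resources.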

ABSTRACT

Code-switching occurs when more than one language is mixed in a given sentence or a conversation. This phenomenon is more prominent on social media platforms and its adoption is increasing over time. Therefore code-mixed NLP has been extensively studied in the literature. As pre-trained transformer-based architectures are gaining popularity, we observe that real code-mixing data are scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code mixed data in a Roman script. It consists of 52.93M sentences and 1.04B tokens, scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on codemixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on the subsequent downstream tasks like code-mixed sentiment analysis, POS tagging, NER, and LID from the GLUECoS benchmark. The HingGPT is a GPT2 based generative transformer model capable of generating full tweets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification(LID) dataset and HingBERT-LID, a production-quality LID model to facilitate capturing of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp .

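Motivation and Objectives

Address the scarcity of real Hindi-English code-mixed data in Roman script by motivating and building a large-scale corpus of real code-mixed text.
Pre-train BERT-family models on HingCorpus to create HingBERT, HingMBERT, and HingRoBERTa for code-mixed NLP.
Evaluate the models on downstream tasks from the GLUECoS benchmark (LID, POS tagging, NER, and sentiment analysis).
Release accompanying resources such as HingLID, HingGPT, and HingFT to support Hindi-English code-mixed NLP research.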

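Proposed Method

Scrape Twitter data using a target Hindi-English vocabulary and build HingCorpus in Roman script.
Filter and classify words with a word-level LID model, retaining sentences that contain at least two Hindi words and at least two English words.
Pre-train BERT variants (BERT-base, m-BERT, XLM-RoBERTa) on HingCorpus for 2 epochs with MLM (15% masking) to create HingBERT, HingMBERT, and HingRoBERTa.
Fine-tune the pre-trained models on the GLUECoS downstream tasks; classification tasks use the [CLS] embedding with a feed-forward head.
Create Roman-script and mixed-script (Roman+Devanagari) versions of the models, and evaluate them on mixed-script Devanagari tasks.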
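
The sentence-retention rule in the second step (at least two Hindi words and at least two English words) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the paper uses a trained word-level LID model, whereas `toy_word_lid` and its two small word sets are hypothetical stand-ins invented for this example, and whitespace tokenization is an assumption.

```python
from typing import Callable, Iterable, List

# Hypothetical toy lexicons standing in for a trained word-level LID model.
# They exist only to make the sketch runnable; they are not from the paper.
TOY_HINDI = {"mujhe", "nahi", "hai", "kya", "bahut", "accha", "yaar", "pata"}
TOY_ENGLISH = {"movie", "was", "the", "really", "good", "is", "this", "very"}


def toy_word_lid(word: str) -> str:
    """Label one Roman-script word as 'hi', 'en', or 'other' (toy lookup)."""
    w = word.lower()
    if w in TOY_HINDI:
        return "hi"
    if w in TOY_ENGLISH:
        return "en"
    return "other"


def is_code_mixed(
    sentence: str,
    lid: Callable[[str], str] = toy_word_lid,
    min_hi: int = 2,
    min_en: int = 2,
) -> bool:
    """Return True if the sentence has >= min_hi Hindi and >= min_en English words."""
    labels = [lid(token) for token in sentence.split()]  # assumed whitespace tokenization
    return labels.count("hi") >= min_hi and labels.count("en") >= min_en


def filter_code_mixed(
    sentences: Iterable[str],
    lid: Callable[[str], str] = toy_word_lid,
) -> List[str]:
    """Keep only the sentences that pass the code-mixing threshold."""
    return [s for s in sentences if is_code_mixed(s, lid)]
```

Passing the LID function as a parameter keeps the threshold logic separate from the classifier, so the toy lookup could be swapped for a real word-level model without touching the filter.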

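Experimental Results

Research Questions

RQ1 Can a large-scale corpus of real Hinglish text improve code-mixed language understanding compared with models trained on monolingual or synthetic data?
RQ2 Do the HingBERT-family models outperform baseline BERT variants on code-mixed LID, POS, NER, and sentiment analysis tasks?
RQ3 How does mixed-script training (Roman+Devanagari) affect downstream code-mixed tasks?
RQ4 How effective are the released resources (LID corpus, HingBERT-LID, HingGPT) at extending Hinglish NLP research?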

Key Findings

Model	LID	POS-UD	POS-FG	NER	Sentiment	HingLID
BERT	78.69	83.70	70.75	79.27	59.16	96.04
m-BERT	82.56	83.68	69.58	76.64	58.42	95.59
XLMRoBERTa	85.93	87.24	70.95	77.01	61.57	95.42
HingBERT	84.44	88.42	71.04	81.80	63.72	96.21
HingMBERT	84.90	89.47	71.55	80.09	63.51	96.27
HingRoBERTa	86.69	90.17	71.69	81.13	66.43	96.15
HingMBERT-mixed	83.26	90.06	70.34	81.12	63.51	96.29
HingRoBERTa-mixed	86.13	89.87	70.73	80.68	66.73	95.96

HingBERT-family models achieve higher F1 and accuracy on code-mixed tasks than baseline BERT variants in Roman script; in the table above, each Roman-script Hing model scores higher than its corresponding baseline (BERT, m-BERT, XLM-RoBERTa) in every column.
HingRoBERTa, which is based on XLM-RoBERTa, generally performs best among the HingBERT variants, with the top scores in the table on LID (86.69), POS-UD (90.17), and POS-FG (71.69).
The mixed-script models HingMBERT-mixed and HingRoBERTa-mixed improve performance on mixed-script (Devanagari+Roman) tasks, though not universally across all tasks.
The Roman-script models outperform their mixed-script counterparts on pure Roman-script tasks, while mixed-script models excel on mixed-script evaluation.
HingBERT-LID achieves 98.77 on the HingLID test set (public release), and HingLID-based models enable large-scale Hinglish data augmentation.
HingGPT provides a GPT-2 based generator trained on HingCorpus capable of producing full tweets, supporting synthetic code-mixed data generation.
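
The per-task leaders cited in these findings can be rechecked directly from the results table. The sketch below copies the table values from this review, with columns separated by whitespace, and picks the highest score in each column; the function names are illustrative and not part of any released code.

```python
from typing import Dict, List, Tuple

# Results table as given in this review (scores per model and task).
TABLE = """\
Model LID POS-UD POS-FG NER Sentiment HingLID
BERT 78.69 83.70 70.75 79.27 59.16 96.04
m-BERT 82.56 83.68 69.58 76.64 58.42 95.59
XLMRoBERTa 85.93 87.24 70.95 77.01 61.57 95.42
HingBERT 84.44 88.42 71.04 81.80 63.72 96.21
HingMBERT 84.90 89.47 71.55 80.09 63.51 96.27
HingRoBERTa 86.69 90.17 71.69 81.13 66.43 96.15
HingMBERT-mixed 83.26 90.06 70.34 81.12 63.51 96.29
HingRoBERTa-mixed 86.13 89.87 70.73 80.68 66.73 95.96
"""


def parse_table(text: str) -> Tuple[List[str], Dict[str, List[float]]]:
    """Split the table into task names and a model -> scores mapping."""
    lines = [line.split() for line in text.strip().splitlines()]
    tasks = lines[0][1:]  # drop the 'Model' header cell
    scores = {row[0]: [float(x) for x in row[1:]] for row in lines[1:]}
    return tasks, scores


def best_per_task(text: str) -> Dict[str, Tuple[str, float]]:
    """Return, for each task, the (model, score) pair with the highest score."""
    tasks, scores = parse_table(text)
    best = {}
    for i, task in enumerate(tasks):
        model = max(scores, key=lambda m: scores[m][i])
        best[task] = (model, scores[model][i])
    return best


if __name__ == "__main__":
    for task, (model, score) in best_per_task(TABLE).items():
        print(f"{task}: {model} ({score:.2f})")
```

Note that the top score is not held by a single model: NER, Sentiment, and HingLID are each led by a different variant than the one that leads LID and both POS tasks.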


This review was generated by AI and checked by a human editor.