QUICK REVIEW

[Paper Review] IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus

Honghao Gui, Lin Yuan|arXiv (Cornell University)|Feb 22, 2024

Topic Modeling | Citations: 6

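One-Line Summary

IEPile is a bilingual English-Chinese instruction corpus for information extraction, built from 33 existing IE datasets and totaling approximately 0.32B tokens; its schema-based instruction generation improves the zero-shot information extraction performance of LLMs.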

ABSTRACT

Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). Note that high-quality instruction data is the vital key for enhancing the specific capabilities of LLMs, while current IE datasets tend to be small in scale, fragmented, and lack standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets, and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs for IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.

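Motivation and Objectives

Motivate the need for large-scale, standardized IE instruction data to close the performance gap of LLMs on IE tasks.
Build a comprehensive bilingual IE instruction corpus from existing datasets to enable scalable IE training.
Develop schema-based instruction generation to address schema-query mismatch and semantic confusion in IE tasks.
Show that LLMs fine-tuned on IEPile achieve better zero-shot IE performance on English and Chinese datasets.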

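Proposed Method

Collect and clean 33 existing IE datasets spanning English and Chinese.
Standardize formats, deduplicate instances, and filter out low-quality data.
Introduce hard negative schema construction so that semantically similar negative schemas are emphasized within each instruction.
Apply batched instruction generation to limit and diversify the number of schemas queried per instruction.
Fine-tune Baichuan2 and LLaMA2 models on IEPile and evaluate their zero-shot IE performance.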
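
The two schema-based steps (hard negative schemas and batched instruction generation) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_instructions`, the prompt wording, the JSON field names, the batch size, and the number of random negatives are all assumptions chosen for illustration. The sketch gathers the schemas that actually occur in a text, adds the schemas listed as easily confused with them, pads with a few random negatives, and then splits the shuffled schema list into small batches so that each instruction queries only a limited number of schemas.

```python
import json
import random


def build_instructions(text, positive_schemas, hard_negatives, all_schemas,
                       batch_size=4, n_random_negatives=2, seed=0):
    """Hypothetical helper: turn one annotated text into several instructions."""
    rng = random.Random(seed)

    # Schemas that are actually present in the text (order kept, duplicates dropped).
    positives = list(dict.fromkeys(positive_schemas))

    # Hard negatives: schemas recorded as semantically close to a positive schema.
    hard = []
    for schema in positives:
        for neg in hard_negatives.get(schema, []):
            if neg not in positives and neg not in hard:
                hard.append(neg)

    # A few random negatives drawn from whatever is left (an assumption of this sketch).
    remaining = [s for s in all_schemas if s not in positives and s not in hard]
    extra = rng.sample(remaining, min(n_random_negatives, len(remaining)))

    # Shuffle so that positives and negatives are mixed across batches.
    queried = positives + hard + extra
    rng.shuffle(queried)

    # Batching: each instruction asks about at most `batch_size` schemas.
    instructions = []
    for start in range(0, len(queried), batch_size):
        batch = queried[start:start + batch_size]
        instructions.append(json.dumps({
            "instruction": "Extract entities of the listed types from the input.",
            "schema": batch,
            "input": text,
        }, ensure_ascii=False))
    return instructions


# Toy data, invented for this example only.
hard_negatives = {"person": ["organization"], "location": ["facility"]}
all_schemas = ["person", "organization", "location", "facility", "date", "event"]
prompts = build_instructions("Alice visited Paris.", ["person", "location"],
                             hard_negatives, all_schemas, batch_size=3)
for p in prompts:
    print(p)
```

With the toy data above, six schemas are queried in total and split across two instructions of three schemas each. Keeping the per-instruction schema count small and fixed is one way to keep the schema list seen in training close to the one used at evaluation time.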

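Experimental Results

Research Questions

RQ1: How can a large-scale, schema-aware bilingual IE corpus improve LLM-based information extraction, particularly in the zero-shot setting?
RQ2: How do the schema-based instruction strategies (hard negatives and batching) affect model generalization and robustness?
RQ3: Can models trained on IEPile outperform baselines in the zero-shot setting on English and Chinese IE tasks?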

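Key Findings

IEPile improves zero-shot IE performance over several baselines on NER, RE, and EE tasks in both English and Chinese.
Hard negative schema construction and batched instruction generation help mitigate the schema-query mismatch between training and evaluation and reduce semantic confusion.
Baichuan2-IEPile and LLaMA2-IEPile approach ChatGPT's English NER accuracy in some settings, suggesting strong zero-shot generalization from the corpus.
In the experiments, removing the hard negative schema dictionary lowers performance on semantically confusable schemas, highlighting its value for robustness.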


This review was generated by AI and checked by a human editor.