QUICK REVIEW

[Paper Review] GujiBERT and GujiGPT: Construction of Intelligent Information Processing Foundation Language Models for Ancient Texts

Dongbo Wang, Chang Liu | arXiv (Cornell University) | Jul 11, 2023

Computational and Text Analysis Methods | Cited by 11

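One-line summary

This paper presents GujiBERT and GujiGPT, foundation language models specialized for the intelligent processing of ancient Chinese texts, covering tasks that range from segmentation to translation.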

ABSTRACT

In the context of the rapid development of large language models, we have meticulously trained and introduced the GujiBERT and GujiGPT language models, which are foundational models specifically designed for intelligent information processing of ancient texts. These models have been trained on an extensive dataset that encompasses both simplified and traditional Chinese characters, allowing them to effectively handle various natural language processing tasks related to ancient books, including but not limited to automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation. Notably, these models have exhibited exceptional performance across a range of validation tasks using publicly available datasets. Our research findings highlight the efficacy of employing self-supervised methods to further train the models using classical text corpora, thus enhancing their capability to tackle downstream tasks. Moreover, it is worth emphasizing that the choice of font, the scale of the corpus, and the initial model selection all exert significant influence over the ultimate experimental outcomes. To cater to the diverse text processing preferences of researchers in digital humanities and linguistics, we have developed three distinct categories comprising a total of nine model variations. We believe that by sharing these foundational language models specialized in the domain of ancient texts, we can facilitate the intelligent processing and scholarly exploration of ancient literary works and, consequently, contribute to the global dissemination of China's rich and esteemed traditional culture in this new era.

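Motivation and objectives

Advance the development of specialized language models that support digital humanities and linguistic research on ancient texts.
Build models that handle both simplified and traditional Chinese characters.
Demonstrate performance on sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and translation.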
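Sentence segmentation and punctuation for unpunctuated ancient text are commonly framed as character-level sequence labeling, where each character is tagged with the mark (if any) that should follow it. The sketch below shows only that data framing. The abstract does not state the labeling scheme the authors used, so the `PUNCT` set, the `"O"` tag, and the helper names `to_char_labels` and `restore` are illustrative assumptions.

```python
# Assumed label scheme: each character is tagged with the punctuation mark
# that follows it, or "O" when nothing follows. Not taken from the paper.
PUNCT = set("，。！？；：、")


def to_char_labels(punctuated):
    """Strip punctuation and return (chars, labels) for sequence labeling."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCT:
            # Attach the mark to the preceding character. Leading marks are
            # dropped, and in a run of marks only the last one is kept.
            if labels:
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def restore(chars, labels):
    """Rebuild punctuated text from characters and their labels."""
    return "".join(c + (l if l != "O" else "") for c, l in zip(chars, labels))


text = "學而時習之，不亦說乎？"
chars, labels = to_char_labels(text)
print("".join(chars))
print(labels)
print(restore(chars, labels) == text)
```

A token-classification model would take `chars` as input and predict `labels`; `restore` then turns the predictions back into punctuated text.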

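Proposed method

Train GujiBERT and GujiGPT on a large corpus of ancient (classical) Chinese and modern Chinese text, including both simplified and traditional characters.
Evaluate on multiple NLP tasks, including automatic sentence segmentation, punctuation, word segmentation, part-of-speech tagging, entity recognition, and automatic translation.
Apply further self-supervised training on classical text corpora to improve downstream task performance.
Examine how the choice of font, corpus scale, and initial model selection affect the results.
Provide nine model variations in three categories to meet the diverse needs of researchers in digital humanities and linguistics.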
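For a BERT-style model, self-supervised further training usually means masked language modeling: some input tokens are hidden and the model learns to recover them from context. The sketch below builds one training example using the masking recipe from the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). Whether GujiBERT uses exactly these rates is an assumption, the helper `mask_tokens` is hypothetical, and a GPT-style model such as GujiGPT would instead be trained with next-token prediction.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for BERT-style masked language modeling.

    labels[i] is the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # position is ignored by the loss
    return inputs, labels


# Character-level tokens, as is typical for Chinese BERT vocabularies.
chars = list("學而時習之不亦說乎")
vocab = sorted(set(chars))  # toy vocabulary, for illustration only
inputs, labels = mask_tokens(chars, vocab, mask_prob=0.15, seed=0)
print(inputs)
print(labels)
```

In practice a library data collator would do this on token IDs in batches; the plain-Python version is only meant to make the objective concrete.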

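Experimental results

Research questions

RQ1: How should a foundation language model specialized for the intelligent information processing of ancient texts be designed?
RQ2: How do font choice, corpus scale, and initial model selection affect performance on ancient Chinese NLP tasks?
RQ3: Does self-supervised further training on classical corpora improve downstream tasks such as segmentation, tagging, NER, and translation?
RQ4: Do multiple model variants meet the diverse needs of users in digital humanities and linguistics?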

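Key findings

GujiBERT and GujiGPT perform strongly across a wide range of ancient-text processing tasks.
Self-supervised training on classical corpora improves downstream task capability.
Font choice, corpus scale, and initial model selection significantly affect experimental results.
The nine model variations in three categories give researchers with different preferences flexibility.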

This review was generated by AI and checked by a human editor.