QUICK REVIEW

[Paper Review] Chinese Open Instruction Generalist: A Preliminary Release

Ge Zhang, Yemin Shi|arXiv (Cornell University)|Apr 17, 2023
Natural Language Processing Techniques|Citations: 10
One-line summary

TL;DR: The paper presents COIG, a manually verified Chinese instruction tuning corpus project that aggregates ~200k samples across multiple domains, with open-source releases on HuggingFace and GitHub to support Chinese LLM instruction tuning.

ABSTRACT

Instruction tuning is widely recognized as a key technique for building generalist language models, which has attracted the attention of researchers and the public with the release of InstructGPT (Ouyang et al., 2022) and ChatGPT (https://chat.openai.com/). Despite impressive progress in English-oriented large-scale language models (LLMs), it is still under-explored whether English-based foundation LLMs can perform similarly on multilingual tasks compared to English tasks with well-designed instruction tuning and how we can construct the corpora needed for the tuning. To remedy this gap, we propose the project as an attempt to create a Chinese instruction dataset by various methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting Chinese Open Instruction Generalist (COIG) corpora are available in Huggingface (https://huggingface.co/datasets/BAAI/COIG) and Github (https://github.com/BAAI-Zlab/COIG), and will be continuously updated.

Research motivation and objectives

  • Address scarcity and quality gaps in Chinese instruction tuning data.
  • Construct a large, diverse, manually verified Chinese instruction corpus.
  • Provide domain-adapted pipelines and guidance for future Chinese instruction corpus construction.
  • Release open-source COIG data to support Chinese LLMs in both commercial and non-commercial use.

Proposed method

  • Curate translations from English instruction corpora with a three-phase translation pipeline (automatic translation, manual verification, manual correction).
  • Assemble domain-specific datasets including general instruction, exam instructions, human value alignment, counterfactual correction chats, and Leetcode instructions.
  • Implement strict human quality checks with multi-phase verification and expert annotators.
  • Annotate exam questions using an active-learning annotation template to extract instruction, context, question, answer, analysis, and subject.
  • Construct the CCMC counterfactual correction multi-round chat dataset based on CN-DBpedia with a five-round chat workflow between a teacher and a student.
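
The three-phase translation pipeline listed above (automatic translation, manual verification, manual correction) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Sample` class, its field names, and the three callables are hypothetical stand-ins, and in practice the verification and correction steps are carried out by human annotators rather than functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One instruction moving through the pipeline (hypothetical schema)."""
    source_en: str                     # original English instruction
    translation_zh: str = ""           # current Chinese translation
    passed_verification: bool = False  # True if a reviewer accepted the machine output
    corrected: bool = False            # True if a reviewer had to rewrite it


def three_phase_translate(
    english_instructions: List[str],
    machine_translate: Callable[[str], str],
    human_verify: Callable[[str, str], bool],
    human_correct: Callable[[str, str], str],
) -> List[Sample]:
    """Run each instruction through translate -> verify -> correct."""
    samples = []
    for text in english_instructions:
        # Phase 1: automatic translation.
        sample = Sample(source_en=text, translation_zh=machine_translate(text))
        # Phase 2: manual verification of the machine output.
        sample.passed_verification = human_verify(text, sample.translation_zh)
        # Phase 3: manual correction, only for samples that failed phase 2.
        if not sample.passed_verification:
            sample.translation_zh = human_correct(text, sample.translation_zh)
            sample.corrected = True
        samples.append(sample)
    return samples


# Toy stand-ins for the translator and the annotators (assumptions for the demo).
toy_mt = {"Say hello.": "说你好。", "Add 1 and 2.": "???"}
demo = three_phase_translate(
    list(toy_mt),
    machine_translate=lambda en: toy_mt[en],
    human_verify=lambda en, zh: zh != "???",
    human_correct=lambda en, zh: "计算1加2。",
)
for s in demo:
    print(s.source_en, "->", s.translation_zh, "| corrected:", s.corrected)
```

Keeping the verification verdict and the correction flag as separate fields makes it possible to report a correctness rate before and after manual correction, which is how the review states its quality figures.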

Experimental results

Research questions

  • RQ1: How can high-quality Chinese instruction tuning data be constructed to better align LLMs with Chinese culture and usage?
  • RQ2: What are the effects of domain-adapted pipelines (verification, format, culture, scaling) on data quality and diversity for Chinese instruction following?
  • RQ3: Can manually verified translation-based corpora, exam-style data, and value-alignment data improve Chinese LLM instruction tuning compared to model-generated or translated-only data?
  • RQ4: What is the feasibility and impact of releasing open COIG components for community-driven development?

Key findings

  • COIG provides 68k general Chinese instructions, 62k Chinese exam instructions, 3k human-value alignment instructions, and 13k counterfactual correction chat instructions as open samples.
  • A translation-based general instruction corpus totaling 67,798 instructions was built with a 96.63% correctness rate in initial verification and 97.24% after manual correction.
  • Exam instructions are annotated to extract six elements per question, spanning multiple subjects with a distribution favoring History, Politics, and Biology among coarse-grained categories.
  • The Leetcode Instructions dataset yields 11,737 code-related instructions across 10+ programming languages, including both code-to-text and text-to-code tasks.
  • The authors recommend a three-phase translation workflow and emphasize high-quality human verification, cultural alignment, and domain-specific pipelines for best results.
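
To make the two Leetcode task directions mentioned above concrete, the sketch below builds one text-to-code record and one code-to-text record per solution of a single problem. Everything here is an assumption for illustration: the `LeetcodeProblem` class, the record keys (`task`, `instruction`, `input`, `output`), and the English prompt wording are hypothetical and do not reproduce the released COIG schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LeetcodeProblem:
    """A problem with solutions in several languages (hypothetical schema)."""
    description: str           # natural-language problem statement
    explanation: str           # natural-language explanation of the approach
    solutions: Dict[str, str]  # language name -> solution source code


def build_instructions(problem: LeetcodeProblem) -> List[dict]:
    """Emit a text-to-code and a code-to-text record for every solution."""
    records = []
    for language, code in problem.solutions.items():
        # Text-to-code: the model reads the problem and must write code.
        records.append({
            "task": "text-to-code",
            "instruction": f"Solve the following problem in {language}.",
            "input": problem.description,
            "output": code,
        })
        # Code-to-text: the model reads code and must explain it.
        records.append({
            "task": "code-to-text",
            "instruction": f"Explain what the following {language} code does.",
            "input": code,
            "output": problem.explanation,
        })
    return records


example = LeetcodeProblem(
    description="Return the sum of two integers a and b.",
    explanation="The function adds the two arguments and returns the result.",
    solutions={
        "Python": "def add(a, b):\n    return a + b",
        "Go": "func add(a, b int) int { return a + b }",
    },
)
for record in build_instructions(example):
    print(record["task"], "|", record["instruction"])
```

Generating both directions from the same problem lets the number of instructions grow with the number of languages covered, which is consistent with a dataset spanning 10+ programming languages.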


This review was generated by AI and checked by a human editor.