QUICK REVIEW

[Paper Review] Chinese Open Instruction Generalist: A Preliminary Release

Ge Zhang, Yemin Shi|arXiv (Cornell University)|Apr 17, 2023

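Natural Language Processing Techniques | Cited by 10

One-Sentence Summary

This paper presents COIG, a manually verified Chinese instruction-tuning corpus project that aggregates about 200k samples across multiple domains and is released as open source on HuggingFace and GitHub to support instruction tuning of Chinese large language models.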

ABSTRACT

Instruction tuning is widely recognized as a key technique for building generalist language models, which has attracted the attention of researchers and the public with the release of InstructGPT (Ouyang et al., 2022) and ChatGPT (https://chat.openai.com/). Despite impressive progress in English-oriented large-scale language models (LLMs), it is still under-explored whether English-based foundation LLMs can perform similarly on multilingual tasks compared to English tasks with well-designed instruction tuning and how we can construct the corpora needed for the tuning. To remedy this gap, we propose the project as an attempt to create a Chinese instruction dataset by various methods adapted to the intrinsic characteristics of 4 sub-tasks. We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality. We also summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora. The resulting Chinese Open Instruction Generalist (COIG) corpora are available on Hugging Face (https://huggingface.co/datasets/BAAI/COIG) and GitHub (https://github.com/BAAI-Zlab/COIG), and will be continuously updated.

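Research Motivation and Objectives

Address the scarcity of Chinese instruction-tuning data and the gap in its quality.
Build a large, broad-coverage, manually verified Chinese instruction corpus.
Provide domain-adapted pipelines and guidelines for building future Chinese instruction corpora.
Release the COIG data as open source to support both commercial and non-commercial use of Chinese LLMs.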

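Proposed Method

Select and translate samples from English instruction corpora using a three-stage translation process (automatic translation, manual verification, manual correction).
Integrate domain-specific datasets: general instructions, exam instructions, human value alignment, counterfactual correction dialogues, and Leetcode instructions.
Apply strict human quality checks with multi-stage verification and expert annotation.
Annotate exam questions with an active-learning annotation template to extract the instruction, context, question, answer, analysis, and subject.
Build CCMC, a counterfactual correction multi-turn dialogue dataset, on top of CN-DBpedia, using a five-round dialogue workflow between a teacher and a student.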
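
The six elements extracted from each exam question can be pictured as one record per question. The following is a minimal sketch, assuming English field names and a simple prompt layout; `ExamRecord`, `to_prompt`, and the sample question are hypothetical illustrations and do not reflect the schema of the released COIG files.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ExamRecord:
    """One annotated exam question (illustrative field names, not the released schema)."""
    instruction: str  # what the model is asked to do
    context: str      # optional background passage, may be empty
    question: str     # the exam question itself
    answer: str       # reference answer
    analysis: str     # explanation of the answer
    subject: str      # coarse subject label

    def to_prompt(self) -> str:
        # Join the non-empty input parts in a fixed order (assumed layout).
        parts = [self.instruction, self.context, self.question]
        return "\n".join(p for p in parts if p)


# Made-up sample used only to exercise the structure.
record = ExamRecord(
    instruction="Answer the following question.",
    context="",
    question="What is 2 + 3?",
    answer="5",
    analysis="Adding 2 and 3 gives 5.",
    subject="math",
)

print(record.to_prompt())
print(json.dumps(asdict(record), ensure_ascii=False))
```

Keeping the answer and analysis separate from the prompt fields lets the same record serve either as a direct-answer target or as an answer-with-explanation target.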

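Experimental Results

Research Questions

RQ1: How can high-quality Chinese instruction-tuning data be built so that large models align better with Chinese culture and usage?
RQ2: How do domain-adaptive pipelines (verification, format, culture, scaling) affect the quality and diversity of Chinese instruction-following data?
RQ3: Compared with model-generated or translation-only data, do manually verified translated corpora, exam-style data, and value-alignment data improve instruction tuning of Chinese LLMs?
RQ4: How feasible is it to open the COIG components for community-driven development, and what impact does that have?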

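Key Findings

COIG provides 68k general Chinese instructions, 62k Chinese exam instructions, 3k human value alignment instructions, and 13k counterfactual correction dialogue instructions as open-source samples.
The translation-based general instruction corpus totals 67,798 instructions, with a correct rate of 96.63% in preliminary verification and 97.24% after manual correction.
Exam instructions are annotated to extract six elements from each question and cover multiple subjects; among the coarse-grained categories, history, politics, and biology are over-represented.
The Leetcode instruction dataset yields 11,737 coding-related instructions covering more than 10 programming languages and spanning code-to-text and text-to-code tasks.
The authors recommend a three-stage translation workflow and emphasize high-quality human verification, cultural alignment, and domain-specific pipelines for the best results.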
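
The three-stage workflow named above (automatic translation, manual verification, manual correction) can be sketched as a small pipeline. This is an illustration under stated assumptions rather than the authors' tooling: `Sample`, `run_pipeline`, and `pass_rate` are hypothetical names, and the translator, verifier, and corrector are toy stand-ins for a machine translation system and human annotators.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    source: str                         # original English instruction
    translation: str = ""               # stage 1 output
    verified_ok: Optional[bool] = None  # stage 2 label
    corrected: bool = False             # True if stage 3 rewrote the translation


def run_pipeline(
    sources: Iterable[str],
    translate: Callable[[str], str],
    verify: Callable[[Sample], bool],
    correct: Callable[[Sample], str],
) -> List[Sample]:
    samples = []
    for text in sources:
        s = Sample(source=text)
        s.translation = translate(text)   # stage 1: automatic translation
        s.verified_ok = verify(s)         # stage 2: manual verification
        if not s.verified_ok:
            s.translation = correct(s)    # stage 3: manual correction
            s.corrected = True
        samples.append(s)
    return samples


def pass_rate(samples: List[Sample]) -> float:
    # Share of samples accepted at the verification stage.
    if not samples:
        return 0.0
    return sum(1 for s in samples if s.verified_ok) / len(samples)


# Toy stand-ins: a lookup table plays the translator, a non-empty check
# plays the human verifier, and a second table plays the human corrector.
toy_mt = {"Sort the list.": "对列表排序。", "Name a fruit.": ""}
toy_fix = {"Name a fruit.": "说出一种水果。"}

samples = run_pipeline(
    list(toy_mt),
    translate=lambda text: toy_mt[text],
    verify=lambda s: bool(s.translation),
    correct=lambda s: toy_fix[s.source],
)
print(pass_rate(samples))
```

Recording the verification label separately from the corrected text keeps the stage-2 pass rate measurable after stage 3 has overwritten the failing translations.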

This review was generated by AI and reviewed by a human editor.