QUICK REVIEW

[论文解读] Qwen Technical Report

Jinze Bai, Shuai Bai|arXiv (Cornell University)|Sep 28, 2023

Topic Modeling被引用 80

一句话总结

QWEN 引入一个开源大语言模型家族（基础、聊天，以及专门的 CODE-QWEN 和 MATH-QWEN-CHAT），在数万亿个标记上进行预训练，通过 SFT 和 RLHF 对齐，具备工具使用和代码解释器能力，并在 14B 和 7B 规模开源。

ABSTRACT

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

研究动机与目标

演示 QWEN 基础模型和对齐聊天模型在多样化下游任务中的效果。
展示监督学习和来自人类反馈的强化学习对模型对齐的影响。
介绍专门的编码与数学模型（CODE-QWEN、CODE-QWEN-CHAT、MATH-QWEN-CHAT）及其性能。
向研究界开源 14B 和 7B 参数的基础和聊天模型。

提出的方法

对 QWEN 进行高达 3 兆标记的自回归预训练，使用多样化、可翻译的数据集。
数据预处理包括去重、质量过滤，以及对高质量来源的上采样。
使用 BPE 进行分词，词汇表为 152K，针对中文及多语言覆盖进行增强。
架构选择包括未绑定的嵌入、RoPE 位置编码、QKV 偏置配置、RMSNorm 与 SwiGLU 激活。
通过推理时的 NTK 感知插值、LogN-Scaling，以及逐层窗口注意力实现上下文长度扩展以处理长上下文。
通过以 ChatML 风格对话进行监督微调和使用奖励模型与 PPO 优化的 RLHF 进行对齐。

实验结果

研究问题

RQ1基础 QWEN 模型在相较于开源基线的多任务标准基准上表现如何？
RQ2对齐（SFT 和 RLHF）对聊天模型的性能和人类偏好回答有何影响？
RQ3用于编码的专门模型（CODE-QWEN）和用于数学的模型（MATH-QWEN-CHAT）是否在各自领域超越开源对手？
RQ4上下文长度扩展技术对长上下文理解和困惑度有何影响？
RQ5开源 QWEN 模型在零-shot 和少-shot 设置下与专有基线相比如何？

主要发现

QWEN-14B 在多个基准测试上超越先前的 13B SOTA 模型，在语言、知识和推理任务上均表现出色。
QWEN-CHAT 模型通过 RLHF 对齐，在基准测试中具有高度竞争力，接近 GPT-4，在某些测试上仍有领先。
专门的 CODE-QWEN 和 CODE-QWEN-CHAT 在 HumanEval、MBPP 及相关任务上实现了较高的代码理解与生成，优于开源同类。
MATH-QWEN-CHAT 模型（7B 和 14B）优于同尺寸的开源数学模型，并在 GSM8K 与 MATH 数据集上接近 GPT-3.5 的水平。
上下文长度扩展技术（NTK 感知插值、LogN-Scaling、逐层窗口化）在 8192 标记及以上长度下仍能有效维持性能。
QWEN-VL 与 QWEN-VL-CHAT 展示了在先前工作中的卓越视觉-语言能力，且开源版本已并入该系列。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。