QUICK REVIEW

[Paper Review] Qwen2.5-1M Technical Report

Yang An, B. X. Yu|ArXiv.org|Jan 26, 2025

Solidification and crystal growth phenomena | Cited by 7

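One-sentence summary

Qwen2.5-1M extends the context length to 1M tokens through long-context pre-training and post-training, and ships an open-source inference framework that combines length extrapolation, sparse attention, and system-level optimizations, delivering a 3x to 7x prefill speedup on 1M-token inputs.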

ABSTRACT

We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.

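Motivation and goals

Push long-context processing in LLMs beyond 128K tokens.
Develop efficient long-context pre-training and post-training strategies that improve long-range reasoning while preserving short-context performance.
Provide an open-source inference framework with length extrapolation, sparse attention, and engine optimizations to reduce cost and speed up deployment.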

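Proposed methods

Long-context pre-training on synthetic data, with training stages aimed at strengthening long-range dependencies.
Post-training on synthetic long-instruction data: two-stage supervised fine-tuning followed by offline reinforcement learning.
An open-source inference framework with length extrapolation (Dual Chunk Attention plus YaRN attention scaling), MInference-based sparse attention, and chunked prefill optimization; engine optimizations cover kernels, pipeline parallelism, and scheduling.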
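
To make the chunked prefill idea concrete, here is a minimal sketch in plain Python. It is not the Qwen2.5-1M framework's code: it uses dense attention rather than sparse kernels, and the function names and `chunk_size` parameter are hypothetical, chosen only for illustration. It shows that processing a prompt chunk by chunk against a growing key/value cache yields the same outputs as one full causal pass, while only one chunk of queries is active at a time.

```python
import math
import random

def attend(q, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def full_prefill(qs, ks, vs):
    """Causal attention over the whole prompt in a single pass."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

def chunked_prefill(qs, ks, vs, chunk_size):
    """Process the prompt chunk by chunk, appending to a growing KV cache."""
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(qs), chunk_size):
        end = min(start + chunk_size, len(qs))
        k_cache.extend(ks[start:end])
        v_cache.extend(vs[start:end])
        for i in range(start, end):
            # Token i attends to cached tokens 0..i only (causal mask).
            outputs.append(attend(qs[i], k_cache[: i + 1], v_cache[: i + 1]))
    return outputs

def random_vectors(n, dim, rng):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

rng = random.Random(0)
seq_len, dim = 10, 4
qs, ks, vs = (random_vectors(seq_len, dim, rng) for _ in range(3))

full = full_prefill(qs, ks, vs)
chunked = chunked_prefill(qs, ks, vs, chunk_size=3)
max_diff = max(abs(a - b) for x, y in zip(full, chunked) for a, b in zip(x, y))
print(max_diff)  # expected to be within floating-point tolerance of zero
```

In a real engine the per-chunk work would run as batched GPU kernels, so the chunk size would trade peak activation memory against kernel efficiency; the toy loop above only demonstrates the equivalence of the two schedules.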

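Experimental results

Research questions

RQ1: How can an LLM's context be extended to 1M tokens effectively while maintaining or improving short-context performance?
RQ2: Which data and training strategies best promote long-range dependencies in Qwen2.5-1M?
RQ3: How can extrapolation, sparse attention, and engine optimizations make inference over ultra-long contexts cheaper and more scalable?
RQ4: How do length extrapolation and sparsity refinement affect long-context retrieval and question-answering tasks?
RQ5: How do Qwen2.5-1M models compare with existing 1M-context alternatives on long-context benchmarks?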

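Key findings

With the open-source inference framework, Qwen2.5-1M models process 1M-token contexts and achieve a 3x to 7x speedup in the prefill stage.
Long-context training with progressively increasing context lengths (up to 262,144 tokens) and synthetic long-data tasks improves long-context understanding without sacrificing short-context performance.
Two-stage post-training plus offline reinforcement learning improves alignment with human preferences and generalizes to long-context tasks, with measurable gains on Longbench-Chat after RL.
Length extrapolation via Dual Chunk Attention (DCA) and YaRN attention scaling markedly improves performance on long-context tasks such as Passkey Retrieval and NIAH at context lengths up to 1M tokens.
On the Needle in a Haystack test with 1M-token contexts, MInference-based sparse attention, chunked prefill, and sparsity refinement recover most of the retrieval accuracy while delivering a substantial speedup.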
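
The relationship between the 262,144-token training length and the 1M-token target can be checked with a few lines of arithmetic. The sketch below multiplies the training length by the "at least four times" extrapolation factor from the abstract, then evaluates the attention temperature rule recommended in the YaRN paper, sqrt(1/t) = 0.1 ln(s) + 1. Whether the Qwen2.5-1M framework uses exactly these constants is an assumption here, and the function and constant names are hypothetical.

```python
import math

TRAINED_CONTEXT = 262_144   # longest training length cited in the findings above
EXTRAPOLATION_FACTOR = 4    # "at least four times", per the abstract

def yarn_attention_scale(s):
    """sqrt(1/t) under the YaRN paper's rule 0.1 * ln(s) + 1 (assumed, not confirmed for Qwen)."""
    return 0.1 * math.log(s) + 1.0

target_context = TRAINED_CONTEXT * EXTRAPOLATION_FACTOR
scale = yarn_attention_scale(EXTRAPOLATION_FACTOR)

print(target_context)    # 1048576, i.e. 2**20 tokens
print(round(scale, 4))   # 1.1386
```

Under these assumptions, a fourfold extrapolation from 262,144 tokens lands exactly on 1,048,576 tokens, which is the "1M" in the model name, and the attention logits are scaled up by roughly 14% to compensate for the longer context.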

This review was generated by AI and reviewed by human editors.