QUICK REVIEW

[Paper Review] Qwen3 Technical Report

An Yang, Anfeng Li | arXiv.org | 2025.05.14

Topic Modeling | Citations: 44

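One-Line Summary

Qwen3 introduces open-weight dense and MoE LLMs with both thinking and non-thinking modes at scales up to 235B parameters, provides a thinking budget and multilingual support for 119 languages, and uses post-training distillation to strengthen smaller models while achieving strong benchmark results.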

ABSTRACT

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.
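The "dynamic mode switching based on user queries or chat templates" mentioned in the abstract can be pictured with a small sketch. Everything here is an assumption made for illustration: the `/think` and `/no_think` flag strings, the `<think>` tag, and the helper names `resolve_mode` and `assistant_prefix` are hypothetical and are not presented as the actual Qwen3 template.

```python
from typing import Dict, List

THINK_FLAG = "/think"        # assumed flag that requests thinking mode
NO_THINK_FLAG = "/no_think"  # assumed flag that requests non-thinking mode


def resolve_mode(messages: List[Dict[str, str]], default_thinking: bool = True) -> bool:
    """Return True for thinking mode, False for non-thinking mode.

    The most recent flag found in a user or system message wins.
    If no flag appears anywhere, the default applies.
    """
    thinking = default_thinking
    for msg in messages:
        if msg["role"] not in ("user", "system"):
            continue
        content = msg["content"]
        pos_think = content.rfind(THINK_FLAG)
        pos_no_think = content.rfind(NO_THINK_FLAG)
        if pos_think == -1 and pos_no_think == -1:
            continue  # this message carries no flag
        # Within one message, whichever flag appears later takes effect.
        thinking = pos_think > pos_no_think
    return thinking


def assistant_prefix(thinking: bool) -> str:
    """Text a hypothetical template would pre-fill at the start of the reply."""
    # Non-thinking mode: pre-fill an empty think block so the model answers
    # directly. Thinking mode: pre-fill nothing and let the model reason.
    return "" if thinking else "<think>\n\n</think>\n\n"
```

The point of the design is that one set of weights serves both behaviors: the switch lives in the prompt, so no second model has to be loaded or routed to.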

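Motivation and Objectives

Advance open-weight large language models (LLMs) across both dense and MoE architectures.
Unify thinking and non-thinking modes in a single model to avoid switching between specialized systems.
Introduce a thinking budget that trades off reasoning depth against compute at inference time.
Expand multilingual support to 119 languages and dialects to improve cross-lingual understanding and generation.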
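The thinking budget in the objectives above can be sketched as a cap on how many reasoning tokens are kept before the model is forced to answer. This is a minimal illustration under stated assumptions: the token stream, the `</think>` closing tag, the wording of `STOP_NOTE`, and the function name `apply_thinking_budget` are all hypothetical, and the review does not specify how the actual mechanism is implemented.

```python
from typing import Iterable, List

END_THINK = "</think>"  # assumed tag that closes the reasoning block
# Hypothetical wording: the real stop instruction is not given in this review.
STOP_NOTE = "Budget reached; answering directly from the reasoning so far."


def apply_thinking_budget(thinking_tokens: Iterable[str], budget: int) -> List[str]:
    """Keep reasoning tokens until the model stops on its own or the budget is hit."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    kept: List[str] = []
    for token in thinking_tokens:
        if token == END_THINK:
            # The model finished reasoning within the budget.
            kept.append(END_THINK)
            return kept
        if len(kept) >= budget:
            # Budget exhausted: cut the reasoning short and close the block.
            kept.extend([STOP_NOTE, END_THINK])
            return kept
        kept.append(token)
    # The stream ended without a closing tag; close it so an answer can follow.
    kept.append(END_THINK)
    return kept
```

A larger `budget` lets more of the reasoning through at the cost of latency, which is the trade-off the objective describes.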

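Proposed Method

Pretrain dense and MoE Qwen3 models (0.6B to 235B parameters) on 36 trillion tokens covering 119 languages and dialects, using a three-stage schedule (general, reasoning, long context) together with long-context techniques.
Adopt Grouped Query Attention, SwiGLU, RoPE, and RMSNorm with QK-Norm, and for the MoE models use a 128-expert design with 8 experts activated per token plus a global-batch load balancing loss to encourage expert specialization.
Adopt a framework with two modes, Thinking and Non-Thinking, inside one model, with a thinking budget to control reasoning depth and a chat template system for dynamic mode switching.
After pretraining, apply a four-stage post-training pipeline that covers both thinking and non-thinking modes (two stages focused on thinking, two on non-thinking), and use strong-to-weak distillation so that smaller models inherit capabilities from larger teacher models.
Build Long-CoT cold-start data, reasoning RL (GRPO) on 3,995 query-verifier pairs, and a thinking mode fusion stage that merges thinking and non-thinking capabilities.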
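The MoE design in the list above (128 experts, 8 activated per token, plus a balancing loss) can be illustrated with a small pure-Python sketch. Only the two constants come from the review. The balancing term below is a Switch-Transformer-style auxiliary loss used as a stand-in, since the exact formulation of the report's loss is not given here, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

NUM_EXPERTS = 128  # total experts per MoE layer, as stated in the review
TOP_K = 8          # experts activated per token, as stated in the review


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits: List[float], k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def load_balance_loss(batch_logits: List[List[float]], k: int = TOP_K) -> float:
    """Stand-in auxiliary loss: n * sum_i f_i * p_i over every token in the batch.

    f_i is the share of routed slots that went to expert i, and p_i is the
    mean router probability of expert i. Statistics are pooled over the
    whole batch passed in, not per sequence.
    """
    n = len(batch_logits[0])
    counts = [0] * n
    prob_sums = [0.0] * n
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
        for i, _ in route_token(logits, k):
            counts[i] += 1
    tokens = len(batch_logits)
    f = [c / (tokens * k) for c in counts]
    p = [s / tokens for s in prob_sums]
    return n * sum(fi * pi for fi, pi in zip(f, p))
```

In this sketch a router with uniform probabilities gives a loss of 1.0, while a router that sends every token to the same 8 experts pushes it toward 128 / 8 = 16, so minimizing the term discourages collapsed routing.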

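Experimental Results

Research Questions

RQ1: Can the open-weight Qwen3 dense and MoE models achieve state-of-the-art or competitive performance on general, math/science (STEM), coding, and multilingual benchmarks?
RQ2: Does unifying thinking and non-thinking modes in a single model improve usability and efficiency compared with switching between separate models?
RQ3: How does the thinking budget affect inference latency and task performance across domains?
RQ4: How effective is strong-to-weak distillation at producing strong performance in lightweight models?
RQ5: How does expanding multilingual support to 119 languages affect cross-lingual capability and benchmark results?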

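Key Results

Qwen3-235B-A22B-Base achieves high benchmark performance across many tasks while using fewer activated parameters than larger base models.
The MoE base models match or exceed dense models while activating far fewer parameters, giving cost-efficient inference without sacrificing performance.
After post-training, both thinking and non-thinking modes remain competitive with leading proprietary models and large MoE models, with particularly strong results on coding, math, and agent tasks.
Increasing the thinking budget yields consistent performance gains across the evaluated tasks.
Qwen3-235B-A22B scores 85.7 on AIME'24, 81.5 on AIME'25, 70.7 on LiveCodeBench v5, 2,056 on CodeForces, and 70.8 on BFCL v3.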
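The "fewer activated parameters" claim in the results above can be made concrete from the model name alone, since `235B-A22B` denotes 235B total parameters with 22B activated per token. The short sketch below does that arithmetic; the helper names are hypothetical and the regular expression only assumes names of the form `<total>B-A<activated>B`.

```python
import re
from typing import Tuple

# Matches names such as "Qwen3-235B-A22B": total size, then "A" + activated size.
_NAME_PATTERN = re.compile(r"(\d+(?:\.\d+)?)B-A(\d+(?:\.\d+)?)B")


def parse_moe_name(name: str) -> Tuple[float, float]:
    """Return (total, activated) parameter counts in billions from a model name."""
    match = _NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"no '<total>B-A<activated>B' pattern in {name!r}")
    return float(match.group(1)), float(match.group(2))


def activated_fraction(name: str) -> float:
    """Share of the model's parameters that are active for each token."""
    total, activated = parse_moe_name(name)
    return activated / total


print(f"{activated_fraction('Qwen3-235B-A22B'):.1%}")  # prints 9.4%
```

Roughly one parameter in eleven is active per token, which is the basis for the cost-efficiency comparison against dense models of similar quality.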

This review was generated by AI and reviewed by a human editor.