QUICK REVIEW

[Paper Review] CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

Zhongyuan Peng, Caijun Xu|arXiv (Cornell University)|Feb 2, 2026

Topic Modeling | Citations: 0

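One-Line Summary

CoDiQ introduces six difficulty-enhancement strategies, a verification pipeline, and an RL-tuned generator, and uses test-time scaling to synthesize high-difficulty yet solvable questions at scale. The result is a 44K-question corpus that improves LRM reasoning when used for training.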

ABSTRACT

Large Reasoning Models (LRMs) benefit substantially from training on challenging competition-level questions. However, existing automated question synthesis methods lack precise difficulty control, incur high computational costs, and struggle to generate competition-level questions at scale. In this paper, we propose CoDiQ (Controllable Difficult Question Generation), a novel framework enabling fine-grained difficulty control via test-time scaling while ensuring question solvability. Specifically, first, we identify a test-time scaling tendency (extended reasoning token budget boosts difficulty but reduces solvability) and the intrinsic properties defining the upper bound of a model's ability to generate valid, high-difficulty questions. Then, we develop CoDiQ-Generator from Qwen3-8B, which improves the upper bound of difficult question generation, making it particularly well-suited for challenging question construction. Building on the CoDiQ framework, we build CoDiQ-Corpus (44K competition-grade question sequences). Human evaluations show these questions are significantly more challenging than LiveCodeBench/AIME with over 82% solvability. Training LRMs on CoDiQ-Corpus substantially improves reasoning performance, verifying that scaling controlled-difficulty training questions enhances reasoning capabilities. We open-source CoDiQ-Corpus, CoDiQ-Generator, and implementations to support related research.

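Motivation and Objectives

Motivate the scalable synthesis of competition-level, solvable questions to advance LRM reasoning.
Develop a controllable-difficulty framework that scales at test time while ensuring question validity.
Build verification and ranking mechanisms that balance difficulty against solvability.
Construct CoDiQ-Corpus (44K questions) and a dedicated generator to improve downstream reasoning.
Open-source the CoDiQ resources to support further research.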

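Proposed Method

Introduce six Difficulty-Enhancement Strategies that prompt an LLM to inject difficulty-raising elements into questions.
Propose the CoDiQ Pipeline, which combines iterative refinement with two verification modules (difficulty estimation and solvability verification).
Establish a relative-difficulty paradigm that uses LLM-based ranking (LLMs-Ranking) and ValueNetwork scoring (DS-VN) to produce continuous difficulty scores.
Develop CoDiQ-Bench to evaluate question-generation ability across models in a standardized way.
Build CoDiQ-Generator with reinforcement learning to optimize difficulty-progression and validity signals.
Create CoDiQ-Corpus, a collection of 44K competition-level math and coding questions, and verify that training on it improves reasoning.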
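The iterative refinement loop with its two verification modules can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the function `refine_question`, the `Candidate` type, the stopping rule, and the toy stand-ins for the rewriter, difficulty estimator, and solvability checker are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    text: str
    difficulty: float


def refine_question(
    seed: str,
    rewrite: Callable[[str], str],                 # hypothetical: LLM call that makes a question harder
    estimate_difficulty: Callable[[str], float],   # hypothetical: difficulty-estimation module
    is_solvable: Callable[[str], bool],            # hypothetical: solvability-verification module
    max_rounds: int = 4,
) -> List[Candidate]:
    """Return the chain of accepted questions, each strictly harder than the last."""
    accepted = [Candidate(seed, estimate_difficulty(seed))]
    for _ in range(max_rounds):
        proposal = rewrite(accepted[-1].text)
        score = estimate_difficulty(proposal)
        # Assumed stopping rule: halt at the first proposal that is not harder
        # or that the solvability check rejects.
        if score <= accepted[-1].difficulty or not is_solvable(proposal):
            break
        accepted.append(Candidate(proposal, score))
    return accepted


# Toy stand-ins so the sketch runs without any model.
def toy_rewrite(q: str) -> str:
    return q + " under an extra constraint"


def toy_difficulty(q: str) -> float:
    return float(len(q.split()))


def toy_solvable(q: str) -> bool:
    return len(q.split()) <= 20


chain = refine_question(
    "Compute the sum of the first n odd numbers",
    toy_rewrite, toy_difficulty, toy_solvable,
)
for c in chain:
    print(c.difficulty, c.text)
```

In the toy run, word count stands in for difficulty and a length cap stands in for solvability, so the chain ends as soon as a rewrite exceeds the cap. A real setup would replace all three callables with model-backed components.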

Figure 1 : Distribution of CoDiQ-Corpus Dataset

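Experimental Results

Research Questions

RQ1: How can test-time scaling be used to controllably raise question difficulty while preserving solvability?
RQ2: Which strategies effectively inject difficulty into question generation without producing invalid or unsolvable questions?
RQ3: Can a generator trained with reinforcement learning push the upper bound of high-difficulty, solvable questions further?
RQ4: Does training LRMs on a controlled-difficulty question corpus improve downstream reasoning performance?
RQ5: What are the trade-offs among difficulty, solvability, and computational cost in automated question synthesis?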

Key Findings

Model	Rounds	Tokens	DR-LLM	DR-VN	DR(AVG)
GPT-OSS-20B	2.9	5528.2	68.5	74.4	71.5
GLM-4.6	2.8	3385.8	71.2	65.8	68.5
Qwen3-32B	2.3	1239.3	50.6	54.8	52.7
Qwen3-8B	3.4	1130.5	39.2	59.6	49.4
GLM-Z1-9B-0414	2.7	1229.8	48.8	43.7	46.3
Qwen3-14B	3.1	2076.4	45.9	44.4	45.2
Qwen3-4B	2.8	1419.7	49.1	42.7	45.9
Qwen3-1.7B	3.3	844.5	25.6	37.1	31.4
Qwen3-0.6B	2.4	314.3	17.2	35.0	26.1
CoDiQ Prompt(ours) GLM-4.6	2.7	7143.8	73.2	83.3	78.3
CoDiQ Prompt(ours) GPT-OSS-20B	2.1	8057.3	63.8	61.5	62.7
CoDiQ Prompt(ours) Qwen3-32B	2.2	4893.6	63.0	46.5	54.8
CoDiQ Prompt(ours) Qwen3-14B	2.6	5281.9	53.9	44.2	49.1
CoDiQ Prompt(ours) Qwen3-4B	2.8	4422.3	49.1	42.7	45.9
CoDiQ Prompt(ours) Qwen3-8B	2.4	4155.6	49.8	41.9	45.8
CoDiQ Generator(ours) CoDiQ-Gen-8B	3.4	7499.6	58.9	58.1	58.5
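The review does not define DR(AVG), but the numbers in the table are consistent with it being the arithmetic mean of DR-LLM and DR-VN, rounded to one decimal. The short check below tests that reading against every row. The averaging formula is an assumption inferred from the values, not something stated in the source.

```python
# (DR-LLM, DR-VN, DR(AVG)) copied from the table above.
rows = {
    "GPT-OSS-20B": (68.5, 74.4, 71.5),
    "GLM-4.6": (71.2, 65.8, 68.5),
    "Qwen3-32B": (50.6, 54.8, 52.7),
    "Qwen3-8B": (39.2, 59.6, 49.4),
    "GLM-Z1-9B-0414": (48.8, 43.7, 46.3),
    "Qwen3-14B": (45.9, 44.4, 45.2),
    "Qwen3-4B": (49.1, 42.7, 45.9),
    "Qwen3-1.7B": (25.6, 37.1, 31.4),
    "Qwen3-0.6B": (17.2, 35.0, 26.1),
    "CoDiQ Prompt GLM-4.6": (73.2, 83.3, 78.3),
    "CoDiQ Prompt GPT-OSS-20B": (63.8, 61.5, 62.7),
    "CoDiQ Prompt Qwen3-32B": (63.0, 46.5, 54.8),
    "CoDiQ Prompt Qwen3-14B": (53.9, 44.2, 49.1),
    "CoDiQ Prompt Qwen3-4B": (49.1, 42.7, 45.9),
    "CoDiQ Prompt Qwen3-8B": (49.8, 41.9, 45.8),
    "CoDiQ-Gen-8B": (58.9, 58.1, 58.5),
}


def mean_dr(dr_llm: float, dr_vn: float) -> float:
    """Assumed definition: plain average of the two difficulty rankings."""
    return (dr_llm + dr_vn) / 2


# Largest disagreement between the assumed formula and the reported column.
max_gap = max(abs(mean_dr(a, b) - avg) for a, b, avg in rows.values())
print(f"largest gap between mean and reported DR(AVG): {max_gap:.3f}")
```

A gap no larger than the one-decimal rounding step would support the averaging interpretation. It would not prove that the paper defines the column this way.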

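The CoDiQ prompt increases reasoning-token usage for every model in the table above and raises DR(AVG) for several of them (GLM-4.6: 68.5 to 78.3; Qwen3-32B: 52.7 to 54.8), although GPT-OSS-20B and Qwen3-8B score lower with it.
Owing to its RL alignment, CoDiQ-Generator (8B) outperforms larger models such as Qwen3-32B at generating high-difficulty yet solvable questions.
Token-budget analysis shows that, within the range where generation remains feasible, harder questions consume more tokens.
Removing solvability verification raises the observed difficulty ceiling, which indicates that the verifier limits the frontier of solvable difficulty.
Measured by average difficulty on the DS-LLM and DS-VN metrics, CoDiQ-Corpus is harder than AIME, NuminaMath-1.5, LiveCodeBench, and Code-Contests.
Curriculum-learning experiments with CoDiQ-Corpus improve performance on MATH-500 and AIME-2024 relative to the baseline.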
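The review does not describe how the curriculum is built. One common way to turn continuous difficulty scores into a curriculum is to sort questions from easy to hard and cut the ordering into equal-sized stages. The sketch below illustrates that idea only: `build_curriculum`, the staging rule, and the sample scores are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def build_curriculum(scored: List[Tuple[str, float]], num_stages: int) -> List[List[str]]:
    """Split (question, difficulty) pairs into stages ordered from easiest to hardest."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not scored:
        return [[] for _ in range(num_stages)]
    ordered = sorted(scored, key=lambda pair: pair[1])
    stages: List[List[str]] = [[] for _ in range(num_stages)]
    for i, (question, _) in enumerate(ordered):
        # Integer arithmetic maps position i to a stage index in [0, num_stages).
        stages[i * num_stages // len(ordered)].append(question)
    return stages


# Made-up difficulty scores, purely for illustration.
sample = [("q_a", 0.72), ("q_b", 0.15), ("q_c", 0.48), ("q_d", 0.91), ("q_e", 0.33), ("q_f", 0.60)]
stages = build_curriculum(sample, num_stages=3)
for idx, stage in enumerate(stages, start=1):
    print(f"stage {idx}: {stage}")
```

Training would then visit stage 1 first and the hardest stage last. Whether the authors use equal-sized stages, overlapping stages, or some other schedule is not stated in this review.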

Figure 2 : Question Difficulty Scaling on CoDiQ-Bench. Scatter plot showing the relationship between average reasoning tokens and difficulty ranking (DR-AVG) for models using CoDiQ Prompt. Each point represents a model, demonstrating the positive correlation between increased reasoning computation a
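The trend the caption describes can be checked against the six CoDiQ Prompt rows of the results table. The snippet below computes a Pearson correlation between average reasoning tokens and DR(AVG) for those rows. It is a rough sanity check on six points, not a reproduction of the figure, and the choice of Pearson correlation is an assumption made here.

```python
import math

# (average reasoning tokens, DR(AVG)) for the CoDiQ Prompt rows of the table.
points = {
    "GLM-4.6": (7143.8, 78.3),
    "GPT-OSS-20B": (8057.3, 62.7),
    "Qwen3-32B": (4893.6, 54.8),
    "Qwen3-14B": (5281.9, 49.1),
    "Qwen3-4B": (4422.3, 45.9),
    "Qwen3-8B": (4155.6, 45.8),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


tokens = [t for t, _ in points.values()]
dr_avg = [d for _, d in points.values()]
r = pearson(tokens, dr_avg)
print(f"Pearson r between tokens and DR(AVG): {r:.2f}")
```

With only six models, the coefficient is sensitive to single points such as GPT-OSS-20B, which uses the most tokens but does not have the highest DR(AVG).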

This review was generated by AI and checked by a human editor.