QUICK REVIEW

[Paper Review] ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following

Yuancheng Yang, Lin Yang|arXiv (Cornell University)|Feb 4, 2026

Multimodal Machine Learning Applications|Citations: 0

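One-Line Summary

ImpRIF improves complex instruction following by formalizing implicit reasoning as verifiable reasoning graphs. This formulation enables graph-driven CoT and RL-based training, and the resulting models outperform base models of comparable size.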

ABSTRACT

As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs' understanding of implicit reasoning instructions, thereby improving its ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.

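Motivation and Objectives

Motivate robust following of instructions that involve implicit reasoning and multiple interdependent constraints.
Formalize the reasoning as explicit reasoning graphs so that it can be verified and followed in a structured way.
Generate large-scale single-turn and multi-turn implicit reasoning data from ERG structures with verifiable constraints.
Train models with supervised fine-tuning (SFT) on graph-based thinking patterns and with process-verified reinforcement learning.
Demonstrate state-of-the-art performance at comparable parameter scales on multiple benchmarks.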

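Proposed Method

Formalize reasoning as a DAG (directed acyclic graph) in which nodes are conditional, mathematical, or knowledge-based actions and edges encode dependencies.
Generate large-scale single-turn and multi-turn instructions from ERG structures equipped with verifiable constraints and CoT data.
Perform SFT with an ERG-aligned thinking pattern that traverses dependencies in root-to-leaf order and verifies every constraint.
Apply GRPO-based reinforcement learning with a multi-granularity reward that jointly evaluates constraint adherence and thinking quality.
In RL, use a thinking-process reward that compares the model-generated reasoning against the ERG CoT reference, together with a partial-reward design that encourages better outputs.
Apply the method to Qwen3- and DeepSeek-based baselines to show cross-architecture generalization, and run ablations over thinking patterns and rewards.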
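
To make the graph formulation above more concrete, here is a minimal sketch of a reasoning DAG with dependency-ordered evaluation and a programmatic check. It is not the authors' implementation: the `Node` class, the `evaluate` and `verify` helpers, and the toy instruction are all hypothetical and exist only to illustrate the idea of nodes as conditional, mathematical, or knowledge-based actions connected by dependency edges.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Node:
    """One reasoning step; `kind` is 'conditional', 'math', or 'knowledge'."""
    name: str
    kind: str
    action: Callable[[Dict[str, object]], object]
    deps: List[str] = field(default_factory=list)


def evaluate(nodes: List[Node], inputs: Dict[str, object]) -> Dict[str, object]:
    """Resolve every node in dependency order, so a node runs only after its parents."""
    by_name = {n.name: n for n in nodes}
    # TopologicalSorter takes {node: predecessors} and raises CycleError on a cycle.
    order = TopologicalSorter({n.name: n.deps for n in nodes}).static_order()
    values = dict(inputs)
    for name in order:
        if name in by_name:  # names that are raw inputs have no action to run
            values[name] = by_name[name].action(values)
    return values


def verify(response: str, values: Dict[str, object]) -> bool:
    """Programmatic check: the response must contain exactly `n_bullets` bullet lines."""
    bullets = [ln for ln in response.splitlines() if ln.startswith("- ")]
    return len(bullets) == values["n_bullets"]


# Hypothetical instruction (an assumption for illustration): "If the text has more
# than five words, answer with twice as many bullets as there are primary colors;
# otherwise use one bullet."
nodes = [
    Node("word_count", "math", lambda v: len(v["text"].split()), deps=["text"]),
    Node("is_long", "conditional", lambda v: v["word_count"] > 5, deps=["word_count"]),
    Node("n_primary", "knowledge", lambda v: 3),
    Node("n_bullets", "math",
         lambda v: 2 * v["n_primary"] if v["is_long"] else 1,
         deps=["is_long", "n_primary"]),
]

values = evaluate(nodes, {"text": "one two three four five six"})
response = "\n".join(f"- point {i}" for i in range(values["n_bullets"]))
print(values["n_bullets"], verify(response, values))
```

The instruction never states the required bullet count directly; it only becomes known after the conditional, knowledge, and math nodes are resolved in order, which is the kind of implicit constraint the review describes. Because the final constraint is computed from the graph, a response can be checked mechanically rather than by inspection.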

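Experimental Results

Research Questions

RQ1: Can implicit reasoning instructions be effectively modeled as verifiable reasoning graphs to improve instruction following?
RQ2: Do graph-oriented reasoning and process-verified RL yield improvements for mid-scale models on complex instruction benchmarks?
RQ3: How do ERG-aligned thinking and other thinking patterns affect instruction following and reasoning quality?
RQ4: How well does ImpRIF generalize to model architectures beyond the base model (for example, DeepSeek or Llama-derived variants)?
RQ5: How do SFT, RL, and their combination each contribute to performance on multi-constraint implicit instructions?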

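Key Findings

ImpRIF-32B is competitive with or superior to larger and proprietary models on five complex instruction benchmarks.
Applying RL after SFT consistently outperforms SFT alone or RL alone across benchmarks.
ERG-style CoT often improves performance over original thinking and structured thinking on smaller models, and remains competitive on larger models.
Thinking-process rewards and partial rewards are synergistic, especially on logic-heavy benchmarks.
In cross-architecture experiments, applying ImpRIF to different foundation models yields some improvement.
On LogicBench, ImpRIF variants outperform their base models, indicating that the gains in implicit reasoning transfer.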
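
The reward design referred to in these findings (partial rewards combined with a thinking-process reward, optimized with GRPO) can be illustrated with a small sketch. The review does not give the actual formulas, so the weighting scheme, the `w_think` parameter, and the assumption that the thinking-process score is already a number in [0, 1] are hypothetical. The one standard ingredient is the group-relative advantage used by GRPO, which normalizes each sampled response's reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def partial_reward(checks: Sequence[bool]) -> float:
    """Fraction of constraints satisfied, giving partial credit instead of all-or-nothing."""
    return sum(checks) / len(checks) if checks else 0.0


def combined_reward(checks: Sequence[bool], thinking_score: float,
                    w_think: float = 0.5) -> float:
    """Weighted mix of constraint adherence and thinking quality.

    `thinking_score` is assumed to lie in [0, 1]; how it is obtained (for example,
    by comparing reasoning against a reference chain of thought) is out of scope.
    The linear mix and `w_think` are illustrative assumptions.
    """
    return (1.0 - w_think) * partial_reward(checks) + w_think * thinking_score


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps).

    Population standard deviation is used here as an assumption; implementations
    differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, each with three constraint checks
# and a hypothetical thinking-process score.
group = [
    ([True, True, True], 0.9),
    ([True, True, False], 0.6),
    ([True, False, False], 0.4),
    ([False, False, False], 0.1),
]
rewards = [combined_reward(checks, score) for checks, score in group]
advantages = group_relative_advantages(rewards)
print([round(r, 3) for r in rewards])
print([round(a, 3) for a in advantages])
```

With partial credit, a response that satisfies two of three constraints is ranked above one that satisfies only one, so the policy still receives a useful learning signal when no sample in the group is fully correct.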


This review was generated by AI and checked by a human editor.