QUICK REVIEW

[Paper Review] Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment

Xinyu Zhang|arXiv (Cornell University)|Mar 23, 2026

Topic Modeling | Citations: 0

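One-Sentence Summary

NSRSA adds a symbolic verification subsystem to self-training, filtering training data at the reasoning-step level to prevent error propagation, improve GSM8K performance, and promote cross-task transfer.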

ABSTRACT

Recursive self-improvement--where a model iteratively trains on its own outputs--promises sustained capability growth but faces a fundamental obstacle: recursive drift. As models train on self-generated data across multiple iterations, errors in intermediate reasoning compound, leading to mode collapse and performance degradation. We propose Neuro-Symbolic Recursive Self-Alignment (NSRSA), which stabilizes iterative self-training by embedding a symbolic verification subsystem that gates training data quality at the reasoning step level. Unlike outcome-only filtering (which admits "lucky guesses" with flawed reasoning), NSRSA verifies each arithmetic operation via sympy, checks logical flow consistency across reasoning steps, and enforces domain constraints. We evaluate NSRSA on GSM8K using Qwen3-4B-Thinking across 5 self-training iterations under five conditions: no verification, outcome verification, majority voting, full NSRSA symbolic verification, and NSRSA with DPO. Our filtering analysis shows that NSRSA rejects approximately 34% of correct-answer solutions that pass outcome verification, eliminating "lucky guesses" with flawed reasoning from the training set. We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%). NSRSA provides an extensible framework that demonstrates how external symbolic verification can make recursive self-improvement measurable and reliable within domains where automated verification is available.

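Motivation and Objectives

Motivates recursive self-improvement while addressing recursive drift in self-generated data.
Introduces a step-level symbolic verification framework that gates training data quality.
Shows that symbolically verified reasoning yields more stable and reliable recursion across iterations.
Shows that verification-based training improves cross-task transfer and provides a reproducible pipeline.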

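Proposed Method

Embeds a symbolic verification subsystem in the self-training loop and gates training data with four checks: answer correctness, arithmetic verification via sympy, logical flow consistency, and domain constraint satisfaction.
Compares four verification strategies: no verification, outcome-only verification, majority voting, and full NSRSA symbolic verification. The last is also run with DPO, giving five experimental conditions in total.
Uses NSRSA to filter self-generated solutions before fine-tuning, and evaluates 5 self-training iterations on GSM8K with Qwen3-4B-Thinking.
Builds Direct Preference Optimization (DPO) pairs from solutions that pass versus fail NSRSA verification, teaching the model to prefer sound reasoning over lucky guesses.
Provides a reproducible pipeline covering data generation, verification, training, and evaluation.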
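The difference between outcome-only filtering and a step-level gate can be illustrated with a minimal sketch. It is not the authors' implementation: the paper verifies arithmetic with sympy, whereas this sketch substitutes a small stdlib evaluator with exact fractions so that it runs without dependencies. The step format ("expression = value" inside free text) and the names `check_step` and `passes_gate` are assumptions, and the logical-flow and domain-constraint checks are not covered.

```python
import ast
import operator
import re
from fractions import Fraction

# Assumed step format: free text that may contain "expression = value".
_CLAIM = re.compile(r"((?:\d|\()[0-9.\s()+\-*/]*?)\s*=\s*(-?\d+(?:\.\d+)?)")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node):
    """Evaluate a parsed +, -, *, / expression exactly using Fractions."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return Fraction(str(node.value))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")


def check_step(step: str) -> bool:
    """Return True if every 'expression = value' claim in the step is exact."""
    for match in _CLAIM.finditer(step):
        try:
            actual = _evaluate(ast.parse(match.group(1).strip(), mode="eval"))
            claimed = Fraction(match.group(2))
        except (SyntaxError, ValueError, ZeroDivisionError):
            return False  # conservative: unparseable claims fail the gate
        if actual != claimed:
            return False
    return True


def passes_gate(steps, final_answer: str, reference: str) -> bool:
    """Outcome check plus step-level arithmetic check (two of the four checks)."""
    answer_ok = Fraction(final_answer) == Fraction(reference)
    return answer_ok and all(check_step(step) for step in steps)


sound = ["48 / 2 = 24", "48 + 24 = 72"]
lucky = ["48 / 2 = 26", "48 + 26 = 72"]  # right answer, wrong arithmetic
print(passes_gate(sound, "72", "72"), passes_gate(lucky, "72", "72"))
```

The second solution reaches the reference answer and would survive outcome-only filtering, but both of its arithmetic claims are false, so the step-level gate rejects it.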

Figure 1: NSRSA pipeline. At each iteration, the model generates multiple solutions per problem. The symbolic verification subsystem checks answer correctness, arithmetic validity (via sympy), logical flow consistency, and domain constraints. Only solutions passing all checks enter the training set.
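The same verification verdicts can also feed preference learning. The sketch below shows one plausible way to turn the sampled solutions for a single problem into DPO preference pairs. The `Solution` class, the `build_preference_pairs` function, the `max_pairs` cap, and the prompt/chosen/rejected record layout are assumptions made for illustration; the review does not describe how the authors pair solutions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Solution:
    text: str
    verified: bool  # verdict of the symbolic gate, assumed to be precomputed


def build_preference_pairs(prompt, solutions, max_pairs=4):
    """Pair each verified solution with each solution that failed verification."""
    chosen = [s for s in solutions if s.verified]
    rejected = [s for s in solutions if not s.verified]
    pairs = [
        {"prompt": prompt, "chosen": c.text, "rejected": r.text}
        for c, r in product(chosen, rejected)
    ]
    # Cap the pairs per problem so that heavily sampled problems do not dominate.
    return pairs[:max_pairs]


samples = [
    Solution("48 / 2 = 24, 48 + 24 = 72", verified=True),
    Solution("48 / 2 = 26, 48 + 26 = 72", verified=False),
    Solution("Half of 48 is 24, so 48 + 24 = 72", verified=True),
]
pairs = build_preference_pairs("How many clips were sold in total?", samples)
```

A problem contributes pairs only when it has at least one verified and at least one rejected solution; problems where every sample passes or every sample fails yield an empty list.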

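Experimental Results

Research Questions

RQ1: Can step-level symbolic verification reduce recursive drift in iterative self-training compared with outcome-only verification?
RQ2: How does NSRSA affect GSM8K accuracy, self-consistency, and mode diversity over multiple self-training iterations?
RQ3: Does symbolically verified reasoning improve cross-task transfer to MATH-500, and does it benefit from DPO policy learning?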

Key Findings

GSM8K accuracy (%) at each self-training iteration (Base through Iter 5):
Condition	Base	Iter 1	Iter 2	Iter 3	Iter 4	Iter 5	Depth
No Verification	80.5	82.1	81.2	78.1	75.4	73.2	2
Outcome Verification	80.5	84.8	86.2	86.6	86.3	85.8	>5
Majority Voting	80.5	84.0	85.8	86.1	85.7	85.1	>5
NSRSA (Symbolic)	80.5	84.2	86.7	88.2	89.5	91.0	>5
NSRSA + DPO	80.5	84.5	87.1	88.9	90.1	91.2	>5
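The trends in the table are easier to compare as a net change from the base model and the iteration at which each condition peaks. The short script below derives both from the reported numbers; the `summarize` helper is an illustrative name, and the Depth column is left out because the review does not define it.

```python
# Accuracy values copied from the table above: Base, then Iter 1 through Iter 5.
results = {
    "No Verification": [80.5, 82.1, 81.2, 78.1, 75.4, 73.2],
    "Outcome Verification": [80.5, 84.8, 86.2, 86.6, 86.3, 85.8],
    "Majority Voting": [80.5, 84.0, 85.8, 86.1, 85.7, 85.1],
    "NSRSA (Symbolic)": [80.5, 84.2, 86.7, 88.2, 89.5, 91.0],
    "NSRSA + DPO": [80.5, 84.5, 87.1, 88.9, 90.1, 91.2],
}


def summarize(scores):
    """Return net change from base to the final iteration and the peak iteration."""
    peak_iteration = max(range(len(scores)), key=scores.__getitem__)
    return {
        "net_change": round(scores[-1] - scores[0], 1),
        "peak_iteration": peak_iteration,  # 0 means the base model
        "peak": scores[peak_iteration],
    }


summary = {name: summarize(scores) for name, scores in results.items()}
for name, stats in summary.items():
    print(f"{name}: {stats}")
```

Only the two NSRSA conditions peak at the final iteration. No verification peaks at iteration 1 and ends 7.3 points below the base model, while outcome verification and majority voting both peak at iteration 3.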

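NSRSA sustains accuracy growth over 5 iterations and reaches 91.0% accuracy on GSM8K, whereas no verification collapses and outcome-only verification plateaus.
NSRSA rejects about 34% of the correct-answer solutions that pass outcome verification, removing flawed reasoning from the training data.
DPO based on NSRSA-derived preference pairs improves reward accuracy from 46% to 63% and brings GSM8K accuracy to 91.2% (versus 91.0% for NSRSA without DPO).
NSRSA achieves positive cross-task transfer to MATH-500, improving from 45.5% to 51.2% (+5.7 points).
NSRSA maintains solution diversity across iterations (lower Self-BLEU) and suppresses mode collapse relative to outcome-only methods.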
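The review does not define "reward accuracy". Assuming it follows the standard DPO formulation, the loss and the metric can be written as below, where x is the problem, y_w the solution that passed verification, y_l the solution that failed it, \pi_ref the frozen reference model, and \beta a scaling hyperparameter. None of these symbols appear in the review itself.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]

\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
\qquad
\text{reward accuracy} = \Pr\!\left[ \hat{r}_\theta(x, y_w) > \hat{r}_\theta(x, y_l) \right]
```

Under this reading, 50% corresponds to chance on a pairwise comparison, so the reported move from 46% to 63% would mean the model goes from roughly chance level to preferring the verified solution in about five of every eight pairs.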

Figure 2: GSM8K accuracy across 5 self-training iterations. NSRSA (green) enables stable recursive improvement. Outcome verification (orange) plateaus after iteration 2. No verification (red) collapses by iteration 3.


This review was generated by AI and checked by a human editor.