QUICK REVIEW

[Paper Review] Reliability-Aware Adaptive Self-Consistency for Efficient Sampling in LLM Reasoning

J. Kim, Nakyeong Yang|arXiv (Cornell University)|Jan 6, 2026

Topic Modeling|Citations: 0

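One-Line Summary

ReASC introduces a reliability-aware, two-stage adaptive self-consistency framework for LLM reasoning that uses a single-sample decision and confidence-weighted accumulation to cut inference cost while maintaining accuracy.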

ABSTRACT

Self-Consistency improves reasoning reliability through multi-sample aggregation, but incurs substantial inference cost. Adaptive self-consistency methods mitigate this issue by adjusting the sampling budget; however, they rely on count-based stopping rules that treat all responses equally, often leading to unnecessary sampling. We propose Reliability-Aware Adaptive Self-Consistency (ReASC), which addresses this limitation by reframing adaptive sampling from response counting to evidence sufficiency, leveraging response-level confidence for principled information aggregation. ReASC operates in two stages: a single-sample decision stage that resolves instances confidently answerable from a single response, and a reliability-aware accumulation stage that aggregates responses by jointly leveraging their frequency and confidence. Across five models and four datasets, ReASC consistently achieves the best accuracy-cost trade-off compared to existing baselines, yielding improved inference efficiency across model scales from 3B to 27B parameters. As a concrete example, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy on GSM8K using Gemma-3-4B-it.

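Motivation and Objectives

Improve the efficiency of Self-Consistency (SC) by addressing the inefficiency of count-based stopping.
Propose a two-stage reliability-aware framework that uses response-level confidence to guide evidence accumulation.
Show that response reliability improves adaptive sampling decisions across multiple model families and datasets.
Quantify the accuracy-cost trade-off across model scales from 3B to 27B parameters, achieving substantial cost reductions without sacrificing accuracy.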

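Proposed Method

Introduce Stage 1 (Single-Sample Decision), which uses a confidence-based gate (tau_gate) to decide whether a single response provides sufficient evidence.
Introduce Stage 2 (Reliability-Aware Accumulation), which aggregates evidence through Beta updates weighted by an exponential mapping of the confidence S(y).
Track the leading candidate with Beta posterior updates, and stop once P(p1 > p2 | V) >= C_threshold or the maximum budget is reached.
Estimate response reliability with a confidence signal based on Bottom 10% Group Confidence, derived from token-level self-certainty.
Calibrate the confidence statistics (mu, sigma) and the gating threshold offline or online; in the online setting, a two-component Gaussian mixture model is used when labels are unavailable.
Provide an offline gating-threshold calibration procedure (Algorithm 1) and an online calibration procedure (Algorithm 2).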
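
The two stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the review does not give the exact weighting formula, so `evidence_weight` (an exponential of the standardized confidence), the default `max_budget`, and the function names are all hypothetical. The stopping probability is computed for the two-candidate reduction, where P(p1 > p2 | V) becomes P(p > 0.5) for p ~ Beta(v1 + 1, v2 + 1), as in ASC-style Beta criteria.

```python
import math


def prob_lead_beats_runner_up(v1, v2, steps=20000):
    """Approximate P(p > 0.5) for p ~ Beta(v1 + 1, v2 + 1) by midpoint integration.

    v1 and v2 are (possibly fractional) evidence totals for the leading and
    runner-up answers. The uniform prior is an assumption of this sketch.
    """
    a, b = v1 + 1.0, v2 + 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    h = 0.5 / steps
    total = 0.0
    for i in range(steps):
        x = 0.5 + (i + 0.5) * h  # midpoint of the i-th slice of [0.5, 1]
        total += math.exp(
            log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x)
        )
    return total * h


def evidence_weight(conf, mu, sigma):
    """Hypothetical exponential map of the standardized confidence S(y)."""
    return math.exp((conf - mu) / sigma)


def reasc_sketch(samples, mu, sigma, tau_gate, c_threshold=0.95, max_budget=40):
    """Consume (answer, confidence) pairs and return (answer, samples_used)."""
    votes = {}
    used = 0
    for answer, conf in samples:
        used += 1
        # Stage 1: accept the first response outright if it clears the gate.
        if used == 1 and conf >= tau_gate:
            return answer, used
        # Stage 2: add confidence-weighted evidence for this answer.
        votes[answer] = votes.get(answer, 0.0) + evidence_weight(conf, mu, sigma)
        ranked = sorted(votes.values(), reverse=True)
        v1 = ranked[0]
        v2 = ranked[1] if len(ranked) > 1 else 0.0
        # Stop on sufficient evidence or when the budget is exhausted.
        if used >= max_budget or prob_lead_beats_runner_up(v1, v2) >= c_threshold:
            break
    if not votes:
        raise ValueError("no samples provided")
    return max(votes, key=votes.get), used
```

Setting every weight to 1 recovers a count-based rule, which makes it easy to compare how many samples each variant consumes on the same response stream.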

Figure 1: Count-based stopping may lead to inefficient evidence accumulation. Ignoring response reliability, count-based criteria may require unnecessary additional samples, while ReASC reaches the same decision with fewer samples.
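
A short worked calculation shows why count-based stopping can demand extra samples. It assumes an ASC-style Beta criterion with a uniform prior and a two-candidate reduction, and it assumes each response contributes a fixed weight $w$; neither detail is confirmed by this review. If all $n$ sampled responses agree, the posterior is $\mathrm{Beta}(nw + 1, 1)$, so

```latex
\begin{aligned}
P\!\left(p_1 > \tfrac{1}{2} \mid V\right)
  &= \int_{1/2}^{1} (nw + 1)\, p^{\,nw}\, dp
   = 1 - 2^{-(nw + 1)}, \\
1 - 2^{-(nw + 1)} \ge 0.95
  &\iff nw \ge \log_2 20 - 1 \approx 3.32 .
\end{aligned}
```

With unit weights ($w = 1$), three unanimous responses give $0.9375$ and a fourth is needed to reach $0.96875$. With $w = 2$ for high-confidence responses, two responses already give $1 - 2^{-5} \approx 0.969$.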

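Experimental Results

Research Questions

RQ1: Does incorporating response-level reliability improve the efficiency of adaptive self-consistency in LLM reasoning?
RQ2: Does the two-stage framework (single-sample decision plus reliability-aware accumulation) reduce inference cost while maintaining or improving accuracy across model families and datasets?
RQ3: How does confidence-weighted evidence accumulation differ from count-based stopping in sample efficiency and stability?
RQ4: What are effective strategies for calibrating the confidence signal and decision thresholds in offline and online settings?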
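
For the label-free online setting in RQ4, the method summary mentions a two-component Gaussian mixture over confidence scores. The following is a minimal EM sketch of such a fit in one dimension. The function names, the initialization from the lower and upper halves of the sorted scores, and the `min_sigma` floor are assumptions of this sketch; how ReASC maps the fitted components to mu, sigma, and tau_gate is not specified in this review and is deliberately left out.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(values, iters=100, min_sigma=1e-3):
    """Fit a 1-D two-component Gaussian mixture with EM.

    Returns two (mu, sigma, weight) tuples sorted by mu.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four confidence scores")
    half = n // 2
    # Initialize the means from the lower and upper halves of the data.
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    mean_all = sum(xs) / n
    spread = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    sigmas = [max(spread, min_sigma)] * 2
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total] if total > 0.0 else [0.5, 0.5])
        # M-step: re-estimate mean, spread, and mixing weight per component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue  # leave an empty component unchanged
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), min_sigma)
            weights[k] = nk / n
    return sorted(zip(mus, sigmas, weights))
```

One plausible reading is that the higher-mean component captures the more reliable responses, but that interpretation is an assumption here, not a claim from the paper.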

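Key Findings

ReASC achieves the best accuracy-cost trade-off (Acc/TF) across five models and four reasoning datasets compared with SC and existing adaptive baselines.
On GSM8K with Gemma-3-4B-it, ReASC reduces inference cost by up to 70% relative to self-consistency while preserving accuracy.
Stage 1 resolves a growing share of instances from a single response as model scale increases, while keeping accuracy on those instances high (above 90% in most cases).
For instances that Stage 1 does not resolve, Stage 2 consistently reduces inference cost relative to count-based stopping while maintaining accuracy.
Confidence-weighted Beta updates speed up convergence to the stopping threshold and substantially improve sampling efficiency (for example, 4 updates versus 7 for ASC).
The stage-wise ablation shows that Stage 1 and Stage 2 play complementary roles: Stage 1 cuts unnecessary sampling, and Stage 2 accelerates evidence accumulation when more samples are needed.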

Figure 2: Comparison of two confidence signals. Using Gemma 3 4B-Instruct on MATH500, Bottom 10% Group Confidence shows a larger separation between correct and incorrect responses than Response-level Self-Certainty.
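
To make the signal in Figure 2 concrete, here is a minimal sketch of a bottom-fraction group confidence. It assumes that per-token confidence values are already available, that groups are sliding windows over those values, and that the score is the mean of the lowest 10% of window averages. The window size and the function name are hypothetical, and the paper's exact definition may differ.

```python
def bottom_group_confidence(token_conf, window=4, fraction=0.10):
    """Mean of the lowest `fraction` of sliding-window confidence averages.

    token_conf: per-token confidence values for one response.
    window: hypothetical group size; shrunk if the response is shorter.
    """
    if not token_conf:
        raise ValueError("token_conf must not be empty")
    w = min(window, len(token_conf))
    # Average confidence of every contiguous group of w tokens.
    groups = [
        sum(token_conf[i:i + w]) / w
        for i in range(len(token_conf) - w + 1)
    ]
    # Keep at least one group so short responses still get a score.
    k = max(1, int(fraction * len(groups)))
    lowest = sorted(groups)[:k]
    return sum(lowest) / k
```

Compared with a plain average over all tokens, this score is dominated by the weakest stretch of a response, so a brief low-confidence passage lowers it sharply.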

This review was generated by AI and checked by a human editor.