QUICK REVIEW

[Paper Review] Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Haichuan Wang, Tao Lin|arXiv (Cornell University)|Jan 31, 2026

Recommender Systems and Techniques|Citations: 0

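One-Line Summary

This paper models reward design for LLM alignment as a Stackelberg game and shows that a threshold-based reward shaping scheme efficiently approximates the optimal reward model, improving user utility in inference-time alignment with minimal overhead.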

ABSTRACT

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
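
For reference, the KL-regularized objective mentioned in the abstract has a well-known closed-form solution, which makes the inherited-bias argument concrete. The notation below (\rho_{\text{base}} for the base policy, \beta for the KL coefficient, r for the reward model) is chosen here for illustration and may differ from the paper's.

```latex
\rho_r(\cdot \mid x)
  = \arg\max_{\rho} \;
    \mathbb{E}_{y \sim \rho(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta \, \mathrm{KL}\big( \rho(\cdot \mid x) \,\|\, \rho_{\text{base}}(\cdot \mid x) \big)
\quad \Longrightarrow \quad
\rho_r(y \mid x)
  = \frac{\rho_{\text{base}}(y \mid x) \, \exp\big( r(x, y) / \beta \big)}
         {\sum_{y'} \rho_{\text{base}}(y' \mid x) \, \exp\big( r(x, y') / \beta \big)}
```

Because the base probability multiplies the exponentiated reward, an output the base policy rarely produces stays unlikely unless its reward is large relative to \beta. Scaling rewards up counteracts this, but it also magnifies any errors in the reward model, which is the tradeoff the abstract describes.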

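Motivation and Objectives

Motivate why directly maximizing the learned reward under KL regularization is not necessarily optimal for user utility.
Formulate reward model design as a Stackelberg game between a leader (the reward designer) and a follower (the LLM).
Characterize the optimal reward model as having a threshold-based structure and provide a practical method for computing it.
Introduce a soft-threshold variant to improve robustness and prevent overfitting to the threshold.
Integrate the approach with inference-time alignment methods and demonstrate empirical gains.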

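Proposed Method

Formulate alignment as a Stackelberg bilevel optimization in which the leader chooses a reward model r to maximize user utility while anticipating the follower's KL-regularized response.
Prove that the optimal reward model is a threshold reward r_m that assigns 0 or B depending on whether r_U(x,y) falls below or above a prompt-dependent threshold m(x).
The threshold should satisfy m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)], yielding a self-consistent threshold tied to user utility.
Provide a Monte Carlo procedure that estimates F_x(m) from base-policy samples and computes m*(x) by binary search.
Introduce a soft-threshold variant r_{m*,alpha} that uses a sigmoid to improve robustness and suppress brittle behavior near the threshold, and show that it converges to the optimal solution as alpha grows.
Show how to incorporate the shaping into existing inference-time methods such as CD and ARGS by retraining the Q-function under the shaped reward.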
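
The threshold computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`shaped_utility`, `estimate_threshold`) are hypothetical, the follower is assumed to respond with the standard KL-regularized policy rho_r(y|x) proportional to rho_base(y|x) exp(r(x,y)/beta), and the expectation is estimated by self-normalized reweighting of base-policy samples. The review does not reproduce the paper's definition of F_x(m), so the sketch bisects directly on the fixed-point gap E[r_U] - m instead.

```python
import math


def shaped_utility(utilities, m, B, beta):
    """Estimate E_{y ~ rho_{r_m}}[r_U(x, y)] from base-policy samples.

    `utilities` holds r_U(x, y_i) for samples y_i drawn from the base policy.
    Under the threshold reward r_m (B if r_U >= m, else 0), the assumed
    KL-regularized follower reweights each sample by exp(r_m / beta).
    Weights are divided by exp(B / beta) to avoid overflow for large B / beta.
    """
    below_weight = math.exp(-B / beta)
    weights = [1.0 if u >= m else below_weight for u in utilities]
    total = sum(weights)
    return sum(w * u for w, u in zip(weights, utilities)) / total


def estimate_threshold(utilities, B, beta, iters=60):
    """Bisect for the self-consistent threshold m = E_{rho_{r_m}}[r_U].

    The gap shaped_utility(m) - m is non-negative at min(utilities) and
    non-positive at max(utilities), so the root lies in that interval.
    """
    lo, hi = min(utilities), max(utilities)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if shaped_utility(utilities, mid, B, beta) > mid:
            lo = mid  # achieved utility exceeds the threshold: raise it
        else:
            hi = mid  # threshold exceeds achieved utility: lower it
    return 0.5 * (lo + hi)
```

As a quick check, with sampled utilities [0.0, 1.0] and B = beta = 1 the fixed point is e / (1 + e), roughly 0.731, which lies above the base-policy mean of 0.5. In practice the estimate is only as good as the number of base-policy samples per prompt.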

Figure 1 : We illustrate the Stackelberg game formulation of LLM alignment. In this framework, the reward model provider acts as the leader by selecting a reward model, while the LLM policy responds as the follower by solving the resulting alignment problem. The reward model provider’s goal is to ch

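Experimental Results

Research Questions

RQ1: Can the optimal reward design under KL regularization be characterized analytically?
RQ2: Does threshold-based reward shaping approximate the Stackelberg optimum and improve user utility?
RQ3: How can the optimal threshold m*(x) be computed efficiently in practice?
RQ4: Does the soft-threshold variant improve robustness and mitigate unstable behavior near the threshold?
RQ5: Can Stackelberg-based reward shaping be combined with existing inference-time methods to improve average reward while keeping overhead low?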

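Key Findings

For the leader in the Stackelberg setting, a threshold reward is optimal: it assigns B to outputs with high true reward and 0 otherwise, with the threshold satisfying m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)].
In practice, a Monte Carlo procedure approximates m*(x) efficiently.
Soft-threshold shaping (SRS) provides robustness, approaches the true Stackelberg optimum as the shaping strength increases, and improves user utility over using r_U directly.
Integrating SRS with inference-time methods (CD and ARGS) raises average reward while maintaining diversity and coherence comparable to the baselines.
GPT-4 evaluation shows SRS holding a consistent win-tie advantage over vanilla and heuristic baselines across multiple evaluation settings, suggesting a reduced risk of reward hacking.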
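
To make the soft-threshold finding concrete, the sketch below compares a hard threshold reward with a sigmoid-smoothed one. It assumes the form B * sigmoid(alpha * (r_U - m)), which is one natural reading of "soft threshold using a sigmoid"; the paper's exact parameterization is not given in this review, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hard_threshold_reward(u, m, B):
    """Threshold reward: B if the true reward u reaches m, else 0."""
    return B if u >= m else 0.0


def soft_threshold_reward(u, m, B, alpha):
    """Assumed soft variant: a sigmoid of sharpness alpha centered at m."""
    return B * sigmoid(alpha * (u - m))


if __name__ == "__main__":
    # As alpha grows, the soft reward approaches the hard threshold reward.
    for alpha in (1.0, 10.0, 100.0):
        print(alpha, soft_threshold_reward(0.8, 0.5, 2.0, alpha))
```

Away from the threshold the two rewards agree for large alpha, while near the threshold the soft version changes gradually, so a small error in the estimated m shifts rewards smoothly rather than flipping them between 0 and B.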

Figure 2 : Reward and GPT-4 win-tie rate as a function of the inference-time reward strength $\frac{1}{\beta}$. The win-tie rate is measured against the base model with no alignment. Solid lines denote the reward given by the reward model, and dashed lines denote the win-tie rate.

This review was generated by AI and reviewed by a human editor.