QUICK REVIEW

[Paper Review] Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Haichuan Wang, Tao Lin|arXiv (Cornell University)|Jan 31, 2026

Recommender Systems and Techniques | Cited by 0

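ONE-SENTENCE SUMMARY

The paper models reward design as a Stackelberg game and shows that threshold-based reward shaping efficiently approximates the optimal reward model, improving user utility in inference-time alignment with minimal overhead.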

ABSTRACT

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.

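MOTIVATION AND OBJECTIVES

Explain why directly maximizing the learned reward under KL regularization is suboptimal for user utility.
Formalize reward model design as a Stackelberg game between a leader (the reward designer) and a follower (the LLM).
Characterize the optimal reward model as a threshold structure and give a tractable way to compute it.
Introduce a relaxed (soft) threshold variant to improve robustness and avoid overfitting to the threshold.
Demonstrate integration with inference-time alignment methods and report the empirical gains.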
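
The leader-follower structure named in these objectives can be written out as a bilevel problem. The block below is a sketch under assumptions rather than the paper's exact statement: the symbol \pi_{\mathrm{base}} for the base policy, the use of \beta as the KL coefficient, and the bound r \in [0, B] are notation chosen here for illustration. The closed form on the last line is the standard solution of a KL-regularized reward maximization.

```latex
\begin{aligned}
\text{Leader:}\quad
  & \max_{r \,:\, r(x,y)\in[0,B]} \;
    \mathbb{E}_{y\sim\rho_r(\cdot\mid x)}\big[\, r_U(x,y) \,\big] \\
\text{Follower:}\quad
  & \rho_r(\cdot\mid x) = \arg\max_{\rho}\;
    \mathbb{E}_{y\sim\rho}\big[\, r(x,y) \,\big]
    - \beta\,\mathrm{KL}\big(\rho \,\|\, \pi_{\mathrm{base}}(\cdot\mid x)\big) \\
\text{Closed form:}\quad
  & \rho_r(y\mid x) =
    \frac{\pi_{\mathrm{base}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)}
         {\sum_{y'} \pi_{\mathrm{base}}(y'\mid x)\,\exp\!\big(r(x,y')/\beta\big)}
\end{aligned}
```

Because \rho_r is proportional to \pi_{\mathrm{base}}, any bias in the base policy carries over into the aligned policy unless the reward r is designed to counteract it, which is the tradeoff the leader optimizes.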

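PROPOSED METHOD

Formalize alignment as a Stackelberg bilevel optimization in which the leader chooses a reward model r to maximize user utility, anticipating the follower's KL-regularized response.
Prove that the optimal reward model is a threshold reward r_m that assigns 0 when r_U(x,y) is below a prompt-dependent threshold m(x) and B when it is above.
Derive that the optimal threshold should satisfy m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)], which yields a self-consistent threshold aligned with user utility.
Provide a Monte Carlo procedure that estimates F_x(m) by sampling from the base policy and finds m*(x) by binary search.
Introduce a soft-threshold variant r_{m*,alpha} that uses a sigmoid function to improve robustness, and prove that it converges to the optimal solution as alpha grows.
Show how to integrate the shaping into existing inference-time methods (CD and ARGS) by shaping the offline data and retraining the Q-function under the shaped reward.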
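
The threshold computation described in this list can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the follower's best response is the base policy reweighted by exp(r/beta), and it root-finds the gap between the expected user reward under that response and m. The paper's exact definition of F_x(m) may differ, and the names `shaped_mean_gap` and `estimate_threshold` are hypothetical.

```python
import math


def shaped_mean_gap(rewards, m, bonus, beta):
    """Monte Carlo estimate of E_{y ~ rho_m}[r_U] - m.

    `rewards` are r_U values of responses sampled from the base policy.
    Assumption: under the threshold reward (bonus if r_U >= m, else 0),
    the follower reweights each base sample by exp(reward / beta).
    """
    boost = math.exp(bonus / beta)  # weight of samples at or above m
    num = 0.0
    den = 0.0
    for r in rewards:
        weight = boost if r >= m else 1.0
        num += weight * r
        den += weight
    return num / den - m


def estimate_threshold(rewards, bonus, beta, tol=1e-8):
    """Bisection for the self-consistent threshold m = E_{rho_m}[r_U].

    The gap is >= 0 at min(rewards) and <= 0 at max(rewards), and its
    sign changes exactly once, so bisection on the sign converges.
    """
    if not rewards:
        raise ValueError("need at least one sampled reward")
    lo, hi = min(rewards), max(rewards)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if shaped_mean_gap(rewards, mid, bonus, beta) > 0:
            lo = mid  # expected reward still exceeds m: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with sampled rewards `[0.0, 1.0]` and `bonus / beta = ln 3`, the sample at 1.0 gets weight 3, the weighted mean is 0.75, and `estimate_threshold` returns approximately 0.75, above the base-policy mean of 0.5.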

Figure 1 : We illustrate the Stackelberg game formulation of LLM alignment. In this framework, the reward model provider acts as the leader by selecting a reward model, while the LLM policy responds as the follower by solving the resulting alignment problem. The reward model provider’s goal is to ch

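EXPERIMENTAL RESULTS

Research Questions

RQ1: Can the optimal reward design under KL regularization in LLM alignment be characterized analytically?
RQ2: Can threshold-based reward shaping approximate the Stackelberg optimum and improve user utility?
RQ3: How can the optimal threshold m*(x) be computed efficiently in practice?
RQ4: Does the soft-threshold variant improve robustness and reduce brittleness near the threshold?
RQ5: Does integrating Stackelberg-based reward shaping with existing inference-time methods raise average reward at acceptable overhead?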

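Key Findings

In the Stackelberg formulation, a threshold reward model is optimal for the leader: it assigns B to outputs whose true reward is above the threshold and 0 otherwise, and the threshold m*(x) satisfies m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)].
A Monte Carlo procedure can efficiently approximate m*(x) in practical LLM settings.
Soft-threshold reward shaping (SRS) provides robustness and approaches the true Stackelberg optimum as the shaping strength increases; it improves user utility relative to using r_U directly.
Combining SRS with inference-time methods (CD and ARGS) yields higher average reward while keeping diversity and coherence close to the baselines.
GPT-4 evaluation shows a consistent win-tie advantage for SRS over vanilla and heuristic baselines across multiple evaluation settings, suggesting a reduced risk of reward hacking.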
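
The relationship between the soft and hard threshold rewards in these findings can be illustrated with a small sketch. The functional form used here, B times a sigmoid of alpha times (r_U minus m), is an assumption made for illustration, since this summary does not give the paper's exact parameterization. The function names and the tie convention (B when r_U equals m) are also hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hard_threshold_reward(r_u, m, bonus):
    """Threshold reward: bonus when r_U reaches the threshold m, else 0.

    Assumption: ties (r_U == m) receive the bonus.
    """
    return bonus if r_u >= m else 0.0


def soft_threshold_reward(r_u, m, bonus, alpha):
    """Assumed soft variant: bonus * sigmoid(alpha * (r_U - m)).

    Larger alpha makes the transition around m sharper; away from m the
    value approaches the hard threshold reward as alpha grows.
    """
    return bonus * sigmoid(alpha * (r_u - m))
```

Under this assumed form, a response slightly below the threshold still receives a small positive reward at moderate alpha, which is one way a soft variant can be less brittle to errors in the estimated threshold. Exactly at r_U = m the soft reward is B/2 for every alpha.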

Figure 2 : Reward and GPT-4 win-tie rate as a function of the inference-time reward strength $\frac{1}{\beta}$. The win-tie rate is measured against the base model with no alignment. Solid lines denote the reward given by the reward model, and dashed lines denote the win-tie rate.


This review was generated by AI and reviewed by human editors.