QUICK REVIEW

[Paper Review] Reward Shaping for Inference-Time Alignment: A Stackelberg Game Perspective

Haichuan Wang, Tao Lin | arXiv (Cornell University) | 2026-01-31

Recommender Systems and Techniques | Citations: 0

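One-Line Summary

This paper models reward design for LLM alignment as a Stackelberg game and shows that a threshold-based reward shaping scheme can efficiently approximate the optimal reward model, improving user utility in inference-time alignment with minimal overhead.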

ABSTRACT

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.

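Motivation and Objectives

Motivates why directly maximizing the learned reward under KL regularization is suboptimal for user utility.
Formalizes reward model design as a Stackelberg game between a leader (the reward designer) and a follower (the LLM).
Characterizes the optimal reward model as a threshold-based structure and presents a practical method for computing it.
Introduces a relaxed (soft) threshold variant to prevent overfitting to the threshold and to improve robustness.
Demonstrates integration with inference-time alignment methods and shows empirical gains.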

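Proposed Method

Frames alignment as a bilevel Stackelberg optimization in which the leader chooses a reward model r to maximize user utility while anticipating the follower's KL-regularized response.
Shows that the optimal reward model is a threshold reward r_m that assigns 0 when r_U(x,y) is below a prompt-dependent threshold m(x) and B when it is above.
Derives that the optimal threshold satisfies m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)], which yields a self-consistent threshold equal to the user utility it induces.
Presents a Monte Carlo procedure that estimates F_x(m) from samples drawn from the base policy and computes m*(x) by bisection search.
Introduces a soft-threshold variant r_{m*,alpha} that uses a sigmoid to improve robustness, and shows that it converges to the optimal solution as alpha increases.
Shows how to integrate the shaping scheme into existing inference-time methods (CD and ARGS) by shaping the offline data and retraining the Q-function under the shaped reward.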
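
The Monte Carlo and bisection steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the follower's KL-regularized response has the usual closed form rho_r(y|x) ∝ rho_base(y|x) · exp(r(x,y)/beta), that ties at the threshold receive B, and that `r_u` holds r_U scores of responses sampled from the base policy for a single prompt. The function names `induced_utility` and `solve_threshold` are hypothetical.

```python
import numpy as np


def induced_utility(r_u, m, B, beta):
    """Estimate E_{y ~ rho_{r_m}}[r_U] from base-policy samples.

    Assumption: the follower policy reweights base samples by exp(r_m / beta),
    so a self-normalized weighted average of r_u approximates the expectation.
    """
    r_u = np.asarray(r_u, dtype=float)
    logits = np.where(r_u >= m, B / beta, 0.0)   # threshold reward r_m, scaled by 1/beta
    w = np.exp(logits - logits.max())            # subtract max for numerical stability
    return float(np.sum(w * r_u) / np.sum(w))


def solve_threshold(r_u, B, beta, tol=1e-6, max_iter=100):
    """Bisection for a threshold m with induced_utility(m) approximately equal to m."""
    r_u = np.asarray(r_u, dtype=float)
    lo, hi = float(r_u.min()), float(r_u.max())
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        # If the induced utility still exceeds the threshold, the threshold is too low.
        if induced_utility(r_u, mid, B, beta) >= mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Usage with synthetic scores standing in for r_U(x, y) on base-policy samples.
rng = np.random.default_rng(0)
scores = rng.normal(size=2000)
m_star = solve_threshold(scores, B=2.0, beta=1.0)
```

On the empirical sample, the gap between the induced utility and the threshold changes sign only once, which is why a plain sign-based bisection between the smallest and largest score is enough here.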

Figure 1 : We illustrate the Stackelberg game formulation of LLM alignment. In this framework, the reward model provider acts as the leader by selecting a reward model, while the LLM policy responds as the follower by solving the resulting alignment problem. The reward model provider’s goal is to ch
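
As a rough sketch of the leader-follower structure that Figure 1 illustrates, the game can be written in the standard KL-regularized form below. The symbols \rho_{\mathrm{base}} (base policy) and \beta (KL coefficient, matching the 1/\beta reward strength in Figure 2), the bound 0 \le r \le B, and the tie-breaking at the threshold are assumptions made for illustration; the paper's exact statement may differ.

```latex
% Follower (LLM): best response to a reward model r under KL regularization
\rho_r(\cdot \mid x) = \arg\max_{\rho}\;
  \mathbb{E}_{y \sim \rho(\cdot \mid x)}\big[r(x,y)\big]
  - \beta\, \mathrm{KL}\big(\rho(\cdot \mid x)\,\|\,\rho_{\mathrm{base}}(\cdot \mid x)\big)
\;\;\Longrightarrow\;\;
\rho_r(y \mid x) \propto \rho_{\mathrm{base}}(y \mid x)\,\exp\!\big(r(x,y)/\beta\big)

% Leader (reward designer): choose r anticipating the follower's response
\max_{r \,:\, 0 \le r \le B}\;
  \mathbb{E}_{y \sim \rho_r(\cdot \mid x)}\big[r_U(x,y)\big]

% Threshold reward and its self-consistency condition
r_m(x,y) = B \cdot \mathbf{1}\big[r_U(x,y) \ge m(x)\big],
\qquad
m^*(x) = \mathbb{E}_{y \sim \rho_{r_{m^*}}(\cdot \mid x)}\big[r_U(x,y)\big]
```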

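Experimental Results

Research Questions

RQ1: Can optimal reward design under KL regularization be characterized analytically for LLM alignment?
RQ2: Does a threshold-based reward shaping scheme approximate the Stackelberg optimum and improve user utility?
RQ3: How can the optimal threshold m*(x) be computed efficiently in practice?
RQ4: Does the soft-threshold variant improve robustness and mitigate brittle behavior near the threshold?
RQ5: Can Stackelberg-based reward shaping be integrated into existing inference-time methods with relatively little overhead while improving average reward?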

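Key Findings

In the Stackelberg formulation, a threshold reward model is optimal for the leader: it assigns B to outputs with high true reward and 0 to all others, and the threshold satisfies m*(x) = E_{y ~ rho_{r_{m*}}}[r_U(x,y)].
A Monte Carlo procedure can efficiently approximate m*(x) as needed in practical LLM settings.
Soft-threshold shaping (SRS) provides robustness, approaches the true Stackelberg optimum as the shaping strength grows, and improves user utility over using r_U directly.
Integrating SRS with inference-time methods (CD and ARGS) raises average reward while keeping diversity and coherence similar to the baselines.
In GPT-4 evaluation, SRS shows a consistent win-tie advantage over vanilla and rule-based baselines across many evaluation settings, suggesting a reduced risk of reward hacking.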
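
To make the soft-threshold idea concrete, here is a small sketch of how a sigmoid-shaped reward approaches the hard threshold reward as the sharpness grows. It assumes the parametrization r_{m,alpha}(x,y) = B · sigmoid(alpha · (r_U(x,y) − m)); the paper's exact form may differ, and the function names are hypothetical.

```python
import numpy as np


def hard_threshold_reward(r_u, m, B):
    """Hard threshold reward: B where r_u >= m, otherwise 0 (ties get B by assumption)."""
    r_u = np.asarray(r_u, dtype=float)
    return B * (r_u >= m).astype(float)


def soft_threshold_reward(r_u, m, B, alpha):
    """Assumed soft variant: B * sigmoid(alpha * (r_u - m)).

    The sigmoid is computed through tanh, which avoids overflow for large |alpha|.
    """
    z = alpha * (np.asarray(r_u, dtype=float) - m)
    return B * 0.5 * (1.0 + np.tanh(0.5 * z))


def max_gap(r_u, m, B, alpha):
    """Largest absolute difference between the soft and hard rewards on the samples."""
    return float(np.max(np.abs(
        soft_threshold_reward(r_u, m, B, alpha) - hard_threshold_reward(r_u, m, B)
    )))


# Usage: the gap shrinks as alpha grows, for scores that are not exactly at the threshold.
scores = np.array([-1.0, 0.4, 0.6, 2.0])
gaps = [max_gap(scores, m=0.5, B=1.0, alpha=a) for a in (1.0, 10.0, 100.0)]
```

A smaller alpha gives scores near the threshold a graded reward instead of an all-or-nothing jump, which is the robustness benefit described above; the trade-off is a larger distance from the hard threshold reward.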

Figure 2: Reward and GPT-4 win-tie rate as a function of the inference-time reward strength $\frac{1}{\beta}$. The win-tie rate is measured against the base model with no alignment. Solid lines denote the reward given by the reward model, and dashed lines denote the win-tie rate.

This review was generated by AI and reviewed by a human editor.