QUICK REVIEW

[Paper Review] Consensus Group Relative Policy Optimization for Text Generation

Yuki Ichihara, Yuu Jinnai | arXiv (Cornell University) | 2026-02-03

Topic Modeling | Citations: 0

One-Line Summary

C-GRPO distills MBR consensus into single-pass, reference-free policy training, achieving MBR-level quality on MT and summarization without inference-time reranking.

ABSTRACT

Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the objective function of C-GRPO is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO successfully achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.
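The sample-and-rerank procedure that the abstract describes can be illustrated with a minimal sketch of MBR-style decoding. This is not the authors' implementation: `token_f1` is a hypothetical toy utility standing in for a task metric such as ROUGE or COMET, and the candidate strings are invented for illustration.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): unigram-overlap F1 between two strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: list[str], utility=token_f1) -> str:
    """Return the candidate with the highest mean utility against the other samples."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")

    def consensus(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=consensus)
    return candidates[best]


# Invented samples; in practice these would be drawn from the policy.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly at night",
]
print(mbr_decode(samples))
```

Each call scores every candidate against every other candidate, so the cost grows quadratically with the number of samples. That repeated sampling and scoring is the inference-time overhead the paper aims to move into training.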

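Motivation and Objectives

Distill consensus-based text generation into training to reduce inference-time cost.
Enable training without a learned reward model, using only a task utility function and on-policy samples.
Provide theoretical alignment and convergence guarantees for the proposed approach.
Demonstrate effectiveness on MT and summarization benchmarks without gold references.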

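Proposed Method

Formulate a group-relative GRPO objective using an intra-group consensus utility.
Define the consensus utility as the average pairwise similarity among the sampled candidates in a group (self-consensus).
Train a single-pass policy to maximize the group-relative advantage without explicit reward supervision.
Prove that, under mild assumptions, the expected GRPO update is aligned with the gradient of the target MBR objective.
Evaluate against MBR and GRPO baselines on MT (En→Ja/Zh/De) and XSum summarization.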
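As a rough illustration of the first three steps, the sketch below computes a self-consensus reward for each sample in a group and converts the rewards into group-relative advantages. It rests on assumptions: `jaccard` is a hypothetical toy utility, the mean/standard-deviation normalization follows the commonly described GRPO form rather than anything confirmed by this review, and the policy-gradient update itself is omitted.

```python
import statistics


def jaccard(a: str, b: str) -> float:
    """Toy symmetric utility (assumption): Jaccard overlap of token sets."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def consensus_rewards(group: list[str], utility=jaccard) -> list[float]:
    """Reward of each sample = mean utility against the other samples in its group."""
    n = len(group)
    if n < 2:
        raise ValueError("need at least two samples per group")
    return [
        sum(utility(group[i], group[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: center on the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Invented samples; in training these would be on-policy generations for one prompt.
group = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly at night",
]
rewards = consensus_rewards(group)
advantages = group_relative_advantages(rewards)
for text, r, a in zip(group, rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}  {text}")
```

Samples that agree with the rest of the group receive positive advantages, and outliers receive negative ones, so no gold reference or preference label enters the computation. Only the utility function and the policy's own samples are needed.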

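Experimental Results

Research Questions

RQ1. Can consensus-based decoding be distilled into a single pass without gold references or explicit preference data?
RQ2. Does C-GRPO align its training updates with the gradient of the MBR objective and converge efficiently?
RQ3. How does C-GRPO perform against MBR and reference-free baselines on MT and summarization tasks?
RQ4. Is the learned policy robust across model families and scales?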

Key Results

Model	Method	ROUGE-Lsum ↑ (XSum)
Llama	Base Model	0.361
Llama	GRPO w/ Random	0.320
Llama	MBR decoding	0.361
Llama	GRPO w/ Self-Rewarding	0.229
Llama	SFT w/ MBR generations	0.351
Llama	C-GRPO (Ours)	0.419
Llama	C-Dr. GRPO (Ours)	0.414
Mistral	Base Model	0.230
Mistral	GRPO w/ Random	0.222
Mistral	MBR decoding	0.245
Mistral	GRPO w/ Self-Rewarding	0.232
Mistral	SFT w/ MBR generations	0.233
Mistral	C-GRPO (Ours)	0.243
Mistral	C-Dr. GRPO (Ours)	0.231

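C-GRPO achieves MBR-level quality without inference-time reranking, performing on par with or better than MBR decoding on MT and summarization.
On XSum, C-GRPO often improves ROUGE-Lsum (for example, 0.419 with Llama versus 0.361 for both the base model and MBR decoding), and on translation it outperforms the MBR and GRPO baselines across models.
With Llama and Mistral on En→Ja/Zh/De, C-GRPO yields the strongest average COMET score among the compared methods.
C-Dr. GRPO, a variant with more conservative updates, also maintains strong performance and stability across tasks.
C-GRPO is robust across model families (Llama, Mistral, Qwen) and scales, apart from some degradation with very small models on summarization.
On JBBQ, a Japanese QA benchmark, C-GRPO improves accuracy over the base model and outperforms the MBR and self-rewarding baselines.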


This review was generated by AI and reviewed by a human editor.