QUICK REVIEW

[Paper Review] Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs

Yibo Wang, Hai-Long Sun | arXiv (Cornell University) | 2026. 01. 13.

Topic Modeling | Citations: 0

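One-Line Summary

The paper introduces triplet-based self-play fine-tuning (T-SPIN). The method forms triplets that include historical and proto-synthetic responses and adds an entropy constraint, which stabilizes self-play fine-tuning of LLMs and yields strong performance when annotated data is limited.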

ABSTRACT

Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the utilization of reference policy induces a misalignment issue between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces the entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% samples, highlighting its effectiveness when faced with scarce annotated data.

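Motivation and Objectives

Addresses the instability and misalignment problems of existing self-play fine-tuning (SPIN) for LLMs.
Proposes a triplet-input framework that makes use of historical samples and proto-synthetic samples.
Introduces an entropy constraint to obtain a reference-free training objective, which aligns training with generation.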
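One standard way to see why an entropy constraint can support a reference-free reward is the maximum-entropy argument below. It is a sketch under assumptions, not the paper's derivation: the score function $c(x, y)$, the temperature $\alpha$, and the normalizer $Z(x)$ follow common notation and may not match the paper's exact formulation.

```latex
\begin{aligned}
\pi^{*}(\cdot \mid x) &= \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[c(x, y)\big] + \alpha\, \mathcal{H}\big(\pi(\cdot \mid x)\big), \\
\pi^{*}(y \mid x) &= \frac{\exp\big(c(x, y)/\alpha\big)}{Z(x)}, \qquad Z(x) = \sum_{y'} \exp\big(c(x, y')/\alpha\big), \\
c(x, y) &= \alpha \log \pi^{*}(y \mid x) + \alpha \log Z(x).
\end{aligned}
```

The term $\alpha \log Z(x)$ depends only on the prompt, so it cancels in any difference of scores between two responses to the same prompt. What remains is a reward of the form $\alpha \log \pi(y \mid x)$, which is the same log-likelihood used to score generations and involves no reference policy. SPIN's reward, by contrast, is a log-ratio against the previous-iteration policy.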

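Proposed Method

Introduces triplet inputs consisting of real (annotated) samples, synthetic samples, and proto-synthetic samples obtained from the initial policy, in order to stabilize optimization.
Uses a main/opponent update scheme in which the main player learns from triplet samples that carry both current and historical advantages.
Replaces SPIN's reference-policy reward with a confidence-based loss aligned with the training signal, using the reward $r(z \mid x) = \alpha \log \pi_{\theta}(z \mid x)$.
Uses an objective inspired by integral probability metrics (IPM) that learns a confidence function $c(x, y)$ and introduces a closed-form opponent policy, which reduces to a softmax over $c$.
Defines an end-to-end loss $L_{\text{T-SPIN}}(\theta)$ that combines current and historical advantages without any reference-dependent training signal (similar to Eq. (7)).
Provides Algorithm 1, which details the alternating optimization between the main player and the opponent.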
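The triplet idea described above can be illustrated with a minimal per-example sketch. It is not the paper's Eq. (7): the logistic loss, the weight `beta`, the direction of the historical advantage (synthetic over proto-synthetic), and all function and parameter names are assumptions made for illustration. Only the reference-free reward $r(z \mid x) = \alpha \log \pi_{\theta}(z \mid x)$ comes from the review.

```python
import math


def log_sigmoid(u: float) -> float:
    """Numerically stable log(sigmoid(u))."""
    if u >= 0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))


def triplet_loss(
    logp_annotated: float,
    logp_synthetic: float,
    logp_proto: float,
    alpha: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Illustrative triplet loss for one prompt (hypothetical form).

    Each argument is log pi_theta(response | prompt) under the current
    policy, so every reward is reference-free: r = alpha * log pi_theta.
    """
    r_annotated = alpha * logp_annotated
    r_synthetic = alpha * logp_synthetic
    r_proto = alpha * logp_proto

    # Current advantage: annotated response over the latest synthetic one.
    current_adv = r_annotated - r_synthetic
    # Historical advantage (assumed direction): latest synthetic response
    # over the proto-synthetic response from the initial policy.
    historical_adv = r_synthetic - r_proto

    # Assumed logistic loss on each advantage, weighted by beta.
    return -log_sigmoid(current_adv) - beta * log_sigmoid(historical_adv)


if __name__ == "__main__":
    # Current advantage is zero here, but the historical term still varies
    # with the gap between synthetic and proto-synthetic responses.
    print(triplet_loss(-5.0, -5.0, -9.0))
```

In this sketch the two terms are simply summed, so the historical term keeps contributing to the objective even when the annotated and synthetic responses receive the same reward. How the paper actually weights and combines the terms should be taken from its Eq. (7) and Algorithm 1.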

Figure 1 : Comparisons of three strategies: (a) supervised fine-tuning requires large amounts of annotated data to train $\pi_{\theta}$ ; (b) self-play fine-tuning operates with limited annotated data and iteratively generated samples, and employs the previous policy $\pi_{\theta_{t}}$ as a referenc

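Experimental Results

Research Questions

RQ1: How does introducing triplet inputs (annotated, synthetic, and proto-synthetic samples) affect the stability and performance of self-play fine-tuning?
RQ2: Does removing the reference policy and introducing an entropy-constrained opponent better align the training reward with generation?
RQ3: How do historical advantages affect convergence during iterative fine-tuning?
RQ4: With limited annotated data, how does T-SPIN perform against SPIN and SFT across a range of tasks?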

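Key Results

On Zephyr-7B, T-SPIN outperforms SPIN and evolves more stably across iterations.
Compared with SFT on the full data, it achieves comparable or better results on average with only 50k annotated samples.
T-SPIN aligns the training reward with the generation metric, mitigating the discrepancy observed in SPIN.
With only 25% of the annotated data, it matches or exceeds supervised fine-tuning on the full data across the reported tasks.
In experiments on Zephyr-7B and Mistral-7B with 50k annotated samples from Ultrachat200k, it shows notable gains, particularly on math and instruction-following tasks.
T-SPIN maintains stable performance over multiple iterations, whereas SPIN can degrade after an early peak.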

Figure 2: Performance (%) comparisons between T-SPIN and SPIN on two tasks: GSM8K and IFEval over $5$ iterations. The average scores over $10$ different tasks are also illustrated in the right panel.


This review was generated by AI and reviewed by a human editor.