QUICK REVIEW

[Paper Review] Zephyr: Direct Distillation of LM Alignment

Lewis Tunstall, Edward Beeching | arXiv (Cornell University) | Oct 25, 2023

Topic Modeling | Citations: 53

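One-Line Summary

Zephyr-7B aligns a small open LM using distilled supervised fine-tuning (dSFT) followed by distilled direct preference optimization (dDPO) on AI feedback. It reaches state-of-the-art performance among 7B chat models and is competitive with much larger models, all without human annotation.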

ABSTRACT

We aim to produce a smaller language model that is aligned to user intent. Previous research has shown that applying distilled supervised fine-tuning (dSFT) on larger models significantly improves task accuracy; however, these models are unaligned, i.e. they do not respond well to natural prompts. To distill this property, we experiment with the use of preference data from AI Feedback (AIF). Starting from a dataset of outputs ranked by a teacher model, we apply distilled direct preference optimization (dDPO) to learn a chat model with significantly improved intent alignment. The approach requires only a few hours of training without any additional sampling during fine-tuning. The final result, Zephyr-7B, sets the state-of-the-art on chat benchmarks for 7B parameter models, and requires no human annotation. In particular, results on MT-Bench show that Zephyr-7B surpasses Llama2-Chat-70B, the best open-access RLHF-based model. Code, models, data, and tutorials for the system are available at https://github.com/huggingface/alignment-handbook.

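Motivation and Objectives

Produce a smaller language model that is aligned to user intent.
Show that distillation from larger teacher models, combined with AI-generated preferences, can achieve robust alignment.
Show that a 7B model trained with dDPO can match or surpass 70B-parameter chat models on key benchmarks.
Release a reproducible training recipe, datasets, and code for alignment without human annotation.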

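Proposed Method

Build self-instruct-style dialogue data (UltraChat) and apply distilled SFT (dSFT) to a 7B base model.
Collect AI feedback by generating multiple teacher completions per prompt and having GPT-4 score them (UltraFeedback), which yields preference data.
Apply distilled direct preference optimization (dDPO), in which the reward is defined implicitly by the current policy and a reference policy rather than by a separately trained reward model.
Train on 16 A100 GPUs (80GB each) using TRL, DeepSpeed ZeRO-3, and FlashAttention-2.
Initialize from the dSFT model trained for one epoch, then fine-tune with three epochs of DPO.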
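The preference-building and dDPO steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_preference_pair` and `dpo_loss` are hypothetical, the rule "highest-scored completion is chosen, a random other completion is rejected" is an assumption not stated in this review, `beta=0.1` is an assumed default, and the log-probabilities are assumed to be summed sequence log-probabilities supplied by the caller.

```python
import math
import random


def build_preference_pair(scored, rng):
    """Turn teacher-scored completions into one (chosen, rejected) pair.

    scored: list of (completion_text, teacher_score) tuples.
    Assumption: the top-scored completion is "chosen" and a random
    remaining completion is "rejected".
    """
    if len(scored) < 2:
        raise ValueError("need at least two completions to form a pair")
    best = max(range(len(scored)), key=lambda i: scored[i][1])
    others = [i for i in range(len(scored)) if i != best]
    worse = rng.choice(others)
    return scored[best][0], scored[worse][0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(reward margin)).

    The implicit reward of a completion is beta times the log-ratio of
    the current policy to the frozen reference policy.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    rng = random.Random(0)
    chosen, rejected = build_preference_pair(
        [("answer A", 3.0), ("answer B", 9.0), ("answer C", 5.0)], rng
    )
    print("chosen:", chosen, "| rejected:", rejected)
    # When the policy equals the reference, the margin is zero.
    print("loss at initialization:", dpo_loss(-4.0, -6.0, -4.0, -6.0))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen completion's likelihood relative to the rejected one, measured against the reference. No completions are sampled from the policy during training, which is consistent with the abstract's note that the approach needs no additional sampling during fine-tuning.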

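Experimental Results

Research Questions

RQ1: Can a small open LM be aligned to user intent through distillation, without human annotation?
RQ2: Does dDPO trained on AI-generated preferences achieve alignment competitive with larger, human-aligned models?
RQ3: What is the combined effect of dSFT and dDPO on standard chat benchmarks for a 7B model?
RQ4: How does Zephyr-7B compare with open and proprietary chat models on MT-Bench, AlpacaEval, and Open LLM Leaderboard tasks?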

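Key Results

Zephyr-7B sets a new state of the art among open 7B chat models on MT-Bench with a score of 7.34, and reaches a 90.60% win rate on AlpacaEval, surpassing many open models.
Against larger open models, Zephyr-7B is competitive with Llama2-Chat-70B on MT-Bench and falls within two standard deviations of it on AlpacaEval.
dDPO substantially improves chat ability over dSFT alone; in the ablation study, dSFT followed by dDPO gives the best performance on both MT-Bench and AlpacaEval.
One epoch of DPO after the initial dSFT is beneficial, whereas extended DPO after a longer initial SFT can degrade downstream performance, which suggests that the alignment stages need careful scheduling.
Zephyr-7B narrows the gap with proprietary models on some benchmarks and surpasses Llama2-Chat-70B on MT-Bench, while remaining open source.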

This review was generated by AI and reviewed by a human editor.