QUICK REVIEW

[Paper Review] Orca-Math: Unlocking the potential of SLMs in Grade School Math

Arindam Mitra, Hamed Khanpour|arXiv (Cornell University)|2024. 02. 16.

Cognitive and developmental aspects of mathematical skills | Citations: 7

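One-line summary

Orca-Math shows that a 7B-parameter SLM trained on 200K synthetic math problems can reach 86.81% on GSM8K pass@1 without external tools or ensembling, by combining iterative preference learning with a high-quality agent-generated dataset.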

ABSTRACT

Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size, needed to achieve over 80% accuracy on the GSM8K benchmark, is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Additionally, they employ ensembling, where outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase with multiple calls to the model (e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on the Mistral-7B, which achieves 86.81% on GSM8k without the need for multiple model calls or the use of verifiers, code execution or any other external tools. Our approach has the following key elements: (1) A high quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data, (2) An iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions and learn from preference pairs incorporating the SLM solutions and the feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves 81.50% on GSM8k pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).

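Motivation and objectives

Addresses the need to improve the mathematical reasoning of small language models (SLMs) on grade-school word problems.
Proposes a high-quality synthetic data pipeline and iterative learning to strengthen SLM reasoning.
Demonstrates that a smaller model can outperform larger models on GSM8K with less data and no external tools.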

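Proposed method

Builds the Orca-Math-dataset of 200K math problems with an agent-based data generation pipeline, with solutions produced by GPT-4-Turbo.
Applies an iterative learning loop of supervised fine-tuning, student problem solving, and teacher feedback on the student's solutions.
Aligns the model with positive and negative solution signals and preference-based fine-tuning (DPO and KTO), without external verification tools.
Evaluates on GSM8K and other benchmarks with a GPT-4-based answer-matching prompt (GPT4-based-Exact-Match) as the accuracy metric.
Compares against larger models (e.g., Llama-2-70B, WizardMath-70B, Gemini-Pro, GPT-3.5) and shows competitive results with far less data.
Reports ablations showing the importance of model-generated positive and negative samples.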
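
The step from graded student solutions to preference data can be illustrated with a minimal sketch. It assumes that correct student solutions join the teacher solution as positives and that incorrect ones become negatives, which matches the ablation described in this review. The function name, the all-pairs scheme, and the `is_correct` callback are hypothetical stand-ins, not the authors' code. In the paper the correctness judgment comes from a GPT-4-based comparison, which is replaced here by a toy string check.

```python
from itertools import product

def build_preference_pairs(problem, teacher_solution, student_solutions, is_correct):
    """Return (prompt, chosen, rejected) triples for one problem.

    is_correct is a hypothetical grading callback supplied by the caller.
    """
    positives = [teacher_solution]   # the teacher solution is always a positive
    negatives = []
    for solution in student_solutions:
        # Correct student attempts become positives, wrong ones negatives.
        (positives if is_correct(solution) else negatives).append(solution)
    # Assumption: pair every positive with every negative.
    return [(problem, pos, neg) for pos, neg in product(positives, negatives)]

# Toy usage: one correct and two incorrect student attempts.
pairs = build_preference_pairs(
    "What is 2 + 3?",
    "2 + 3 = 5. Answer: 5",
    ["Answer: 5", "Answer: 6", "Answer: 4"],
    lambda s: s.strip().endswith("5"),
)
```

With two positives and two negatives the sketch yields four triples. A problem for which every student attempt is correct yields none here, and how such cases are handled is left open in this sketch.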

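Experimental results

Research questions

RQ1: Can a 7B SLM exceed 80% on GSM8K pass@1 without ensembling or external tools?
RQ2: How does the modified preference learning (with positive and negative signals) differ from standard supervised training for SLM math reasoning?
RQ3: What effect does agent-generated, high-quality synthetic data have on the math reasoning performance of small models?
RQ4: Do model-generated positive samples and synthetic negative samples contribute meaningfully to learning efficiency?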

Main results

Model	Base model	Model size	Answer format	Eval method	GSM8K (%)
Llama-2	-	7B	nlp	pass@1	14.6
Llama-2	-	13B	nlp	pass@1	28.7
Llama-2	-	34B	nlp	pass@1	42.2
Llama-2	-	70B	nlp	pass@1	56.8
MetaMath	Llama-2	7B	nlp	pass@1	66.5
MetaMath	Llama-2	13B	nlp	pass@1	72.3
MetaMath	Llama-2	70B	nlp	pass@1	82.3
WizardMath	Llama-2	7B	nlp	pass@1	54.9
WizardMath	Llama-2	13B	nlp	pass@1	63.9
WizardMath	Llama-2	70B	nlp	pass@1	81.6
MAmmoTH	Code-Llama	7B	code	pass@1	59.4
MAmmoTH	Code-Llama	12B	code	pass@1	64.7
MAmmoTH	Code-Llama	34B	code	pass@1	72.7
MAmmoTH	Llama-2	70B	nlp	pass@1	76.9
Mistral	-	7B	nlp	maj1@8	52.2
Mistral	-	8×7B	nlp	maj1@8	58.4
OVM	Llama-2	7B+7B	nlp	verify100@1	73.7
OVM	Mistral	7B+7B	nlp	verify100@1	84.7
Llemma	-	7B	nlp	pass@1	36.4
Llemma	-	34B	nlp	pass@1	51.5
ToRA-Code	-	7B	code	COT@1	72.6
ToRA-Code	-	13B	-	COT@1	75.8
ToRA-Code	-	34B	-	COT@1	80.7
ToRA-Code	-	70B	-	COT@1	84.3
Orca 2	Llama-2	7B	nlp	pass@1	55.72
Orca 2	Llama-2	13B	nlp	pass@1	65.73
Gemini Pro	-	-	nlp	maj1@32	86.5
GPT-3.5-0613	-	-	code	pass@1	77.4
GPT-4-0613	-	-	-	-	97.0
Phi-1.5	-	1.3B	code	pass@1	44.6
Phi-GSM	1.5-tiny	125M	code	pass@1	63.1
Phi-GSM	1.5-small	350M	code	pass@1	65.9
Phi-GSM	1.5	1.3B	code	pass@1	68.2
Phi-GSM+V	1.5-tiny+	125M+125M	code	verify48@1	68.9
Phi-GSM+V	1.5-small+	350M+350M	code	verify48@1	71.3
Phi-GSM+V	1.5+	1.3B+1.3B	code	verify48@1	81.5
Orca-Math	Mistral	7B	nlp	pass@1	86.81
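
The evaluation methods in the table are not directly comparable in cost: pass@1 scores a single sampled answer per problem, maj1@k takes a majority vote over k samples, and verifyN@1 lets a separate verifier pick one of N samples. The sketch below illustrates the first two on toy data. The function names and the sample answers are illustrative assumptions, and answers are compared by plain string equality rather than by the GPT-4-based matching used in the paper.

```python
from collections import Counter

def pass_at_1(first_answers, gold):
    """Fraction of problems whose single sampled answer matches the gold answer."""
    return sum(a == g for a, g in zip(first_answers, gold)) / len(gold)

def maj1_at_k(sampled_answers, gold):
    """Fraction of problems where the most common of k sampled answers matches gold."""
    correct = 0
    for answers, g in zip(sampled_answers, gold):
        top_answer, _ = Counter(answers).most_common(1)[0]
        correct += (top_answer == g)
    return correct / len(gold)

# Toy data: three problems, three samples each (k = 3).
gold = ["5", "12", "7"]
samples = [["5", "5", "6"], ["11", "12", "12"], ["8", "8", "7"]]

p1 = pass_at_1([s[0] for s in samples], gold)   # uses only the first sample
m3 = maj1_at_k(samples, gold)                   # uses all three samples
```

On this toy data the vote recovers the second problem, where the first sample was wrong, at the price of three model calls per problem. Orca-Math's 86.81% is a pass@1 figure, so it is obtained with one call per problem.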

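With supervised fine-tuning alone, Orca-Math achieves 81.50% on GSM8K pass@1.
Iterative preference learning raises GSM8K pass@1 to 86.81% without verifiers or external tools.
In the reported settings, Orca-Math (7B, Mistral) outperforms larger models such as Llama-2-70B, WizardMath-70B, Gemini-Pro, and GPT-3.5 on GSM8K.
The 200K synthetic dataset yields competitive performance with markedly less data than many baselines.
Ablations show that performance drops when only teacher-generated positives are used, and that model-generated positives and negative signals help learning.
Under GPT-4-based exact-match evaluation, Orca-Math also shows strong results on math benchmarks beyond GSM8K (AddSub, ASDiv, MultiArith, SingleOp, SingleEq, SVAMP Structured).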
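
The preference learning behind the gain from 81.50% to 86.81% uses DPO and KTO, according to the method summary. As a rough illustration of how a chosen/rejected pair turns into a training signal, the following sketch computes the standard per-pair DPO loss from sequence log-probabilities. The function name, the example log-probabilities, and `beta=0.1` are assumptions for illustration and are not taken from the paper. KTO uses a different objective that is not shown.

```python
import math

def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    margin compares how much the policy has moved toward the chosen
    solution versus the rejected one, relative to a frozen reference model.
    beta=0.1 is an assumed value, not one reported in the review.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# A policy identical to the reference gives log(2).
baseline = dpo_pair_loss(-1.0, -1.0, -1.0, -1.0)
# A policy that raised the chosen solution and lowered the rejected one.
improved = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls below the log(2) baseline as the policy assigns relatively more probability to the chosen solution, which is the direction the negative samples are meant to push the model.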


This review was generated by AI and reviewed by a human editor.