QUICK REVIEW

[Paper Review] Chinese Tiny LLM: Pretraining a Chinese-Centric Large Language Model

Xinrun Du, Zhouliang Yu | arXiv (Cornell University) | 2024-04-05

Natural Language Processing Techniques | Citations: 5

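One-Line Summary

CT-LLM is a 2B-parameter LLM pretrained from scratch primarily on Chinese data (800B Chinese tokens) that achieves strong Chinese ability and competitive multilingual performance, and is released with open-source data, the CHC-Bench evaluation, and SFT/DPO alignment.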

ABSTRACT

In this study, we introduce CT-LLM, a 2B large language model (LLM) that illustrates a pivotal shift towards prioritizing the Chinese language in developing LLMs. Uniquely initiated from scratch, CT-LLM diverges from the conventional methodology by primarily incorporating Chinese textual data, utilizing an extensive corpus of 1,200 billion tokens, including 800 billion Chinese tokens, 300 billion English tokens, and 100 billion code tokens. This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques. Demonstrating remarkable performance on the CHC-Bench, CT-LLM excels in Chinese language tasks, and showcases its adeptness in English through SFT. This research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages, broadening the horizons for LLM training methodologies. By open-sourcing the full process of training a Chinese LLM, including a detailed data processing procedure with the obtained Massive Appropriate Pretraining Chinese Corpus (MAP-CC), a well-chosen multidisciplinary Chinese Hard Case Benchmark (CHC-Bench), and the 2B-size Chinese Tiny LLM (CT-LLM), we aim to foster further exploration and innovation in both academia and industry, paving the way for more inclusive and versatile language models.

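Motivation and Objectives

Demonstrate that a Chinese-centric LLM can outperform English-centric models on Chinese tasks.
Provide a high-quality Chinese pretraining corpus (MAP-CC) and release the data processing pipeline.
Show the model's multilingual adaptability and English ability through supervised fine-tuning (SFT).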

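Proposed Method

Pretrain CT-LLM on a 1,254.68B-token mixture (given as 1,200 billion in the abstract) that includes roughly 800B Chinese tokens, 300B English tokens, and 100B code tokens.
Use a 32-layer transformer decoder architecture with a hidden size of 2,048, 16 attention heads, and a 4,096-token context.
Apply rotary position embeddings, SwiGLU activation, RMSNorm, and shared (tied) input-output embeddings for efficiency.
Use the baichuan2 tokenizer, a BPE tokenizer with a 125,696-token vocabulary suited to Chinese, and split numbers into individual digits.
Perform SFT on Chinese and English data, filtering it by perplexity with Qwen-7B as the scoring model.
Align the model with human preferences via Direct Preference Optimization (DPO) on a mixed Chinese/English preference dataset.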
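
The digit-level number handling mentioned in the tokenizer step can be illustrated with a small pre-tokenization sketch. This is only an illustration of the idea, not the baichuan2 tokenizer's actual implementation: the function name `split_digits` is hypothetical, and it assumes that only ASCII digits are split and that BPE would then run on the resulting pieces.

```python
import re


def split_digits(text: str) -> list[str]:
    """Split text so that every ASCII digit becomes its own piece.

    Hypothetical helper: runs of non-digit characters are kept together,
    and a subword tokenizer such as BPE would be applied to them afterwards.
    """
    # "[0-9]" matches one digit at a time; "[^0-9]+" matches a non-digit run.
    return re.findall(r"[0-9]|[^0-9]+", text)


pieces = split_digits("2024年有1200B tokens")
# The pieces always join back into the original string.
assert "".join(pieces) == "2024年有1200B tokens"
```

Splitting numbers this way means the model never sees a multi-digit number as a single opaque token, which keeps the representation of numeric values consistent.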
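
The perplexity-based filtering of SFT data can be sketched as follows. The review does not state the threshold or the direction of the filter, so both are assumptions here: the sketch keeps samples whose perplexity is at or below a hypothetical `max_ppl`, and it assumes per-token log-probabilities have already been obtained from the scoring model (Qwen-7B in the review). The names `perplexity`, `filter_by_perplexity`, and the `"logprobs"` field are illustrative.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_by_perplexity(samples: list[dict], max_ppl: float) -> list[dict]:
    """Keep samples whose perplexity does not exceed max_ppl (assumed rule)."""
    return [s for s in samples if perplexity(s["logprobs"]) <= max_ppl]


# Toy data: natural-log probabilities per token from a scoring model.
samples = [
    {"text": "fluent sample", "logprobs": [-0.1, -0.2, -0.3]},
    {"text": "noisy sample", "logprobs": [-3.0, -4.0, -5.0]},
]
kept = filter_by_perplexity(samples, max_ppl=10.0)
```

With this toy data only the first sample survives, because its mean negative log-likelihood is far lower than that of the second.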
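
For the DPO alignment step, the standard per-pair DPO objective is $-\log \sigma(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))])$, where $y_w$ is the preferred response and $y_l$ the rejected one. A minimal sketch of that loss for a single preference pair is shown below. The value `beta=0.1` is an assumed common default, not a setting reported in the review, and the function takes summed sequence log-probabilities as plain floats rather than model outputs.

```python
import math


def dpo_loss(
    policy_chosen_lp: float,
    policy_rejected_lp: float,
    ref_chosen_lp: float,
    ref_rejected_lp: float,
    beta: float = 0.1,  # assumed default, not taken from the paper
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors the chosen answer than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy prefers the chosen answer more than the reference: the loss drops.
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss falls as the policy widens the gap between chosen and rejected responses relative to the reference model, which is what pushes the model toward the preferred answers.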

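Experimental Results

Research Questions

RQ1: Can a Chinese-centric pretraining regime yield strong Chinese understanding and generation without English-centric data?
RQ2: How do SFT and DPO alignment affect CT-LLM's Chinese and multilingual capabilities?
RQ3: How does MAP-CC data preprocessing affect model quality?
RQ4: How does CT-LLM's ability to understand and follow Chinese instructions on CHC-Bench compare with other 2B models?
RQ5: How do the safety and alignment properties of CT-LLM-SFT-DPO differ from the baselines?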

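Key Results

CT-LLM's Chinese proficiency improves markedly thanks to a data composition that emphasizes Chinese content.
On multidisciplinary tasks, CT-LLM shows a smaller English-Chinese gap than some English-centric models, indicating balanced performance.
SFT-DPO alignment improves safety and preference-based responses over the baseline.
CT-LLM demonstrates competitive or superior performance on CHC-Bench for Chinese instruction following.
CT-LLM-SFT-DPO maintains strong performance on English benchmarks despite Chinese-centric pretraining.
The experimental results suggest that the 2B model offers better Chinese ability and competitive multilingual adaptability.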

This review was generated by AI and reviewed by a human editor.