QUICK REVIEW

[Paper Review] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Kunat Pipatanakul, Pittawat Taveekitworachai|arXiv (Cornell University)|2026. 01. 26.

Topic Modeling|Citations: 0

One-Line Summary

Typhoon-S presents a minimal, open post-training recipe (SFT + on-policy distillation, plus InK-GRPO) that enables adoptability and sovereign capability for Thai LLMs under academic-scale resources.

ABSTRACT

Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO -- an extension of GRPO that augments the GRPO loss with a next-word prediction loss -- improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.

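Motivation and Objectives

Define two requirements for sovereign post-training: adoptability (general instruction following) and sovereign capability (region-specific tasks).
Propose a minimal post-training recipe that combines SFT (supervised fine-tuning) and OPD (on-policy distillation) to achieve adoptability.
Introduce InK-GRPO, an extension of the GRPO loss that adds a next-token prediction term, to strengthen sovereign capability.
Use Thai as a case study to demonstrate efficiency under academic-scale compute.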
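As a rough illustration of what "GRPO loss plus a next-token prediction loss" can look like, the sketch below combines a simplified GRPO-style clipped surrogate with a cross-entropy term. It is not the authors' implementation: the function names, the sequence-level log-probabilities, the omission of a KL penalty, and the mixing weight `ntp_weight` are all assumptions made for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient surrogate averaged over the group.

    Simplification (assumption): one sequence-level log-prob per response,
    and no KL penalty toward a reference policy.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=np.float64)
                   - np.asarray(logp_old, dtype=np.float64))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

def next_token_ce_loss(logits, targets):
    """Mean cross-entropy of target tokens; logits is (T, V), targets is (T,)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=int)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

def ink_grpo_loss(logp_new, logp_old, rewards, ntp_logits, ntp_targets,
                  ntp_weight=0.1):
    """Hypothetical combined objective: GRPO surrogate + weighted next-token CE.

    ntp_weight is an assumed hyperparameter, not a value from the paper.
    """
    advantages = group_relative_advantages(rewards)
    rl_term = grpo_surrogate_loss(logp_new, logp_old, advantages)
    ce_term = next_token_ce_loss(ntp_logits, ntp_targets)
    return rl_term + ntp_weight * ce_term
```

The additive form keeps the reinforcement signal and the knowledge-injection signal separable, so the weight on the cross-entropy term can be tuned without touching the GRPO part.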

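Proposed Method

Two-stage adoptability pipeline: SFT on general instructions and tool use, followed by on-policy distillation (OPD) from a teacher model.
Build a compact Thai-centric language dataset and augment the target-language data with constrained AutoIF-style prompts.
Use full-logits distillation (compared against Top-K) in a single-node, memory-efficient OPD framework that integrates teacher logits into the training loop.
For sovereign capability, extend GRPO to InK-GRPO by adding a cross-entropy next-token loss, aiming to improve domain-specific knowledge and Thai legal reasoning.
Evaluate on a broad Thai-English multilingual benchmark suite that includes MT-Bench, IFEval, MMLU Pro X (Thai), OpenThaiEval, MATH500 (Thai), LiveCodeBench, BFCL, and HotpotQA.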
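The full-logits versus Top-K distillation choice mentioned above can be made concrete with a small NumPy sketch. It assumes a forward KL divergence, KL(teacher || student), and a Top-K variant that renormalizes both distributions over the teacher's K most likely tokens. The review does not specify which divergence or truncation scheme the paper uses, so both are illustrative assumptions, as are the function names.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def full_logits_kl(teacher_logits, student_logits):
    """Mean per-token KL(teacher || student) over the full vocabulary."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))

def top_k_kl(teacher_logits, student_logits, k):
    """Mean per-token KL restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the selected tokens.
    Other Top-K variants (e.g. lumping the remaining mass into one bucket) exist.
    """
    teacher_logits = np.asarray(teacher_logits, dtype=np.float64)
    student_logits = np.asarray(student_logits, dtype=np.float64)
    # Indices of the k largest teacher logits at each position.
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]
    t = log_softmax(np.take_along_axis(teacher_logits, idx, axis=-1))
    s = log_softmax(np.take_along_axis(student_logits, idx, axis=-1))
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))
```

In on-policy distillation, the token positions fed to either function come from sequences sampled by the student itself. The Top-K variant only needs K teacher values per position instead of a full vocabulary row, which is the memory trade-off being weighed against the full-logits signal.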

Figure 1 : Overview of the target-language dataset construction pipeline for Thai.

Experimental Results

Research Questions

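RQ1: Can SFT alone deliver strong performance, or is OPD needed for robustness?
RQ2: Is full-logits distillation essential, or is Top-K distillation sufficient across tasks?
RQ3: Is the target-language dataset needed at every stage, and how does it affect Thai tasks?
RQ4: Does the recipe also work when applied to a sovereign-adapted base model (ThaiLLM-8B) as well as a general-purpose base model?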

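Key Results

SFT alone underperforms the full SFT+OPD recipe, with marked weaknesses in Thai code-switching and tool use.
Full-logits OPD generally achieves higher average performance than Top-K distillation, especially on Thai code-switching tasks.
Target-language data is essential for SFT to learn Thai alignment, and in OPD it mainly improves Thai-native tasks.
Applying the recipe to the sovereign-adapted base (ThaiLLM-8B) yields competitive Thai-centric results and can surpass some baselines on Thai-native metrics.
Typhoon-S achieves strong Thai-specific performance while keeping English capability at a comparable level, showing that the recipe is effective under academic-scale resources (for an 8B model, about 2 days on 8 H100 GPUs; 1 day on 4 H100 GPUs).
When starting from a sovereignty-focused base, the method preserves local-language strengths and improves agentic capabilities in Thai contexts.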
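To put the reported training times in more familiar units, the two configurations can be converted to GPU-hours. This is plain arithmetic on the approximate figures quoted above. The review does not say which training stage each configuration corresponds to, so none is assumed here.

```latex
\begin{align*}
8 \text{ GPUs} \times 2 \text{ days} \times 24 \,\tfrac{\text{h}}{\text{day}} &\approx 384 \text{ H100 GPU-hours} \\
4 \text{ GPUs} \times 1 \text{ day} \times 24 \,\tfrac{\text{h}}{\text{day}} &= 96 \text{ H100 GPU-hours}
\end{align*}
```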

This review was generated by AI and reviewed by a human editor.