QUICK REVIEW

[Paper Review] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Kunat Pipatanakul, Pittawat Taveekitworachai | arXiv (Cornell University) | Jan 26, 2026

Topic Modeling | Citations: 0

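One-Line Summary

Typhoon-S provides a minimal, open post-training recipe (SFT plus on-policy distillation, followed by InK-GRPO) for achieving adoptability and sovereign capability in Thai LLMs under academic-scale resources.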

ABSTRACT

Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO -- an extension of GRPO that augments the GRPO loss with a next-word prediction loss -- improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.

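Motivation and Objectives

Define two requirements for sovereign post-training: adoptability (general instruction following) and sovereign capability (region-specific tasks).
Propose a minimal post-training recipe that achieves adoptability by combining supervised fine-tuning (SFT) with on-policy distillation (OPD).
Introduce InK-GRPO, an extension of the GRPO loss with an added next-token prediction term, to strengthen sovereign capability.
Validate the approach with Thai as a case study and show its efficiency under academic-scale compute.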

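Proposed Method

Two-stage adoptability pipeline: SFT on general instructions and tool use, followed by OPD from a teacher model.
Build a compact, Thai-focused dataset that includes target-language data, and augment that data using constrained AutoIF-style prompts.
Run full-logit distillation (compared against Top-K) with a memory-efficient OPD framework on a single node, integrating teacher logits into the training loop.
For sovereign capability, extend GRPO to InK-GRPO by adding a cross-entropy next-token loss that improves domain-specific knowledge and Thai legal reasoning.
Evaluate on a broad set of Thai-English multilingual benchmarks, including MT-Bench, IFEval, MMLU Pro X (Thai), OpenThaiEval, MATH500 (Thai), LiveCodeBench, BFCL, and HotpotQA.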
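The InK-GRPO idea described above (a GRPO objective plus a next-token cross-entropy term) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the review does not give the exact formulation, so the sequence-level clipped surrogate, the omission of a KL penalty, the function names, and the weight `ce_weight` are all assumptions made for illustration.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped policy-gradient surrogate on sequence-level log-probs.

    Simplification (assumption): no KL penalty and no per-token averaging.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

def next_token_ce(token_logprobs):
    """Mean negative log-likelihood of the reference tokens of a knowledge text."""
    return -np.mean(np.asarray(token_logprobs, dtype=float))

def ink_grpo_loss(logp_new, logp_old, rewards, knowledge_token_logprobs, ce_weight=0.1):
    """Hypothetical combined loss: GRPO surrogate + ce_weight * next-token CE."""
    adv = group_relative_advantages(rewards)
    rl_term = grpo_surrogate_loss(logp_new, logp_old, adv)
    return rl_term + ce_weight * next_token_ce(knowledge_token_logprobs)

# Example: four sampled answers to one prompt, two of them rewarded.
loss = ink_grpo_loss(
    logp_new=[-5.0, -6.0, -4.0, -7.0],
    logp_old=[-5.0, -6.0, -4.0, -7.0],
    rewards=[1.0, 0.0, 1.0, 0.0],
    knowledge_token_logprobs=[-1.0, -3.0],
    ce_weight=0.5,
)
print(loss)
```

In this sketch the reinforcement term only carries signal when rewards differ within a group, while the cross-entropy term always pulls the model toward the in-domain text, which is one plausible reading of how the added loss injects knowledge during RFT.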

Figure 1 : Overview of the target-language dataset construction pipeline for Thai.

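Experimental Results

Research Questions

RQ1: Can SFT alone achieve strong performance, or is OPD needed for robustness?
RQ2: Is full-logit distillation essential, or is Top-K distillation sufficient across tasks?
RQ3: Is the target-language dataset needed at every stage, and how does it affect Thai tasks?
RQ4: Does the recipe work both on a sovereign-adapted base (ThaiLLM-8B) and on a general-purpose base model?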

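Key Findings

SFT alone underperforms the full SFT+OPD recipe and is especially brittle on Thai code-switching and tool use.
OPD with full-logit distillation tends to reach higher average performance than Top-K distillation, most notably on Thai code-switching tasks.
Target-language data is essential for SFT to learn Thai alignment; under OPD it mainly improves Thai-native tasks.
Applying the recipe to the sovereign-adapted base (ThaiLLM-8B) yields competitive Thai-centric results and sometimes surpasses some baselines on Thai-native metrics.
Typhoon-S keeps English capability at a comparable level while strengthening Thai-specific performance, showing effectiveness under academic-scale resources (about 2 days for an 8B model on 8 H100s; 1 day on 4 H100s).
When starting from a sovereignty-oriented base, the recipe preserves local-language strengths while improving agentic capabilities in Thai contexts.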
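To make the full-logit versus Top-K comparison concrete, the sketch below contrasts the two distillation losses on toy logits. It is an illustration under assumptions, not the paper's training code: the divergence direction (student-to-teacher KL), the renormalization over the teacher's top-k tokens, and all function names are choices made here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def full_logit_kl(student_logits, teacher_logits):
    """Mean per-position KL(student || teacher) over the whole vocabulary."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return float(np.mean(np.sum(np.exp(s) * (s - t), axis=-1)))

def top_k_kl(student_logits, teacher_logits, k):
    """Same divergence, restricted to the teacher's top-k tokens and renormalized."""
    student_logits = np.asarray(student_logits, dtype=float)
    teacher_logits = np.asarray(teacher_logits, dtype=float)
    idx = np.argsort(teacher_logits, axis=-1)[..., -k:]  # teacher's k largest logits
    s = log_softmax(np.take_along_axis(student_logits, idx, axis=-1))
    t = log_softmax(np.take_along_axis(teacher_logits, idx, axis=-1))
    return float(np.mean(np.sum(np.exp(s) * (s - t), axis=-1)))

# One position, vocabulary of four tokens. The student matches the teacher on
# the two most likely tokens but puts too much mass on the tail.
teacher = [[5.0, 4.0, -3.0, -3.0]]
student = [[5.0, 4.0, 2.0, 2.0]]
print("full-logit KL:", full_logit_kl(student, teacher))
print("top-2 KL:     ", top_k_kl(student, teacher, k=2))
```

In this toy case the Top-K loss sees no disagreement because the mismatch lives entirely in the tail, whereas the full-logit loss penalizes it. Whether this mechanism explains the gap reported above is not established by the review.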


This review was generated by AI and checked by a human editor.