QUICK REVIEW

[Paper Walkthrough] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models

Kunat Pipatanakul, Pittawat Taveekitworachai|arXiv (Cornell University)|Jan 26, 2026
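Topic Modeling | Cited by 0
One-Sentence Summary

Typhoon-S offers a minimal, open post-training recipe (SFT plus on-policy distillation, followed by InK-GRPO) that delivers adoptability and sovereign capability for Thai language models under academic-scale resources.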

ABSTRACT

Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO -- an extension of GRPO that augments the GRPO loss with a next-word prediction loss -- improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities. Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.

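Motivation and Objectives

  • Define two requirements for sovereign post-training: adoptability (general-purpose instruction following) and sovereign capability (region-specific tasks)
  • Propose a minimal post-training recipe that combines supervised fine-tuning (SFT) with on-policy distillation (OPD) to achieve adoptability
  • Introduce InK-GRPO, an extension of GRPO that adds a next-word prediction loss to the GRPO loss to improve sovereign capability
  • Use Thai as a case study to demonstrate efficiency under academic-scale compute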

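Proposed Method

  • Two-stage adoptability pipeline: SFT on general instructions and tool use, followed by on-policy distillation (OPD) from a teacher model
  • Build a compact Thai-focused language dataset, using constrained AutoIF-style prompts to augment the target-language data
  • Use full-logits distillation (compared against Top-K) in a single-node, memory-efficient OPD framework that brings the teacher logits into the training loop
  • For sovereign capability, extend GRPO to InK-GRPO by adding a cross-entropy next-word loss, improving domain-specific knowledge and Thai legal reasoning
  • Evaluate on a broad set of Thai and English benchmarks covering MT-Bench, IFEval, MMLU Pro X (Thai), OpenThaiEval, MATH500 (Thai), LiveCodeBench, BFCL, and HotpotQA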
Figure 1 : Overview of the target-language dataset construction pipeline for Thai.
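
The InK-GRPO idea described above (a GRPO loss augmented with a next-word prediction loss) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the weight `ntp_weight`, and the choice to compute the next-token term on a separate batch of in-domain text are all assumptions made for illustration. The GRPO term is also simplified: it keeps only the group-relative advantage and omits the clipped importance ratio and KL penalty of the full algorithm.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def grpo_policy_loss(seq_logprobs, rewards):
    """Simplified policy-gradient term: -mean(A_i * log pi(y_i | x)).

    seq_logprobs[i] is the summed log-probability of sampled response i.
    """
    advantages = group_relative_advantages(rewards)
    return -mean([a * lp for a, lp in zip(advantages, seq_logprobs)])


def next_token_loss(token_logprobs):
    """Cross-entropy on reference text: mean negative log-likelihood per token."""
    return -mean(token_logprobs)


def ink_grpo_loss(seq_logprobs, rewards, corpus_token_logprobs, ntp_weight=0.1):
    """GRPO term plus a weighted next-token term.

    ntp_weight is an assumed hyperparameter, not a value from the paper.
    """
    policy_term = grpo_policy_loss(seq_logprobs, rewards)
    ntp_term = next_token_loss(corpus_token_logprobs)
    return policy_term + ntp_weight * ntp_term


if __name__ == "__main__":
    # Four sampled responses to one prompt, with binary rewards.
    loss = ink_grpo_loss(
        seq_logprobs=[-2.0, -3.0, -1.0, -4.0],
        rewards=[1.0, 0.0, 1.0, 0.0],
        corpus_token_logprobs=[-1.0, -3.0],
    )
    print(round(loss, 4))
```

When every response in a group receives the same reward, the advantages are all zero and only the next-token term contributes to the loss.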

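Experimental Results

Research Questions

  • RQ1: Can SFT alone reach strong performance, or is OPD needed to improve robustness?
  • RQ2: Is full-logits distillation necessary, or is Top-K distillation sufficient across tasks?
  • RQ3: Is the target-language dataset needed at every stage, and how does it affect Thai tasks?
  • RQ4: Does the recipe remain effective when the base model is a sovereign-adapted one (ThaiLLM-8B) rather than a general-purpose base model?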
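
To make the contrast in RQ2 concrete, the sketch below compares a per-position distillation loss computed over the full vocabulary with one restricted to the teacher's top-K tokens. It is an illustration under stated assumptions, not the paper's code: the reverse KL direction (student relative to teacher), the renormalization over the top-K subset, and all function names are assumptions. In on-policy distillation the positions would come from sequences sampled by the student itself; the sketch shows only the loss at a single position.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def full_logit_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the whole vocabulary."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))


def top_k_kl(student_logits, teacher_logits, k):
    """Same divergence, restricted to the teacher's k highest-scoring tokens.

    Both distributions are renormalized over that subset.
    """
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    top = order[:k]
    student_top = softmax([student_logits[i] for i in top])
    teacher_top = softmax([teacher_logits[i] for i in top])
    return kl_divergence(student_top, teacher_top)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0, -1.0]
    # The student agrees with the teacher on tokens 0 and 1 but puts
    # most of its mass on token 2, which lies outside the teacher's top-2.
    student = [2.0, 1.0, 5.0, -1.0]
    print(full_logit_kl(student, teacher))   # clearly positive
    print(top_k_kl(student, teacher, k=2))   # zero: the mismatch is invisible
```

The toy example shows one way a top-K restriction can lose signal: probability mass that the student places outside the teacher's top-K set goes unpenalized, whereas the full-vocabulary loss sees it.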

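Key Findings

  • SFT alone lags behind the full SFT+OPD recipe and is especially brittle on Thai code-switching and tool use
  • OPD with full-logits distillation tends to achieve higher average performance than Top-K distillation, especially on Thai code-switching tasks
  • Target-language data is essential for SFT to learn Thai alignment, and under OPD it mainly improves Thai-native tasks
  • Applying the recipe to a sovereign-adapted base model (ThaiLLM-8B) yields competitive results on Thai-centric evaluations and can surpass some baselines on Thai-native metrics
  • Typhoon-S achieves strong Thai-specific performance while keeping English capability comparable, and it is effective under academic-scale resources (about 2 days on 8 H100 GPUs for an 8B model; about 1 day on 4 H100 GPUs)
  • When starting from a sovereignty-focused base model, the method preserves local-language strengths and improves agentic capabilities in Thai contexts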
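
For comparison with other training budgets, the compute figures quoted in the findings can be restated in GPU-hours. The helper `gpu_hours` is hypothetical, and because the quoted durations are approximate the results are rough estimates only.

```python
def gpu_hours(num_gpus, days):
    """Convert a (GPU count, wall-clock days) budget into GPU-hours."""
    return num_gpus * days * 24


if __name__ == "__main__":
    print(gpu_hours(8, 2))  # 8 H100s for about 2 days -> 384
    print(gpu_hours(4, 1))  # 4 H100s for about 1 day  -> 96
```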


This walkthrough was generated by AI and reviewed by a human editor.