QUICK REVIEW

[Paper Review] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu, Yimin Du|arXiv (Cornell University)|Mar 19, 2026

Natural Language Processing Techniques | Cited by: 0

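One-Sentence Summary

VEPO introduces a variable-entropy reinforcement learning framework with verifiable rewards that improves tokenization, translation quality, and output reliability for low-resource languages while preserving general reasoning ability.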

ABSTRACT

Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200, COMET-22, chrF directions demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.

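Motivation and Objectives

Address inefficient tokenization and data imbalance in low-resource languages.
Develop a tokenizer-augmented continued pre-training pipeline to improve subword efficiency.
Introduce a variable entropy mechanism that balances literal fidelity against semantic naturalness in translation.
Use Reinforcement Learning with Verifiable Rewards (RLVR) to enforce deterministic structural constraints during training.
Demonstrate state-of-the-art translation performance on FLORES-200 directions while preserving general reasoning ability.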

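Proposed Method

Tokenizer-driven continued pre-training (CPT) that enlarges the vocabulary with language-specific tokens (Qwen2.5-7B to Qwen2.5-7B-8Langs).
Balanced multilingual training with a 1:1 ratio of English to low-resource corpora to prevent forgetting.
Post-training alignment through supervised fine-tuning (SFT) with a three-stage curriculum over bilingual and multilingual data.
Variable Entropy Policy Optimization (VEPO): a clipped surrogate loss with dynamic entropy regularization and asymmetric clipping.
RLVR-based trajectory filtering that prunes linguistically ill-formed samples and enforces the structural constraints.
Entropy-aware, temperature-consistent policy updates, combined with token-contribution balancing and communication-efficient advantage normalization.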
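
The clipped surrogate loss with asymmetric clipping and entropy regularization can be illustrated with a small sketch. This summary does not give the paper's exact objective, so everything below is an assumption: the per-token PPO-style form, the function names `asymmetric_clipped_loss` and `update_entropy_coef`, the default bounds `eps_low` and `eps_high`, and the simple multiplicative rule for adjusting the entropy coefficient `beta` are all hypothetical and stand in for whatever VEPO actually uses.

```python
import math

def asymmetric_clipped_loss(logp_new, logp_old, advantages, entropies,
                            eps_low=0.2, eps_high=0.28, beta=0.01):
    """Toy per-token clipped surrogate loss (assumed form, not the paper's).

    The ratio is clipped to [1 - eps_low, 1 + eps_high]; choosing
    eps_high > eps_low leaves more room to raise the probability of
    well-rewarded tokens than to lower it.
    """
    surrogate = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic (minimum) choice between clipped and unclipped terms.
        surrogate.append(min(ratio * adv, clipped * adv))
    policy_term = sum(surrogate) / len(surrogate)
    entropy_term = sum(entropies) / len(entropies)
    # Minimizing this loss maximizes the surrogate plus the entropy bonus.
    return -(policy_term + beta * entropy_term)

def update_entropy_coef(beta, mean_entropy, target_entropy,
                        step=0.1, beta_min=0.0, beta_max=0.1):
    """Hypothetical 'dynamic' rule: raise beta when entropy is below target."""
    if mean_entropy < target_entropy:
        beta *= 1.0 + step
    else:
        beta *= 1.0 - step
    return min(max(beta, beta_min), beta_max)
```

With a positive advantage, a ratio of 2 is clipped to 1.28 under these defaults, so the update stops rewarding further probability increases for that token; with a negative advantage, a ratio of 0.5 is held at 0.8. The asymmetric bounds are one common way to keep exploration alive while limiting collapse, which matches the intent described above but not necessarily the paper's exact mechanism.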

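Experimental Results

Research Questions

RQ1: Does the tokenization improvement from the extended tokenizer reduce subword fragmentation in low-resource scripts?
RQ2: Does VEPO's variable entropy mechanism effectively trade off literal fidelity against semantic naturalness in multilingual translation?
RQ3: Does RLVR-enforced optimization stabilize training and improve output determinism without sacrificing general reasoning ability?
RQ4: How does VEPO perform on BLEU, COMET, and chrF across FLORES-200 directions compared with translation-centric baselines?
RQ5: How does VEPO affect output length control and the reduction of verbosity bias?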
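
Subword fragmentation, the subject of RQ1, is often quantified as fertility: the average number of tokens a tokenizer produces per word. The sketch below shows that measurement on a toy greedy longest-match tokenizer. It is an illustration only: the summary does not say which metric the paper reports, and the helper names, the toy vocabularies, and the two sample words are hypothetical and not taken from the paper's language set.

```python
def make_greedy_tokenizer(vocab):
    """Build a toy longest-match tokenizer; unknown text falls back to characters."""
    def tokenize(word):
        tokens, i = [], 0
        while i < len(word):
            # Try the longest remaining piece first, down to one character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
        return tokens
    return tokenize

def fertility(words, tokenize):
    """Average number of subword tokens per word (lower means less fragmentation)."""
    return sum(len(tokenize(w)) for w in words) / len(words)

# Hypothetical vocabularies: the extended one adds whole-word entries,
# mimicking a tokenizer expanded with language-specific tokens.
base_vocab = {"ha", "ri"}
extended_vocab = base_vocab | {"habari", "dunia"}
sample_words = ["habari", "dunia"]

base_fertility = fertility(sample_words, make_greedy_tokenizer(base_vocab))
extended_fertility = fertility(sample_words, make_greedy_tokenizer(extended_vocab))
```

On this toy input the base vocabulary splits "habari" into four pieces and "dunia" into five, while the extended vocabulary keeps each word whole, so fertility drops from 4.5 to 1.0. Real tokenizers and scripts without whitespace word boundaries need a more careful definition of "word".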

Key Findings

Model	X → E (BLEU/COMET/chrF)	E → X (BLEU/COMET/chrF)	X → X (BLEU/COMET/chrF)	Avg. (BLEU/COMET/chrF)
VEPO-7B (Full)	36.1/.881/62.7	32.7/.882/56.2	23.1/.854/48.8	24.9/.859/50.9
VEPO-7B w/o CPT	33.3/.862/56.8	31.7/.863/51.8	21.4/.822/43.6	23.7/.837/46.9
VEPO-7B-SFT	35.4/.875/59.8	32.0/.875/52.9	22.7/.839/44.5	24.3/.849/48.3
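
To make the ablation gaps in the table easier to read, the short sketch below computes differences between the Avg. columns of the three rows. The numbers are copied from the table; the dictionary layout and the helper `delta` are illustrative names, and no statistical significance is implied.

```python
# Avg. column of the table above: (BLEU, COMET, chrF) per model.
avg_scores = {
    "VEPO-7B (Full)": (24.9, 0.859, 50.9),
    "VEPO-7B w/o CPT": (23.7, 0.837, 46.9),
    "VEPO-7B-SFT": (24.3, 0.849, 48.3),
}

def delta(model_a, model_b):
    """Return (BLEU, COMET, chrF) differences of model_a minus model_b."""
    return tuple(
        round(a - b, 3)
        for a, b in zip(avg_scores[model_a], avg_scores[model_b])
    )

full_vs_no_cpt = delta("VEPO-7B (Full)", "VEPO-7B w/o CPT")
full_vs_sft = delta("VEPO-7B (Full)", "VEPO-7B-SFT")
```

On the average column, the full model is ahead of the variant without CPT by 1.2 BLEU, 0.022 COMET, and 4.0 chrF, and ahead of the SFT row by 0.6 BLEU, 0.01 COMET, and 2.6 chrF.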

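VEPO achieves a high level of constraint satisfaction across language consistency, length, format, and code-mixing (95.3% overall in Table 1).
VEPO (Full) reaches state-of-the-art translation quality among open-source 7B models on FLORES-200 directions (average BLEU 24.9, average COMET 0.859, average chrF 50.9).
Tokenizer CPT combined with VEPO yields clear gains over the variant without CPT on the translation benchmarks; for example, average BLEU rises from 23.7 (VEPO-7B w/o CPT) to 24.9 (Full).
On general reasoning benchmarks (BBH, CMMLU, HellaSwag, MMLU), VEPO matches or outperforms the SFT baseline, indicating that instruction-following ability is preserved.
Human evaluation shows that VEPO's translations are preferred across many language pairs, with raters recognizing their consistency in semantic accuracy and natural paraphrasing.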
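
The constraint categories in the first finding (language consistency, length, format, code-mixing) are the kind of properties a rule-based, verifiable reward can check without a learned judge. The sketch below is a minimal illustration under stated assumptions: the function names, the thresholds, the single-line format rule, and the use of Unicode character names as a script detector are all hypothetical and are not the paper's verifier.

```python
import unicodedata

def script_ratio(text, script_prefix):
    """Fraction of alphabetic characters whose Unicode name starts with script_prefix."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(script_prefix) for ch in letters)
    return hits / len(letters)

def verifiable_reward(source, output, script_prefix,
                      max_len_ratio=2.0, min_script_ratio=0.9):
    """Return (reward, checks): reward is 1.0 only if every rule passes."""
    stripped = output.strip()
    checks = {
        # Output must contain something.
        "non_empty": bool(stripped),
        # Length control: output may not exceed a multiple of the source length.
        "length": len(output) <= max_len_ratio * max(len(source), 1),
        # Format (assumed rule): a single line with no extra commentary lines.
        "format": "\n" not in stripped,
        # Language consistency / code-mixing: most letters in the target script.
        "script": script_ratio(output, script_prefix) >= min_script_ratio,
    }
    reward = 1.0 if all(checks.values()) else 0.0
    return reward, checks
```

For instance, with `script_prefix="THAI"`, an output written entirely in Thai script passes, while an output that keeps the English source text in front of the translation fails the script check and receives a reward of 0.0. Trajectories scoring 0.0 could then be filtered out before the policy update, in the spirit of the RLVR-based filtering described in the method section.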

This review was generated by AI and reviewed by a human editor.