QUICK REVIEW

[Paper Digest] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Thales Sales Almeida, Rodrigo Nogueira|arXiv (Cornell University)|Mar 25, 2026

Natural Language Processing Techniques | Cited by 0

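One-Sentence Summary

The paper shows that synthetic document rewriting improves Portuguese continued pretraining, most clearly when the source data is high quality and the model is larger, indicating that rewriting acts as a quality multiplier rather than a substitute for data curation.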

ABSTRACT

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

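Research Motivation and Goals

Evaluate, under a fixed token budget, how synthetic rewriting interacts with source data quality in Portuguese continued pretraining.
Determine whether rewriting acts as a quality multiplier rather than a substitute for data filtering.
Analyze how model scale (1.1B vs. 7B) changes the effectiveness of rewriting across tasks.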

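Proposed Method

Build two 10B-token source subsets from ClassiCC-PT: high quality (STEM/Educational score >2.5) and low quality (score 0.5–2.0).
Rewrite each document into four styles with a 7B instruction-tuned model, yielding ~30B tokens per condition, then combine these with the original 10B tokens (40B total) for training.
Continue pretraining two English-centric base models (1.1B TinyLLaMA and 7B LLaMA-2) on each condition.
Evaluate all models on PoETa V2 (44 Portuguese tasks) using the normalized NPM score.
Compare the conditions at both model scales to isolate the effects of data quality and rewriting.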
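The subset-construction step can be illustrated with a minimal sketch. It assumes each document carries a single quality score and a precomputed token count. The `Doc` class, the `quality` field, the inclusive low-quality bounds, and the greedy fill order are all hypothetical choices made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical single score standing in for the STEM/Educational annotations
    n_tokens: int   # token count, assumed to be precomputed


def is_high_quality(score: float) -> bool:
    # Threshold from the summary: score > 2.5
    return score > 2.5


def is_low_quality(score: float) -> bool:
    # Range from the summary: 0.5 to 2.0 (inclusive bounds are an assumption)
    return 0.5 <= score <= 2.0


def select_subset(docs: Iterable[Doc],
                  keep: Callable[[float], bool],
                  token_budget: int) -> List[Doc]:
    """Greedily collect matching documents until the token budget would be exceeded."""
    chosen: List[Doc] = []
    used = 0
    for doc in docs:
        if not keep(doc.quality):
            continue
        if used + doc.n_tokens > token_budget:
            break
        chosen.append(doc)
        used += doc.n_tokens
    return chosen
```

In the study's setting the budget would be 10B tokens per subset; the same function is called once with `is_high_quality` and once with `is_low_quality`.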

Figure 1: Average NPM in PoETa V2 for the 7B model across four experimental conditions as a function of training tokens.
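For readers unfamiliar with NPM, the sketch below shows one way such a normalized average could be computed. It assumes the definition used in earlier PoETa work, where each task score is rescaled between a random-chance baseline and the maximum attainable score before averaging. The summary does not spell out the formula, so treat the function and its parameter names as illustrative.

```python
from typing import Sequence


def npm(scores: Sequence[float],
        random_baselines: Sequence[float],
        max_scores: Sequence[float]) -> float:
    """Average of per-task scores rescaled so random = 0 and maximum = 100.

    Assumed definition (not stated in the summary):
        NPM = 100 / N * sum((score_i - random_i) / (max_i - random_i))
    """
    if not (len(scores) == len(random_baselines) == len(max_scores)):
        raise ValueError("all inputs must have one entry per task")
    if len(scores) == 0:
        raise ValueError("at least one task is required")
    normalized = [
        (s - r) / (m - r)
        for s, r, m in zip(scores, random_baselines, max_scores)
    ]
    return 100.0 * sum(normalized) / len(normalized)
```

Under this definition, a model at chance level on every task scores 0, and tasks with different metrics and baselines contribute on a common scale.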

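Experimental Results

Research Questions

RQ1: Does synthetic rewriting amplify the gains from Portuguese continued pretraining more when starting from high-quality data than from low-quality data?
RQ2: How does model scale (1.1B vs. 7B) affect the effectiveness of rewriting and its interaction with data quality?
RQ3: Are the gains from rewriting uniform across task categories, or concentrated in knowledge-intensive or culturally grounded tasks?
RQ4: Does rewriting merely add token diversity, or does data quality drive the observed gains?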

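Key Findings

At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified.
At the 7B scale, rewriting low-quality data yields only a +0.5 NPM gain over the same data unmodified.
At the 1.1B scale, the interaction between quality and rewriting is weaker and inconsistent: unmodified low-quality data performs comparably to rewritten high-quality data.
High-quality rewritten data supports learning for longer (the Educational + rewriting condition shows no sign of convergence at 30B tokens).
Category-level analysis shows the largest quality effects on exams and Brazil-specific tasks; ethics benefits from rewriting in all conditions; commonsense knowledge is slightly hurt by rewriting; social media tasks remain strong even with lower-quality data.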
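The "quality multiplier" reading of the 7B results can be made concrete with a small calculation. The sketch uses only the two reported gains (+3.4 and +0.5 NPM); the variable names and the difference-of-gains framing are illustrative, not the paper's own analysis.

```python
# Reported NPM gains from rewriting at the 7B scale (rewritten minus unmodified).
gain_high_quality = 3.4
gain_low_quality = 0.5

# Interaction contrast: how much more rewriting helps when the source is high quality.
interaction = gain_high_quality - gain_low_quality

# Ratio of the two gains, another way to express the same asymmetry.
gain_ratio = gain_high_quality / gain_low_quality

print(f"interaction: {interaction:+.1f} NPM")
print(f"gain ratio: {gain_ratio:.1f}x")
```

If rewriting simply added a fixed benefit regardless of source quality, the interaction would be near zero. A clearly positive value is what the multiplier interpretation predicts.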


This digest was generated by AI and reviewed by a human editor.