QUICK REVIEW

[Paper Review] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Thales Sales Almeida, Rodrigo Nogueira | arXiv (Cornell University) | Mar 25, 2026

Natural Language Processing Techniques | Citations: 0

One-Line Summary

The paper shows that synthetic document rewriting improves Portuguese continued pretraining mainly when the source data is already high quality and the model is larger, indicating that rewriting acts as a quality multiplier rather than a substitute for data curation.

ABSTRACT

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

Motivation and Objectives

Assess how synthetic rewriting interacts with source data quality under a fixed token budget for Portuguese continued pretraining.
Determine if rewriting acts as a quality multiplier rather than a substitute for data curation.
Analyze how model scale (1.1B vs 7B) changes the effectiveness of rewriting across tasks.

Proposed Method

Construct two 10B-token source subsets from ClassiCC-PT: high-quality (STEM/Educational >2.5) and low-quality (0.5–2.0).
Rewrite each document into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition; the "+ Rewrites" conditions train on these rewrites together with the original documents.
Continue pretraining two English-centric base models (TinyLLaMA at 1.1B and LLaMA-2 at 7B parameters) on each condition under a fixed token budget.
Evaluate all models on PoETa V2 (44 Portuguese tasks) using the Normalized Preferred Metric (NPM).
Compare performance across conditions to isolate effects of data quality and rewriting at two model scales.
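The design above amounts to a small grid of training runs. The following is a minimal sketch of how the two source subsets and the run grid could be enumerated. The function name `quality_bucket`, the use of a single per-document score, and the inclusive bounds on the low-quality range are all assumptions made for illustration; the review does not say how the STEM and Educational scores are combined.

```python
from itertools import product

# Thresholds taken from the review. Treating them as bounds on a single
# per-document quality score is an assumption.
HIGH_MIN = 2.5           # high-quality subset: score > 2.5
LOW_RANGE = (0.5, 2.0)   # low-quality subset: 0.5 <= score <= 2.0 (inclusive by assumption)


def quality_bucket(score):
    """Return "edu", "non-edu", or None for documents outside both subsets."""
    if score > HIGH_MIN:
        return "edu"
    if LOW_RANGE[0] <= score <= LOW_RANGE[1]:
        return "non-edu"
    return None


# 2 source qualities x 2 data treatments x 2 model scales = 8 training runs.
runs = list(product(("edu", "non-edu"),
                    ("original", "original + rewrites"),
                    ("1.1B", "7B")))

for quality, treatment, scale in runs:
    print(f"{scale:>4}  {quality:<8} {treatment}")
```

Documents scoring between the two ranges fall into neither subset in this sketch, which keeps the two quality levels clearly separated.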

Figure 1: Average NPM in PoETa V2 for the 7B model across four experimental conditions as a function of training tokens.
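The review does not define how NPM is computed. A minimal sketch, assuming the commonly used form in which each task's score is rescaled so that the random baseline maps to 0 and the maximum score maps to 100, and the rescaled scores are then averaged over tasks. The function names and the example task values are hypothetical.

```python
from statistics import mean


def normalized_score(raw, random_baseline, max_score):
    """Rescale one task score: random baseline -> 0, maximum score -> 100."""
    return 100.0 * (raw - random_baseline) / (max_score - random_baseline)


def average_npm(tasks):
    """Average the normalized scores over (raw, random_baseline, max_score) triples."""
    return mean(normalized_score(raw, rnd, mx) for raw, rnd, mx in tasks)


# Hypothetical example: a binary task scored at 0.75 accuracy and a
# four-way multiple-choice task scored exactly at chance.
example_tasks = [
    (0.75, 0.50, 1.0),
    (0.25, 0.25, 1.0),
]
print(average_npm(example_tasks))
```

Under this assumed definition, a model at chance on every task scores 0, so NPM values are comparable across tasks with different numbers of answer options.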

Experimental Results

Research Questions

RQ1: Does synthetic rewriting amplify the gains from Portuguese continued pretraining when starting from high-quality data versus low-quality data?
RQ2: How does model scale (1.1B vs 7B) influence the effectiveness of rewriting and its interaction with data quality?
RQ3: Are the benefits of rewriting uniform across task categories, or concentrated in knowledge-intensive or culturally grounded tasks?
RQ4: Does rewriting merely increase token diversity, or does data quality drive the observed gains?

Key Findings

Condition	7B Peak NPM (tokens)	1.1B Peak NPM (tokens)
Edu + Rewrites	41.0 (30B)	15.1 (25B)
Edu	38.5 (20B)	13.7 (30B)
Non-edu + Rewrites	35.8 (20B)	13.8 (30B)
Non-edu	35.2 (20B)	15.1 (20B)

At 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified.
At 7B scale, rewriting low-quality data yields only a +0.5 NPM gain over the same data unmodified.
At 1.1B scale, the quality-rewriting interaction is weaker and less consistent, with unmodified low-quality data performing comparably to rewritten high-quality data.
Rewritten high-quality data sustains learning longer: the Edu + Rewrites condition shows no sign of convergence at 30B tokens.
Category-level analyses show the largest quality effects in Exams and Brazil-specific tasks; Ethics benefits from rewriting across conditions; General Knowledge can be slightly harmed by rewriting; Social Media tasks show high performance with lower-quality data.
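The peak values in the table can be differenced directly to see the quality-rewriting interaction at each scale. This is a minimal sketch using only the numbers from the table; the function name `rewrite_gain` is hypothetical. Note that these peak-to-peak differences are not the same quantity as the +3.4 and +0.5 NPM gains quoted from the abstract, which are presumably measured at a different comparison point that this review does not specify.

```python
# Peak NPM values copied from the table above, as (7B, 1.1B).
PEAK_NPM = {
    "Edu + Rewrites": (41.0, 15.1),
    "Edu": (38.5, 13.7),
    "Non-edu + Rewrites": (35.8, 13.8),
    "Non-edu": (35.2, 15.1),
}


def rewrite_gain(source, scale_index):
    """Peak NPM of the rewritten condition minus peak NPM of the original one."""
    rewritten = PEAK_NPM[f"{source} + Rewrites"][scale_index]
    original = PEAK_NPM[source][scale_index]
    return round(rewritten - original, 1)


for source in ("Edu", "Non-edu"):
    print(f"{source}: 7B {rewrite_gain(source, 0):+.1f}, "
          f"1.1B {rewrite_gain(source, 1):+.1f}")
```

At peak, rewriting helps the high-quality source at both scales, while for the low-quality source the difference is small at 7B and negative at 1.1B, in line with the weaker and less consistent interaction reported for the smaller model.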


This review was generated by AI and checked by a human editor.