[Paper Review] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining
The paper shows that synthetic document rewriting boosts Portuguese continued pretraining performance, especially when starting from high-quality data and at larger model scales, indicating rewriting acts as a quality multiplier rather than a data quantity substitute.
Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.
Research Motivation and Objectives
- Assess how synthetic rewriting interacts with source data quality under a fixed token budget for Portuguese continued pretraining.
- Determine whether rewriting acts as a quality multiplier or as a substitute for data curation.
- Analyze how model scale (1.1B vs 7B) changes the effectiveness of rewriting across tasks.
Proposed Method
- Construct two 10B-token source subsets from ClassiCC-PT: a high-quality subset (STEM/Educational scores above 2.5) and a low-quality subset (scores between 0.5 and 2.0).
- Rewrite each document into four styles using a 7B instruction-tuned model to produce ~30B tokens per condition, then combine with 10B original tokens (40B total) for training.
- Continue pretraining two English-centric base models (TinyLlama 1.1B and Llama 2 7B) on each condition under a fixed token budget.
- Evaluate all models on PoETa V2 (44 Portuguese tasks) using the Normalized Preferred Metric (NPM).
- Compare performance across conditions to isolate effects of data quality and rewriting at two model scales.
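
The data-selection and scoring steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Doc` class, the single `quality` field, and the helper names are hypothetical, and the review does not say how the STEM and Educational scores are combined or whether the 0.5–2.0 band is inclusive. The `npm` function assumes NPM rescales each task score so that the random baseline maps to 0 and the maximum score maps to 100, then averages over tasks.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class Doc:
    text: str
    quality: float  # hypothetical: one combined quality score per document
    n_tokens: int


def is_high_quality(score: float) -> bool:
    # High-quality condition: score above 2.5.
    return score > 2.5


def is_low_quality(score: float) -> bool:
    # Low-quality condition: 0.5 to 2.0 (inclusive bounds are an assumption).
    return 0.5 <= score <= 2.0


def take_until_budget(
    docs: Iterable[Doc],
    keep: Callable[[float], bool],
    token_budget: int,
) -> Tuple[List[Doc], int]:
    """Collect matching documents in order until the next one would exceed the budget."""
    picked: List[Doc] = []
    total = 0
    for doc in docs:
        if not keep(doc.quality):
            continue
        if total + doc.n_tokens > token_budget:
            break
        picked.append(doc)
        total += doc.n_tokens
    return picked, total


def npm(task_results: Sequence[Tuple[float, float, float]]) -> float:
    """Average of per-task scores rescaled to 0 (random baseline) .. 100 (maximum).

    Each entry is (score, random_baseline, max_score); this formula is an assumption.
    """
    scaled = [100.0 * (s - rand) / (best - rand) for s, rand, best in task_results]
    return sum(scaled) / len(scaled)


if __name__ == "__main__":
    corpus = [Doc("a", 3.0, 100), Doc("b", 1.0, 50), Doc("c", 2.2, 70)]
    high, high_tokens = take_until_budget(corpus, is_high_quality, token_budget=250)
    low, low_tokens = take_until_budget(corpus, is_low_quality, token_budget=250)
    print(len(high), high_tokens, len(low), low_tokens)
```

Note that a document scoring between 2.0 and 2.5 falls into neither subset, which keeps the two quality levels clearly separated.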

Experimental Results
Research Questions
- RQ1: Does synthetic rewriting amplify the gains from Portuguese continued pretraining when starting from high-quality data versus low-quality data?
- RQ2: How does model scale (1.1B vs 7B) influence the effectiveness of rewriting and its interaction with data quality?
- RQ3: Are the benefits of rewriting uniform across task categories, or concentrated in knowledge-intensive or culturally grounded tasks?
- RQ4: Does rewriting merely increase token diversity, or does data quality drive the observed gains?
Key Findings
| Condition | 7B peak NPM (training tokens at peak) | 1.1B peak NPM (training tokens at peak) |
|---|---|---|
| Edu + Rewrites | 41.0 (30B) | 15.1 (25B) |
| Edu | 38.5 (20B) | 13.7 (30B) |
| Non-edu + Rewrites | 35.8 (20B) | 13.8 (30B) |
| Non-edu | 35.2 (20B) | 15.1 (20B) |
- At 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified.
- At 7B scale, rewriting low-quality data yields only a +0.5 NPM gain over the same data unmodified.
- At 1.1B scale, the quality-rewriting interaction is weaker and less consistent, with unmodified low-quality data performing comparably to rewritten high-quality data.
- High-quality rewritten data sustains learning longer (no convergence at 30B tokens observed for edu + rewrites).
- Category-level analyses show the largest quality effects in Exams and Brazil-specific tasks. Ethics benefits from rewriting across conditions, General Knowledge can be slightly harmed by rewriting, and Social Media tasks perform well even with lower-quality data.
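
As a quick arithmetic check, the peak values in the table above can be differenced per scale and source. This is a sketch over the table's numbers only; the dictionary layout and the `rewrite_gain` helper are illustrative, not from the paper. Because each peak is reported at a different token count, these peak-to-peak differences need not match the +3.4 and +0.5 gains quoted above, which this review does not tie to a specific checkpoint.

```python
# Peak NPM values copied from the table above, keyed by model scale and condition.
PEAK_NPM = {
    "7B": {
        "Edu + Rewrites": 41.0,
        "Edu": 38.5,
        "Non-edu + Rewrites": 35.8,
        "Non-edu": 35.2,
    },
    "1.1B": {
        "Edu + Rewrites": 15.1,
        "Edu": 13.7,
        "Non-edu + Rewrites": 13.8,
        "Non-edu": 15.1,
    },
}


def rewrite_gain(scale: str, source: str) -> float:
    """Peak NPM with rewrites minus peak NPM without, rounded to one decimal."""
    with_rewrites = PEAK_NPM[scale][f"{source} + Rewrites"]
    without = PEAK_NPM[scale][source]
    return round(with_rewrites - without, 1)


if __name__ == "__main__":
    for scale in PEAK_NPM:
        for source in ("Edu", "Non-edu"):
            print(f"{scale:>4} {source:<8} rewrite gain at peak: {rewrite_gain(scale, source):+.1f}")
```

On these peak values, rewriting helps the Edu source at both scales, while at 1.1B the rewritten Non-edu condition peaks below its unmodified counterpart, in line with the weaker and less consistent pattern reported at the smaller scale.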

This review was generated by AI and checked by a human editor.