QUICK REVIEW

[Paper Review] Synthetic Rewriting as a Quality Multiplier: Evidence from Portuguese Continued Pretraining

Thales Sales Almeida, Rodrigo Nogueira|arXiv (Cornell University)|Mar 25, 2026
Natural Language Processing Techniques | Citations: 0
One-Line Summary

The paper shows that synthetic document rewriting boosts Portuguese continued pretraining performance, especially when starting from high-quality data and at larger model scales, indicating rewriting acts as a quality multiplier rather than a data quantity substitute.

ABSTRACT

Synthetic data generation through document rewriting has emerged as a promising technique for improving language model pretraining, yet most studies focus on English and do not systematically control for the quality of the source data being rewritten. We present a controlled study of how synthetic rewriting interacts with source data quality in the context of Portuguese continued pretraining. Starting from ClassiCC-PT, a Portuguese corpus annotated with STEM and Educational quality scores, we construct two 10B-token subsets at different quality levels and rewrite each into four styles using a 7B instruction-tuned model, producing approximately 40B tokens of synthetic data per condition. We train two English-centric base models (1.1B and 7B parameters) on each condition and evaluate on PoETa V2, a comprehensive 44-task Portuguese benchmark. At the 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified, while rewriting low-quality data provides only +0.5 NPM. At the 1.1B scale, this interaction is weaker, with unmodified low-quality data performing comparably to rewritten high-quality data. Our results demonstrate that synthetic rewriting acts primarily as a quality multiplier rather than a substitute for data curation, and that this effect is scale-dependent.

Motivation and Objectives

  • Assess how synthetic rewriting interacts with source data quality under a fixed token budget for Portuguese continued pretraining.
  • Determine if rewriting acts as a quality multiplier rather than a substitute for data curation.
  • Analyze how model scale (1.1B vs 7B) changes the effectiveness of rewriting across tasks.

Proposed Method

  • Construct two 10B-token source subsets from ClassiCC-PT: high-quality (STEM/Educational scores > 2.5) and low-quality (scores of 0.5–2.0).
  • Rewrite each document into four styles using a 7B instruction-tuned model to produce ~30B tokens per condition, then combine with 10B original tokens (40B total) for training.
  • Continue pretraining two English-centric base models (1.1B TinyLLaMA and 7B LLaMA-2) on each condition under a fixed token budget.
  • Evaluate all models on PoETa V2 (44 Portuguese tasks) using the Normalized Preferred Metric (NPM).
  • Compare performance across conditions to isolate effects of data quality and rewriting at two model scales.
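
The subset construction step above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `score` and `n_tokens` fields, the inclusive bounds on the low-quality range, and the way the STEM and Educational scores are reduced to a single number are all assumptions, since the review does not specify them.

```python
from typing import Iterable, Optional

HIGH_MIN = 2.5           # high-quality: score > 2.5 (threshold from the review)
LOW_RANGE = (0.5, 2.0)   # low-quality: 0.5-2.0 (inclusive bounds are an assumption)


def quality_bucket(score: float) -> Optional[str]:
    """Map a quality score to 'high', 'low', or None (excluded)."""
    if score > HIGH_MIN:
        return "high"
    if LOW_RANGE[0] <= score <= LOW_RANGE[1]:
        return "low"
    return None  # scores below 0.5 or in (2.0, 2.5] fall in neither subset


def take_until_budget(docs: Iterable[dict], bucket: str, budget_tokens: int) -> list:
    """Collect documents from one bucket until the token budget would be exceeded.

    Each doc is a dict with hypothetical keys 'score' and 'n_tokens'.
    """
    selected, used = [], 0
    for doc in docs:
        if quality_bucket(doc["score"]) != bucket:
            continue
        if used + doc["n_tokens"] > budget_tokens:
            break
        selected.append(doc)
        used += doc["n_tokens"]
    return selected
```

In the study the budget would be 10B tokens per subset; the same function works with a toy budget for testing.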
Figure 1: Average NPM in PoETa V2 for the 7B model across four experimental conditions as a function of training tokens.

Experimental Results

Research Questions

  • RQ1: Does synthetic rewriting amplify the gains from Portuguese continued pretraining when starting from high-quality data versus low-quality data?
  • RQ2: How does model scale (1.1B vs 7B) influence the effectiveness of rewriting and its interaction with data quality?
  • RQ3: Are the benefits of rewriting uniform across task categories, or concentrated in knowledge-intensive or culturally grounded tasks?
  • RQ4: Does rewriting merely increase token diversity, or does data quality drive the observed gains?

Key Findings

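| Condition          | 7B Peak NPM (tokens) | 1.1B Peak NPM (tokens) |
|--------------------|----------------------|------------------------|
| Edu + Rewrites     | 41.0 (30B)           | 15.1 (25B)             |
| Edu                | 38.5 (20B)           | 13.7 (30B)             |
| Non-edu + Rewrites | 35.8 (20B)           | 13.8 (30B)             |
| Non-edu            | 35.2 (20B)           | 15.1 (20B)             |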
  • At 7B scale, rewriting high-quality data yields a +3.4 NPM gain over the same data unmodified.
  • At 7B scale, rewriting low-quality data yields only a +0.5 NPM gain over the same data unmodified.
  • At 1.1B scale, the quality-rewriting interaction is weaker and less consistent, with unmodified low-quality data performing comparably to rewritten high-quality data.
  • High-quality rewritten data sustains learning longer (no convergence at 30B tokens observed for edu + rewrites).
  • Category-level analyses show the largest quality effects in Exams and Brazil-specific tasks; Ethics benefits from rewriting across conditions; General Knowledge can be slightly harmed by rewriting; Social Media tasks show high performance with lower-quality data.
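
To make the NPM figures above easier to interpret, here is a small sketch of an NPM-style aggregate. The formula is an assumption based on the BIG-bench-style normalization commonly used with this kind of benchmark (each task score is rescaled so that a random baseline maps to 0 and the maximum score maps to 100, then tasks are averaged); the review itself does not spell it out, and the function name and arguments are hypothetical.

```python
from typing import Sequence


def npm(raw: Sequence[float], random_baseline: Sequence[float],
        max_score: Sequence[float]) -> float:
    """Average of per-task scores rescaled so random = 0 and max = 100.

    Assumed formula: 100 * mean((raw_i - random_i) / (max_i - random_i)).
    """
    if not raw or not (len(raw) == len(random_baseline) == len(max_score)):
        raise ValueError("inputs must be non-empty and of equal length")
    terms = [
        (r - rnd) / (mx - rnd)
        for r, rnd, mx in zip(raw, random_baseline, max_score)
    ]
    return 100.0 * sum(terms) / len(terms)


# Toy example with two tasks (values are illustrative, not from the paper):
# a 4-way multiple-choice task (random = 0.25) and a binary task (random = 0.5).
score = npm(raw=[0.5, 0.75], random_baseline=[0.25, 0.5], max_score=[1.0, 1.0])
```

Under this normalization, a model at chance on every task scores 0, so differences of a few NPM points correspond to average movement across all 44 tasks rather than to any single task.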
Figure panel (a): Brazil

This review was generated by AI and checked by a human editor.