[Paper Review] Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs
The paper introduces Feature Activation Coverage (FAC) to quantify data diversity in an interpretable SAE-based feature space and presents FAC Synthesis, a coverage-guided data generation framework that fills missing task-relevant features to improve post-training performance across multiple tasks and model families.
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.
Research Motivation and Objectives
- Motivate the need for data diversity in post-training of LLMs and move beyond text-based diversity metrics by focusing on task-relevant features.
- Propose a model-aware diversity metric (FAC) derived from a sparse autoencoder feature space.
- Develop FAC Synthesis to identify and fill missing task-relevant features via targeted data generation.
- Theoretically bound post-training generalization error and connect it to feature coverage and sampling.
- Empirically validate FAC Synthesis across multiple tasks and model families, showing data efficiency and cross-model transfer of SAE features.
Proposed Method
- Define Feature Activation Coverage (FAC) as the fraction of task-relevant SAE features activated in generated data.
- Train a Sparse Autoencoder (SAE) on model internal activations to obtain interpretable latent features.
- Formulate an objective to minimize the KL-divergence between SAE feature distributions of anchor data and generated data (Eq. 3).
- Identify missing features F_miss as features present in anchor data but not in generated data and guide generation to activate them.
- Use a two-step synthesis strategy: (i) build contrastive feature-aware prompts to create pairs that strongly/weakly activate each missing feature, (ii) generate samples conditioned on these pairs and filter by SAE activation (threshold delta).
- Provide a PAC-Bayesian and mutual-information-based analysis to bound sampling error and relate it to data uncertainty (Section 6).
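The coverage bookkeeping described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes SAE activations are already available as a matrix of shape (samples, features), it treats a feature as "activated" when at least one sample exceeds the threshold delta, and it compares normalized firing frequencies for the KL term. The function names (`active_features`, `fac`, `missing_features`, `feature_kl`, `filter_samples`) and the toy arrays are hypothetical.

```python
import numpy as np

def active_features(acts, delta):
    """Boolean mask: True where any sample activates the feature above delta.

    acts has shape (n_samples, n_features), one row of SAE activations per sample.
    The "any sample above delta" rule is an assumption made for this sketch.
    """
    return acts.max(axis=0) > delta

def fac(gen_acts, task_features, delta):
    """Fraction of task-relevant features activated by the generated data."""
    mask = active_features(gen_acts, delta)
    return float(mask[task_features].mean())

def missing_features(anchor_acts, gen_acts, delta):
    """Indices of features active in anchor data but not in generated data (F_miss)."""
    in_anchor = active_features(anchor_acts, delta)
    in_generated = active_features(gen_acts, delta)
    return np.flatnonzero(in_anchor & ~in_generated)

def feature_kl(anchor_acts, gen_acts, delta, eps=1e-8):
    """KL(anchor || generated) over normalized per-feature firing frequencies."""
    p = (anchor_acts > delta).mean(axis=0) + eps
    q = (gen_acts > delta).mean(axis=0) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def filter_samples(cand_acts, feature, delta):
    """Row indices of candidate samples that activate the target feature above delta."""
    return np.flatnonzero(cand_acts[:, feature] > delta)

# Toy activations (made up for illustration; 2 samples x 4 features each).
anchor = np.array([[0.9, 0.0, 0.7, 0.0],
                   [0.0, 0.8, 0.0, 0.0]])
generated = np.array([[0.6, 0.0, 0.1, 0.0],
                      [0.7, 0.0, 0.0, 0.0]])
candidates = np.array([[0.0, 0.9, 0.0, 0.0],
                       [0.0, 0.2, 0.0, 0.0]])
task_features = np.array([0, 1, 2])
delta = 0.5

coverage = fac(generated, task_features, delta)
f_miss = missing_features(anchor, generated, delta)
kept = filter_samples(candidates, feature=1, delta=delta)
```

In this toy setup the generated data activates only one of the three task-relevant features, so features 1 and 2 end up in `f_miss`, and only the first candidate survives the filter for feature 1. A real pipeline would obtain the activation matrices by running a trained SAE over model activations for each sample.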
Experimental Results
Research Questions
- RQ1: Does coverage-guided synthetic data improve model performance after fine-tuning?
- RQ2: Are the missing features discovered by SAE related to model performance?
- RQ3: Are SAE-identified missing features transferable across different language models?
- RQ4: Are the explanations and syntheses reasonable to humans?
- RQ5: Is the proposed framework sensitive to the selection of hyper-parameters?
Key Findings
- FAC correlates strongly with downstream performance (Pearson r = 0.95, Spearman ρ = 0.90).
- FAC Synthesis achieves comparable performance to MAGPIE with only 2,000 synthetic samples versus MAGPIE’s 150K+ samples.
- Experiments across Toxicity Detection, Reward Modeling, Behavior Steering, and Instruction Following show consistent improvements over baselines.
- SAE features exhibit cross-model transferability, indicating a shared feature space across model families (LLaMA, Mistral, Qwen).
- Two-step synthesis with contrastive prompts yields higher FAC than one-step generation, improving reliability of feature activation.
- Hyper-parameter analysis indicates that intermediate decoding temperatures and feature thresholds yield the best performance; data efficiency is quantified by the DES metric.
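For readers who want to run the same kind of correlation check between a coverage score and downstream performance on their own data, the following sketch computes Pearson r and Spearman ρ with NumPy alone. The numbers are placeholders invented for illustration and are not taken from the paper; the rank helper assumes there are no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def ranks(x):
    """Ranks 1..n in ascending order; assumes no ties in x."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Placeholder values, NOT results from the paper.
fac_scores = [0.2, 0.4, 0.6, 0.8]
task_scores = [0.50, 0.55, 0.70, 0.68]

r = pearson(fac_scores, task_scores)
rho = spearman(fac_scores, task_scores)
```

Reporting both coefficients, as the review does, separates a linear relationship (Pearson) from a merely monotonic one (Spearman).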
This review was generated by AI and checked by a human editor.