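[Paper Digest] Improving Training Efficiency and Reducing Maintenance Costs via Language Specific Model Merging
This paper presents language-specific model merging as an efficient alternative to retraining multilingual large language models, showing comparable quality across multiple tasks and datasets while significantly reducing training time and maintenance costs.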
Fine-tuning a task-specific multilingual large language model (LLM) involves training the model on a multilingual dataset with examples in all the required languages. Updating one or more supported languages with additional data or adding support for a new language involves retraining the model, which can be computationally inefficient and creates a severe maintenance bottleneck. Recent research on merging multilingual multitask models has shown promise in terms of improved quality, but its computational and maintenance efficiency remains unstudied. In this work, we provide the first focused analysis of this merging strategy from an efficiency perspective, evaluating it across three independent tasks. We demonstrate significant efficiency gains while maintaining parity in terms of quality: this merging approach reduces the initial training time by up to 50%. We also demonstrate that updating an individual language and re-merging as part of model maintenance reduces training costs by more than 60%, compared to re-training the full multilingual model. We show this on both public and proprietary industry datasets, confirming that the approach works well for industrial use cases in addition to academic settings already studied in previous work.
Research Motivation and Objectives
- Highlight the high cost and maintenance bottlenecks of fine-tuning multilingual LLMs in enterprise settings.
- Propose and evaluate language-specific model merging as a more efficient alternative to retraining on a combined multilingual dataset.
- Quantify training time and cost savings across multiple tasks and languages.
- Assess robustness across public and proprietary datasets to validate industrial applicability.
Proposed Method
- Train language-specific LoRA adapters and merge them into a single multilingual model using three merging techniques (TIES, DARE, KnOTS).
- Fine-tune the base Llama-3.1-8B-Instruct model with LoRA on five languages for three tasks (Summarization, Commonsense Reasoning, Sentiment) and compare against the combined multilingual (COMB) and individual-language (INDV) baselines.
- Experiment with hyperparameters (weighting, density) to generate eight merged models per task.
- Evaluate with task-specific metrics (ROUGE-1, ROUGE-L, BERTScore for summarization; accuracy for reasoning; macro F1, precision, recall for sentiment).
- Compare training time and cost between the traditional retrain-all approach and the train-once, merge-as-needed approach, including maintenance scenarios where only language adapters are updated.
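The merging step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it treats each language adapter as a flat list of weight deltas relative to the base model, shows a DARE-style drop-and-rescale step and a TIES-style trim, sign-election, and agreeing-mean merge, and omits KnOTS entirely. The function names and the toy values are hypothetical; `density` and `weight` are assumed to correspond to the hyperparameters mentioned in the list.

```python
import random


def dare_drop_and_rescale(delta, drop_rate, rng):
    """DARE-style step: zero each entry with probability drop_rate,
    then rescale survivors by 1 / (1 - drop_rate)."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_rate)
    return [d * scale if rng.random() >= drop_rate else 0.0 for d in delta]


def ties_trim(delta, density):
    """TIES trim step: keep roughly the top `density` fraction of entries
    by magnitude (ties at the threshold are all kept), zero the rest."""
    k = max(0, min(len(delta), int(density * len(delta))))
    if k == 0:
        return [0.0] * len(delta)
    threshold = sorted(abs(d) for d in delta)[-k]
    return [d if abs(d) >= threshold else 0.0 for d in delta]


def ties_merge(deltas, density=0.5, weight=1.0):
    """TIES-style merge of several delta vectors of equal length."""
    trimmed = [ties_trim(d, density) for d in deltas]
    merged = []
    for values in zip(*trimmed):
        total = sum(values)  # the sign of the sum is the elected sign
        if total == 0:
            merged.append(0.0)
            continue
        # Average only the entries whose sign agrees with the elected sign.
        agreeing = [v for v in values if v * total > 0]
        merged.append(weight * sum(agreeing) / len(agreeing))
    return merged


# Toy example: two hypothetical "language" deltas over four parameters.
delta_en = [1.0, -2.0, 0.1, 3.0]
delta_de = [2.0, 1.0, -0.2, -0.5]
merged_delta = ties_merge([delta_en, delta_de], density=0.5, weight=1.0)

base = [0.0, 0.0, 0.0, 0.0]  # stand-in for the base model weights
merged_weights = [b + m for b, m in zip(base, merged_delta)]
```

In this toy setting, updating one language only requires recomputing that language's delta and calling `ties_merge` again, which mirrors the train-once, merge-as-needed workflow in spirit.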

Experimental Results
Research Questions
- RQ1: Does language-specific model merging achieve parity with or gains over the retrain-all multilingual baseline in terms of task performance?
- RQ2: What are the relative training time and cost savings when using language-specific merging versus retraining on the combined multilingual dataset?
- RQ3: How do merging techniques perform across different tasks (summarization, reasoning, sentiment) and languages (EN, DE, FR, JA, ZH)?
- RQ4: What is the impact of updating a single language adapter on overall merged model performance and maintenance efficiency?
- RQ5: Do results generalize to smaller models and proprietary datasets?
Key Findings
| Phase | Model | Training Time | Training Cost |
|---|---|---|---|
| Initial Setup | Combined Model | 3.4h | $113.4 |
| Initial Setup | Merged Model | 2.2h (35.3% down) | $107.1 (5.6% down) |
| Update/Add Language | Combined Model | 3.8h | $119.7 |
| Update/Add Language | Merged Model | 1.0h (73.7% down) | $31.5 (73.7% down) |
| Case Study Initial Setup | Combined Model (Case Study) | 45h | $1416 |
| Case Study Initial Setup | Merged Model (Case Study) | 22.5h (50% down) | $1400 (1.1% down) |
| Case Study Update/Add Language | Combined Model (Case Study) | 54.5h | $1717 |
| Case Study Update/Add Language | Merged Model (Case Study) | 20.5h (62.4% down) | $645 (62.4% down) |
- Merged models attain comparable performance to the combined training baseline across tasks, with some languages showing improvements (notably in summarization and reasoning).
- Initial-setup training time drops by 35.3% in the main experiments and by 50% in the case study. When a single language is updated and re-merged, training time drops by 73.7% and 62.4% respectively, compared with retraining the full multilingual model.
- For summarization, several merged configurations (e.g., TIES-KnOTS, DARE-TIES-KnOTS) outperform baselines on English, Japanese, and Chinese; BERTScore gains range from 0.1 to 0.6 percentage points.
- For reasoning, merged models are typically on par with baselines, with occasional accuracy improvements of up to ~2.2 percentage points; for German and French, the baselines are sometimes stronger.
- For sentiment analysis, the combined model often performs best, with some merged configurations still outperforming individual-language baselines in certain languages.
- Ablation studies show that updating a single language adapter (e.g., EN) can improve overall merged performance and propagate gains to other languages; model size experiments indicate merging is viable across 8B and 3B LLMs, with some performance variation by size.
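The percentage reductions reported in the table can be re-derived from its raw time and cost figures. The short check below uses only numbers from the table; the helper name `percent_reduction` and the dictionary keys are illustrative.

```python
def percent_reduction(baseline, new):
    """Relative reduction from baseline to new, in percent, to one decimal."""
    return round(100.0 * (baseline - new) / baseline, 1)


# (combined model, merged model) pairs taken from the table above.
table = {
    "initial_time_h": (3.4, 2.2),
    "initial_cost_usd": (113.4, 107.1),
    "update_time_h": (3.8, 1.0),
    "update_cost_usd": (119.7, 31.5),
    "case_initial_time_h": (45.0, 22.5),
    "case_initial_cost_usd": (1416.0, 1400.0),
    "case_update_time_h": (54.5, 20.5),
    "case_update_cost_usd": (1717.0, 645.0),
}

reductions = {name: percent_reduction(c, m) for name, (c, m) in table.items()}
for name, value in reductions.items():
    print(f"{name}: {value}% down")
```

The recomputed values agree with the percentages printed in the table, including the gap between the large time savings and the small cost savings at initial setup.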

This digest was generated by AI and reviewed by human editors.