QUICK REVIEW

[Paper Review] Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

Yaping Chai, Haoran Xie|ArXiv.org|Jan 31, 2025

Topic Modeling | Cited by 5

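One-Sentence Summary

A comprehensive survey of text data augmentation for large language models that groups methods into simple, prompt-based, retrieval-based, and hybrid approaches, and discusses granularity, post-processing, evaluation, and open challenges.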

ABSTRACT

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation by introducing external knowledge to enable them to produce more grounded-truth data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contributes significantly to refining the augmented data and enabling the model to filter out unfaithful content. Then, we provide the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.

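Motivation and Objectives

Explain why large language models need data augmentation and how data quality and scarcity affect performance.
Systematically classify augmentation techniques for LLMs into four categories: simple, prompt-based, retrieval-based, and hybrid.
Discuss the aspects of data augmentation (generation, rewriting, translation, annotation, retrieval, editing, and others) and its granularity (from token level to document level).
Highlight post-processing, evaluation metrics, and practical challenges to guide future research and applications.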

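Proposed Method

Classify augmentation techniques into four categories that reflect the complexity of the prompts and of the retrieval model.
Summarize representative methods in each category, focusing on generation, rewriting, translation, annotation, retrieval, and editing.
Describe augmentation granularity from token level to document level and its effect on data diversity and fidelity.
Introduce post-processing methods that improve the quality of augmented data and reduce unfaithful content.
Outline the common tasks and evaluation metrics used to assess augmentation.
Identify challenges and opportunities to guide future research.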
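
To make the token-level end of the granularity range concrete, here is a minimal sketch of two classic token-level edits, random deletion and random swap. This review does not spell out which operations the survey files under Simple Augmentation, so treating these two edits as representative is an assumption, and the function names `random_delete`, `random_swap`, and `augment` are hypothetical.

```python
import random


def random_delete(tokens, p, rng):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    # Avoid returning an empty sentence when every token was dropped.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def augment(sentence, n_variants=3, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_variants noisy copies of a whitespace-tokenized sentence."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    tokens = sentence.split()
    variants = []
    for _ in range(n_variants):
        edited = random_delete(tokens, p_delete, rng)
        edited = random_swap(edited, n_swaps, rng)
        variants.append(" ".join(edited))
    return variants


if __name__ == "__main__":
    for variant in augment("large language models need diverse training data"):
        print(variant)
```

Edits like these are cheap and add surface diversity, but they can break grammar or meaning, which illustrates the diversity-versus-fidelity trade-off mentioned above.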

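Experimental Results

Research Questions

RQ1: What are the main categories of text data augmentation for large language models, and how do they differ in methodology and capability?
RQ2: How do the aspects of data augmentation (generation, rewriting, translation, annotation, retrieval, editing) and its granularity levels affect the quality of augmented data and model performance?
RQ3: In LLM settings, which post-processing and evaluation practices are effective for augmented data?
RQ4: What are the current challenges and promising opportunities in text data augmentation for LLMs?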

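Key Findings

Four main augmentation categories are identified: simple augmentation, prompt-based augmentation, retrieval-based augmentation, and hybrid augmentation.
Data augmentation spans multiple aspects (generation, rewriting, translation, annotation, retrieval, editing) and granularity levels (from token to document).
Prompt engineering and retrieval-based techniques together improve data diversity and grounding, while post-processing helps mitigate hallucinations and unfaithful content.
Persistent challenges remain in data quality, factual grounding, and the need for up-to-date external knowledge sources, and several future directions are proposed.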
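
As a rough illustration of what a post-processing step might look like, the sketch below deduplicates candidates and keeps only those whose token overlap with the source text falls inside a band. The Jaccard heuristic, the thresholds `low` and `high`, and the function names are illustrative assumptions and are not taken from the survey; real pipelines may rely on model-based consistency checks instead.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def filter_augmented(source, candidates, low=0.2, high=0.9):
    """Drop empty, duplicate, near-copy, and off-topic candidates.

    A candidate is kept when low <= jaccard(source, candidate) <= high:
    above `high` it adds little diversity, below `low` it has likely
    drifted away from the source content.
    """
    seen = set()
    kept = []
    for cand in candidates:
        key = " ".join(cand.lower().split())  # normalize case and spacing
        if not key or key in seen:
            continue
        seen.add(key)
        if low <= jaccard(source, cand) <= high:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    source = "the cat sat on the mat"
    candidates = [
        "the cat sat on the mat",           # exact copy
        "a cat sat on the mat",             # close paraphrase
        "A cat sat on  the mat",            # duplicate after normalization
        "stock prices fell sharply today",  # unrelated
    ]
    print(filter_augmented(source, candidates))
```

Lexical overlap is only a proxy: it cannot detect a fluent but factually wrong rewrite, which is where the grounding challenges noted above come in.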


This review was generated by AI and checked by a human editor.