QUICK REVIEW

[Paper Review] Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

Yaping Chai, Haoran Xie|ArXiv.org|2025. 01. 31.

Topic Modeling | Citations: 5

One-line summary

The survey classifies text data augmentation for LLMs into Simple, Prompt-based, Retrieval-based, and Hybrid approaches, and discusses granularity, post-processing, evaluation, and open challenges.

ABSTRACT

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation by introducing external knowledge to enable them to produce more grounded-truth data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contributes significantly to refining the augmented data and enabling the model to filter out unfaithful content. Then, we provide the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.

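Research motivation and objectives

Explains why large language models need data augmentation and how data quality and scarcity affect performance.
Systematically classifies the augmentation techniques used with LLMs into four categories: Simple, Prompt-based, Retrieval-based, and Hybrid.
Discusses the aspects of data augmentation (generation, paraphrasing, translation, labeling, retrieval, and so on) and its granularity, from token level to document level.
Highlights post-processing, evaluation metrics, and practical challenges to guide future research and applications.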

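Proposed method

Classifies augmentation techniques into four categories that reflect prompt complexity and retrieval-model complexity.
Summarizes representative methods in each category, with attention to generation, paraphrasing, translation, labeling, retrieval, and editing.
Describes the granularity levels of data augmentation, from token level to document level, and how they affect data diversity and fidelity.
Presents the post-processing approaches used to refine augmented data and reduce inaccurate content.
Outlines the common tasks and evaluation metrics used to assess augmentation effectiveness.
Identifies challenges and opportunities to inform future research directions.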
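
To make the token-level end of the granularity range concrete, here is a minimal sketch of two perturbations often grouped under simple augmentation: random swap and random deletion. The summary above does not say which operations the survey places in its Simple category, so the choice of operations, the function names (`random_swap`, `random_deletion`, `augment`), and the default parameters are all assumptions made for illustration.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random position pairs exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_variants=3, p_delete=0.1, n_swaps=1, seed=0):
    """Produce n_variants perturbed copies of a whitespace-tokenised sentence.

    Hypothetical helper: the defaults are illustrative, not taken from the survey.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    tokens = sentence.split()
    variants = []
    for _ in range(n_variants):
        perturbed = random_swap(tokens, n_swaps, rng)
        perturbed = random_deletion(perturbed, p_delete, rng)
        variants.append(" ".join(perturbed))
    return variants
```

Calling `augment("the model overfits on small data")` returns three variants built only from the original tokens. Perturbations like these are cheap and add surface diversity, but they can damage meaning, which is one reason fidelity is discussed alongside diversity.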

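Experimental results

Research questions

RQ1: What are the main categories of text data augmentation for LLMs, and how do they differ in methodology and capability?
RQ2: How do the aspects of data augmentation (generation, paraphrasing, translation, labeling, retrieval, editing) and the granularity level affect augmented-data quality and model performance?
RQ3: Which post-processing and evaluation methods are effective for augmented data in the LLM context?
RQ4: What are the current challenges and promising opportunities in text data augmentation for LLMs?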

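Key findings

Four main augmentation categories are identified: Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation, and Hybrid Augmentation.
Data augmentation spans several aspects (generation, paraphrasing, translation, labeling, retrieval, editing) and granularity levels (from token to document).
Prompt engineering and retrieval-based augmentation together improve data diversity and grounding, while post-processing helps mitigate hallucinations and unfaithful content.
Persistent challenges remain around data quality, factual grounding, and the need for up-to-date external knowledge sources, and several future directions are proposed.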
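
The interplay between retrieval-based grounding and post-processing can be sketched as a small pipeline: retrieve supporting passages, build a prompt from them, then filter the generated candidates. This is only an illustration under stated assumptions. The bag-of-words cosine similarity, the prompt wording, the thresholds `low` and `high`, and every function name are hypothetical choices and do not come from the survey. No LLM is called: `candidates` stands in for whatever a model would return.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words counts for a whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = bow(a), bow(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus passages most similar to the query."""
    ranked = sorted(corpus, key=lambda doc: cosine(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(seed_text, passages):
    """Assemble a retrieval-grounded paraphrasing prompt (illustrative wording)."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Use only the facts below.\n"
        f"{context}\n"
        f"Paraphrase this sentence: {seed_text}"
    )


def post_process(seed_text, candidates, low=0.2, high=0.95):
    """Drop duplicates, near-copies of the seed, and off-topic candidates.

    Assumed heuristic: keep a candidate only if low <= similarity < high.
    """
    kept, seen = [], set()
    for cand in candidates:
        key = " ".join(cand.lower().split())  # normalise case and spacing
        sim = cosine(seed_text, cand)
        if key in seen or not (low <= sim < high):
            continue
        seen.add(key)
        kept.append(cand)
    return kept
```

In practice the lexical similarity would likely be replaced by an embedding model or a fact-checking step, because token overlap cannot tell a faithful paraphrase from a fluent but incorrect one.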

This review was generated by AI and checked by a human editor.