QUICK REVIEW

[Paper Review] Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Yue Yu, Yuchen Zhuang|arXiv (Cornell University)|2023. 06. 28.

Topic Modeling | Citations: 72

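One-Line Summary

The paper proposes AttrPrompt, which generates training data by querying an LLM with diversely attributed prompts. Compared with simple class-conditional prompts, it yields better downstream model performance and less bias on high-cardinality classification tasks, and it is more data-efficient.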

ABSTRACT

Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing vital aspects like bias, diversity, and efficiency, and highlight three key observations: firstly, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; secondly, attribute diversity plays a pivotal role in enhancing model performance; lastly, attributed prompts achieve the performance of simple class-conditional prompts while utilizing only 5% of the querying cost of ChatGPT associated with the latter. The data and code are available on https://github.com/yueyu1030/AttrPrompt.

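Research Motivation and Objectives

Investigate whether attributed prompts (e.g., specifying length, location, and style) yield more diverse and informative generated data than simple class-conditional prompts (SimPrompt).
Quantify the bias and diversity of LLM-generated data on high-cardinality, multi-domain classification tasks.
Evaluate data efficiency (cost) and compatibility across model sizes and across LLM-as-data-generator approaches.
Provide a semi-automated workflow for identifying attribute dimensions and filtering attribute values to reduce ambiguity.
Release the generated data and prompts to facilitate future research.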

Method

Identify attribute dimensions and values per dataset using interactive human-AI collaboration with ChatGPT.
Generate attributed prompts (AttrPrompt) by randomly combining attribute dimensions/values to create diverse prompts.
Compare AttrPrompt against SimPrompt and Gold data using standard fine-tuning of BERT-family classifiers.
Apply Class-Dependent Attribute Value Filtering (CAF) to prevent ambiguity for class-dependent attributes.
Evaluate lexical and structural diversity using metrics like vocabulary size, cosine similarity, APS, and INGF.
Assess data efficiency by comparing performance versus query cost across multiple LLMs and model sizes.
Demonstrate plug-in compatibility by integrating AttrPrompt with other data-generation methods.
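
The prompt-construction and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the attribute table, the prompt template, and the function names `filter_ambiguous` and `build_prompt` are all hypothetical. The filter is a deliberately simplified stand-in for CAF; it only drops values proposed for more than one class.

```python
import random

# Hypothetical attribute table for a news-topic classifier; none of these
# values come from the paper.
CLASS_INDEPENDENT = {
    "length": ["30-80 words", "100-150 words"],
    "style": ["news report", "opinion piece", "investigative feature"],
    "location": ["Asia", "Europe", "Africa", "North America", "South America"],
}
CLASS_DEPENDENT = {  # subtopics proposed per class
    "sports": ["match recap", "transfer rumors", "team finances"],
    "business": ["quarterly earnings", "mergers", "team finances"],
}


def filter_ambiguous(values_by_class):
    """Simplified stand-in for CAF: drop values proposed for more than one class."""
    counts = {}
    for values in values_by_class.values():
        for value in set(values):
            counts[value] = counts.get(value, 0) + 1
    return {
        label: [v for v in values if counts[v] == 1]
        for label, values in values_by_class.items()
    }


def build_prompt(label, subtopics, rng):
    """Sample one value per attribute dimension and render a prompt string."""
    picks = {dim: rng.choice(vals) for dim, vals in CLASS_INDEPENDENT.items()}
    picks["subtopic"] = rng.choice(subtopics[label])
    return (
        f"Write a {picks['style']} in the '{label}' category about "
        f"{picks['subtopic']}, set in {picks['location']}, "
        f"with a length of {picks['length']}."
    )


rng = random.Random(0)  # seeded so repeated runs sample the same prompts
subtopics = filter_ambiguous(CLASS_DEPENDENT)
prompts = [build_prompt("sports", subtopics, rng) for _ in range(3)]
```

Sampling each dimension independently is what gives the prompts their variety; a fixed template with only the class name filled in would correspond to the SimPrompt baseline.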

Experimental Results

Research Questions

RQ1 Does using diversely attributed prompts improve downstream model performance compared to simple class-conditional prompts on high-cardinality tasks?
RQ2 How do attribute diversity and CAF affect data bias, diversity, and model performance?
RQ3 What is the cost and data efficiency trade-off when using AttrPrompt versus SimPrompt across different datasets and models?
RQ4 Can AttrPrompt enhance existing data-generation approaches and benefit multi-label classification?
RQ5 How does attribute diversity influence performance in low-data versus high-data regimes?

Key Findings

AttrPrompt consistently outperforms SimPrompt by a margin of approximately 6–10 points on several datasets.
AttrPrompt achieves similar performance to SimPrompt at only about 5% of the ChatGPT querying cost.
Attribute diversity is crucial; one-fixed-others-random configurations underperform random configurations, and selecting individually best attributes hurts performance.
AttrPrompt yields more balanced attribute distributions than Gold or SimPrompt in location attributes for NYT, reducing regional bias.
Diversity metrics show that AttrPrompt data is lexically more diverse than SimPrompt data and closer to Gold, though both generated datasets remain less diverse than Gold.
AttrPrompt improves performance when merged with original training data and serves as a valuable plug-in for other data-generation methods.
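
Two of the diversity measures mentioned in this review, vocabulary size and pairwise cosine similarity, can be approximated with a short sketch like the one below. It makes several assumptions. APS is read here as the average pairwise similarity between samples, although the review does not expand the acronym. Bag-of-words count vectors are swapped in for whatever text representation the paper uses. The whitespace tokenizer and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption, not the paper's)."""
    return text.lower().split()


def vocabulary_size(texts):
    """Number of distinct tokens across the whole dataset."""
    return len({token for text in texts for token in tokenize(text)})


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(texts):
    """Mean cosine similarity over all unordered pairs; lower means more diverse."""
    vectors = [Counter(tokenize(text)) for text in texts]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Under these definitions, a more diverse dataset shows a larger vocabulary and a lower average pairwise similarity, which is the direction of the comparison reported above.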

This review was generated by AI and reviewed by a human editor.