QUICK REVIEW

[論文レビュー] Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Veniamin Veselovsky, Manoel Horta Ribeiro|arXiv (Cornell University)|May 24, 2023

Topic Modeling被引用数 13

ひとこと要約

この研究は prompting 戦略— grounding、 taxonomy-based generation、および filtering—を検証し、大規模言語モデルからの合成データをサルカズム検出においてより忠実にすることを目指し、ベースラインおよび実データと比較します。

ABSTRACT

Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.

研究の動機と目的

NLP における忠実な合成データの必要性を LLMs で喚起する。
忠実度を高めるための prompting 戦略（grounding、taxonomy-based generation）を提案する。
合成データで分類器を訓練し、皮肉検出の実データで評価して戦略を評価する。
grounding の効果を他の戦略と比較し、実用的な推奨を提供する。

提案手法

SemEval-2022 Task 6 の皮肉データを訓練・評価に用いる。
複数の prompting 戦略の下で ChatGPT を用いて合成データを生成する。
合成データで E5-base モデルを微調整し、実データのテストセットで評価する。
合成データをフィルタリングする識別器を訓練し、真の実データと比較する。
生成データと実データの分類器を比較して信憑性を評価する。

Figure 1: Depiction of the proposed strategies to increase the faithfulness of synthetically generated data. On the left-hand side, we depict different prompting strategies: asking an LLM to generate synthetic data with a simple prompt ( Simple) ; grounding the synthetic data generation with real-wo

実験結果

リサーチクエスチョン

RQ1 grounding、 taxonomy-based generation、および filtering は LLM 生成の合成データを皮肉検出の忠実度向上に寄与するか？
RQ2これらの戦略は下流の性能を用いた簡易 prompting および zero-shot ChatGPT とどのように比較されるか？
RQ3これらの戦略が、実データで評価した場合の分類器の macro-F1 および信憑性に与える影響は？
RQ4topics の多様性と関心の対象となる構成を捉えるのに、grounding が最も効果的な戦略か？

主な発見

Prompting Strategy	Sarcasm	Accuracy	Macro-F1	Believability
シンプル	0.71	0.71	0.48	0.04
グラウンディング	0.67	0.67	0.55	0.13
グラウンディング（書き換え）	0.70	0.70	0.55	0.15
グラウンディング＋タクソノミー	0.67	0.67	0.51	0.20
グラウンディング＋フィルタリング	0.27	0.27	0.26	0.56
真の注釈データ	0.72	0.72	0.60	0.95
すべて非皮肉	0.77	0.77	0.43	—
ゼロショット ChatGPT	0.60	0.60	0.59	—

Grounding ベースの prompting が、synthetic-data で訓練した分類器の macro-F1 で最も高く (0.55)。
単純な prompting はトピックと構成の多様性が限られるため最悪のパフォーマンス（macro-F1 0.48）。
Grounding + Taxonomy は macro-F1 0.51 となり、grounding のみと単純 prompting の間。
Grounding + Filtering は判別器の制約によりパフォーマンスが低下し macro-F1 0.26。
ゼロショット ChatGPT の macro-F1 は 0.60 で、より小規模な prompting で合成データを訓練したモデルを上回る。
信憑性は手法により異なり、Grounding + Filtering が最高の信憑性 (0.56) を達成する一方、Groundtruth annotations は 0.95、Simple は低いまま (0.04)。

Figure 2: Our prompting approach consists of four modular steps. (1) Initiate the model to generate an initial set of 10 data points. (2) Apply a grounding technique as the model generates these 10 data points. (3) Further augment the grounding process by providing the model with an initial taxonomy

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。