QUICK REVIEW

[Paper Review] Utilizing Large Language Models to Generate Synthetic Data to Increase the Performance of BERT-Based Neural Networks

Chancellor Woolsey, P. S. Bisht|PubMed|May 8, 2024

Topic Modeling | References: 8 | Cited by: 6

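One-Sentence Summary

The paper studies whether synthetic data generated by large language models can strengthen a BERT-based classifier of autism-related behaviors, assessing both the quality of the data and its effect on model performance. Augmentation raised recall but lowered precision, and a clinician judged about 83% of the sampled example-label pairs to be correct.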

ABSTRACT

An important problem impacting healthcare is the lack of available experts. Machine learning (ML) models may help resolve this by aiding in screening and diagnosing patients. However, creating large, representative datasets to train models is expensive. We evaluated large language models (LLMs) for data creation. Using Autism Spectrum Disorders (ASD), we prompted GPT-3.5 and GPT-4 to generate 4,200 synthetic examples of behaviors to augment existing medical observations. Our goal is to label behaviors corresponding to autism criteria and improve model accuracy with synthetic training data. We used a BERT classifier pretrained on biomedical literature to assess differences in performance between models. A random sample (N=140) from the LLM-generated data was also evaluated by a clinician and found to contain 83% correct behavioral example-label pairs. Augmenting the dataset increased recall by 13% but decreased precision by 16%. Future work will investigate how different synthetic data characteristics affect ML outcomes.

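Motivation and Objectives

Motivation: use synthetic data in machine learning models to address the scarcity of expert-annotated medical data.
Assess whether LLM-generated behavioral observations, labeled with autism criteria, can supplement the training data.
Assess how the synthetic data affects a BERT classifier pretrained on biomedical literature.
Provide a clinician-backed quality check on a sample of synthetic example-label pairs to gauge how realistic they are.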

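Proposed Method

Prompt LLMs (GPT-3.5 and GPT-4) to create 4,200 synthetic observations labeled with autism criteria.
Use a BERT classifier pretrained on biomedical literature to assess the performance differences associated with data augmentation.
Randomly sample 140 synthetic observations for evaluation by a clinician to estimate label accuracy (83% correct).
Measure the change in recall and precision when the synthetic data is added to the training set.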
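The last step above, comparing recall and precision before and after augmentation, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the binary labels, the two prediction lists, and the function name `precision_recall` are all hypothetical, and the toy numbers only mirror the direction of the reported change (recall up, precision down), not its size.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = behavior matches a given autism criterion.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
baseline_pred = [1, 1, 0, 0, 0, 0, 0, 0]   # model trained on real data only
augmented_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # model trained on real + synthetic data

base_p, base_r = precision_recall(y_true, baseline_pred)
aug_p, aug_r = precision_recall(y_true, augmented_pred)

print(f"baseline : precision={base_p:.2f} recall={base_r:.2f}")
print(f"augmented: precision={aug_p:.2f} recall={aug_r:.2f}")
print(f"change   : precision={aug_p - base_p:+.2f} recall={aug_r - base_r:+.2f}")
```

In this toy case the augmented model finds more true positives but also admits a false positive, which is the kind of trade-off the summary describes. The summary does not say whether the reported 13% and 16% are absolute or relative changes, so the sketch makes no claim about that.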

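Experimental Results

Research Questions

RQ1: Can LLM-generated synthetic data improve classifier performance on labeling autism-related behaviors?
RQ2: How accurate are the labels of LLM-generated examples when evaluated by a clinician?
RQ3: How does augmentation with synthetic data affect the key performance metrics (recall, precision) of a BERT-based model?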

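Key Findings

Augmenting the training data with synthetic observations increased recall by 13%.
Augmentation decreased precision by 16%.
A clinician's evaluation of a random sample (N=140) found 83% correct example-label pairs.
How different characteristics of the synthetic data affect ML outcomes is left to future work.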
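Because the 83% figure comes from a sample of only 140 pairs, it helps to see how much uncertainty such an estimate carries. The sketch below computes a 95% Wilson score interval for a proportion. It rests on an assumption that is not stated in the source: that 83% corresponds to 116 correct pairs out of 140 (116/140 rounds to 0.83). The function name `wilson_interval` is likewise illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


N_SAMPLED = 140  # sample size reported in the summary
N_CORRECT = 116  # ASSUMPTION: inferred from "83% of 140", not given in the source

low, high = wilson_interval(N_CORRECT, N_SAMPLED)
print(f"observed accuracy: {N_CORRECT / N_SAMPLED:.3f}")
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under that assumed count, the interval spans roughly a dozen percentage points, so the label accuracy of the full 4,200-example set could plausibly sit somewhat above or below the sampled 83%.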


This review was generated by AI and checked by a human editor.