QUICK REVIEW

[Paper Review] Evaluating the Feasibility and Accuracy of Large Language Models for Medical History-Taking in Obstetrics and Gynecology

Dandan Liu, Ying Long|ArXiv.org|Mar 31, 2025
Artificial Intelligence in Healthcare and Education | Cited by 3
One-Sentence Summary

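This study evaluates ChatGPT-4o and ChatGPT-4o-mini on automated infertility history-taking and finds that 4o-mini collects markedly more complete histories (97.58% vs. 77.11%), while the differences on the other metrics are modest.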

ABSTRACT

Effective physician-patient communications in pre-diagnostic environments, and most specifically in complex and sensitive medical areas such as infertility, are critical but consume a lot of time and, therefore, cause clinic workflows to become inefficient. Recent advancements in Large Language Models (LLMs) offer a potential solution for automating conversational medical history-taking and improving diagnostic accuracy. This study evaluates the feasibility and performance of LLMs in those tasks for infertility cases. An AI-driven conversational system was developed to simulate physician-patient interactions with ChatGPT-4o and ChatGPT-4o-mini. A total of 70 real-world infertility cases were processed, generating 420 diagnostic histories. Model performance was assessed using F1 score, Differential Diagnosis (DDs) Accuracy, and Accuracy of Infertility Type Judgment (ITJ). ChatGPT-4o-mini outperformed ChatGPT-4o in information extraction accuracy (F1 score: 0.9258 vs. 0.9029, p = 0.045, d = 0.244) and demonstrated higher completeness in medical history-taking (97.58% vs. 77.11%), suggesting that ChatGPT-4o-mini is more effective in extracting detailed patient information, which is critical for improving diagnostic accuracy. In contrast, ChatGPT-4o performed slightly better in differential diagnosis accuracy (2.0524 vs. 2.0048, p > 0.05). ITJ accuracy was higher in ChatGPT-4o-mini (0.6476 vs. 0.5905) but with lower consistency (Cronbach's $α$ = 0.562), suggesting variability in classification reliability. Both models demonstrated strong feasibility in automating infertility history-taking, with ChatGPT-4o-mini excelling in completeness and extraction accuracy. In future studies, expert validation for accuracy and dependability in a clinical setting, AI model fine-tuning, and larger datasets with a mix of cases of infertility have to be prioritized.

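Research Motivation and Objectives

  • Assess the feasibility of using large language models to automate infertility history-taking in obstetrics and gynecology.
  • Compare ChatGPT-4o and ChatGPT-4o-mini on information extraction and diagnostic support.
  • Evaluate the completeness of history-taking and the reliability of differential diagnosis and infertility type judgment.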

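Proposed Method

  • Develop an AI-driven conversational system that simulates physician-patient interactions.
  • Process 70 real-world infertility cases to generate 420 diagnostic histories.
  • Assess performance using the F1 score for information extraction, Differential Diagnosis (DDs) accuracy, and Infertility Type Judgment (ITJ) accuracy.
  • Compare ChatGPT-4o and ChatGPT-4o-mini on extraction, completeness, and the diagnostic metrics.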
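
The summary does not say how the extraction F1 score was computed. As a minimal sketch, assuming each history is reduced to a set of `field:value` items and compared against a reference set, F1 is the harmonic mean of precision and recall over those items. The function name and the example fields below are hypothetical, chosen only for illustration.

```python
def f1_score(extracted: set[str], reference: set[str]) -> float:
    """Set-based F1 between extracted items and reference items.

    Assumption: items match only when the strings are identical.
    The paper's actual matching rule is not described in this summary.
    """
    if not extracted and not reference:
        return 1.0  # nothing to extract and nothing extracted
    true_positives = len(extracted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # share of extracted items that are correct
    recall = true_positives / len(reference)     # share of reference items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical items for a single case (not data from the study).
reference = {"age:32", "infertility_duration:2y", "prior_pregnancy:none", "cycle:irregular"}
extracted = {"age:32", "infertility_duration:2y", "cycle:irregular", "smoking:no"}

print(f1_score(extracted, reference))  # 3 of 4 items match on each side
```

In this toy case precision and recall are both 3/4, so F1 is 0.75. A per-model score such as 0.9258 would then be an average of per-history values, again under the assumption that the study aggregated this way.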

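Experimental Results

Research Questions

  • RQ1: Can an LLM-based system automatically generate accurate and complete infertility histories?
  • RQ2: How do ChatGPT-4o and ChatGPT-4o-mini compare on information extraction, DDs accuracy, and ITJ accuracy in infertility cases?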

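Key Findings

  • ChatGPT-4o-mini achieved higher information extraction accuracy than ChatGPT-4o (F1 0.9258 vs. 0.9029, p = 0.045, d = 0.244).
  • ChatGPT-4o-mini produced more complete histories than ChatGPT-4o (97.58% vs. 77.11%).
  • ChatGPT-4o scored slightly higher than ChatGPT-4o-mini on differential diagnosis accuracy (2.0524 vs. 2.0048), but the difference was not statistically significant (p > 0.05).
  • ITJ accuracy was higher for ChatGPT-4o-mini (0.6476) than for ChatGPT-4o (0.5905), but with lower consistency (Cronbach's α = 0.562).
  • Both models showed strong feasibility for automating infertility history-taking, with 4o-mini standing out in completeness and extraction; clinical validation and larger datasets are still needed.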
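
The consistency figure above is a Cronbach's α. The summary does not describe how it was computed, so the following is only a sketch under a stated assumption: each case is scored on several repeated runs of the same model, and the runs are treated as the "items" in the standard formula α = k/(k-1) · (1 - Σ item variances / variance of the totals). Note that 420 histories over 70 cases gives six per case; an even split of three runs per model is an assumption here, as are the function name and the toy scores.

```python
from statistics import variance


def cronbach_alpha(scores: list[list[float]]) -> float:
    """Cronbach's alpha for a cases-by-runs score table.

    scores[i][j] is the score of case i on repeated run j.
    Assumption: runs play the role of "items"; sample variances (n - 1) are used.
    """
    n_cases = len(scores)
    k = len(scores[0])
    if n_cases < 2 or k < 2:
        raise ValueError("need at least two cases and two runs")
    # Variance of each run (column) across cases.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the per-case totals across cases.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("alpha is undefined when all case totals are equal")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical binary ITJ correctness for 4 cases x 3 runs (not study data).
toy_scores = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]

print(round(cronbach_alpha(toy_scores), 4))  # 0.6
```

With these toy scores each run has variance 1/3 and the case totals have variance 5/3, giving α = 1.5 × (1 - 0.6) = 0.6. Values in this range are commonly read as only moderate agreement across runs, which matches the summary's note about variable classification reliability.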

This review was generated by AI and reviewed by human editors.