QUICK REVIEW

[Paper Review] Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

Seyed Amir Ahmad Safavi‐Naini, Shuhaib Ali|arXiv (Cornell University)|Aug 25, 2024

Machine Learning in Healthcare|Cited by 5

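One-Sentence Summary

The study evaluates the medical reasoning of LLMs and VLMs on gastroenterology board-style questions, comparing proprietary, open-source, and quantized models, with and without images, and across prompting strategies.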

ABSTRACT

Background and Aims: This study evaluates the medical reasoning performance of large language models (LLMs) and vision language models (VLMs) in gastroenterology. Methods: We used 300 gastroenterology board exam-style multiple-choice questions, 138 of which contain images, to systematically assess the impact of model configurations and parameters and prompt engineering strategies utilizing GPT-3.5. Next, we assessed the performance of proprietary and open-source LLMs (versions), including GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3), across different interfaces (web and API), computing environments (cloud and local), and model precisions (with and without quantization). Finally, we assessed accuracy using a semiautomated pipeline. Results: Among the proprietary models, GPT-4o (73.7%) and Claude3.5-Sonnet (74.0%) achieved the highest accuracy, outperforming the top open-source models: Llama3.1-405b (64%), Llama3.1-70b (58.3%), and Mixtral-8x7b (54.3%). Among the quantized open-source models, the 6-bit quantized Phi3-14b (48.7%) performed best. The scores of the quantized models were comparable to those of the full-precision models Llama2-7b, Llama2-13b, and Gemma2-9b. Notably, VLM performance on image-containing questions did not improve when the images were provided and worsened when LLM-generated captions were provided. In contrast, a 10% increase in accuracy was observed when images were accompanied by human-crafted image descriptions. Conclusion: While LLMs exhibit robust zero-shot performance in medical reasoning, the integration of visual data remains a challenge for VLMs. Effective deployment involves carefully determining optimal model configurations, encouraging users to consider either the high performance of proprietary models or the flexible adaptability of open-source models.

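Motivation and Objectives

Evaluate the medical reasoning performance of LLMs and VLMs in gastroenterology using board-style questions (300 questions, 138 of them with images).
Systematically compare proprietary, open-source, and quantized models across configurations.
Assess how images and captions affect VLM/LLM performance.
Explore how prompts, interfaces, and computing environments affect model accuracy.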

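Proposed Method

Test model performance on 300 gastroenterology board-style multiple-choice questions, 138 of which contain images.
Evaluate multiple model families: GPT (3.5, 4, 4o, 4omini), Claude (3, 3.5), Gemini (1.0), Mistral, Llama (2, 3, 3.1), Mixtral, and Phi (3).
Test across interfaces (web and API), computing environments (cloud and local), and precisions (quantized and full precision).
Assess accuracy using a semi-automated pipeline.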
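
The summary does not describe how the semi-automated pipeline works, so the following is only a minimal sketch of one plausible design: answers that can be parsed automatically are scored against the key, and anything ambiguous is set aside for a human reviewer. The regular expression, the A-E option range, and the function names `extract_choice` and `score` are all assumptions made for illustration, not the authors' implementation.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern: matches phrasings such as "Answer: C" or
# "the answer is (B)". The option letter itself must be uppercase and
# must not be the first letter of a longer word (e.g. "Crohn's").
ANSWER_PATTERN = re.compile(
    r"(?i:answer)\s*(?i:is|:)\s*\(?([A-E])\)?(?![A-Za-z])"
)


def extract_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if it cannot be parsed."""
    match = ANSWER_PATTERN.search(response)
    if match:
        return match.group(1)
    # Fall back to responses that consist of a single letter, e.g. "d" or "(B)."
    stripped = response.strip().strip("().").upper()
    if len(stripped) == 1 and stripped in "ABCDE":
        return stripped
    return None


def score(responses: List[str], answer_key: List[str]) -> Tuple[float, List[int]]:
    """Score parsable responses and list the indices needing manual review.

    Unparsable responses count as incorrect until a reviewer resolves them.
    """
    correct = 0
    needs_review = []
    for index, (response, key) in enumerate(zip(responses, answer_key)):
        choice = extract_choice(response)
        if choice is None:
            needs_review.append(index)
        elif choice == key:
            correct += 1
    return correct / len(answer_key), needs_review


responses = ["Answer: C", "The answer is (B).", "I am not sure.", "d"]
answer_key = ["C", "A", "E", "D"]
accuracy, needs_review = score(responses, answer_key)
print(f"accuracy={accuracy:.2f}, manual review needed for items {needs_review}")
```

Separating the unparsable items keeps the human effort limited to the responses the automatic step cannot handle, which is the usual reason for making such a pipeline semi-automated rather than fully automated.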

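Experimental Results

Research Questions

RQ1: How do proprietary and open-source LLMs compare in accuracy on gastroenterology questions?
RQ2: How does image content affect VLM/LLM performance, and do captions help?
RQ3: How does model quantization affect performance relative to full-precision models?
RQ4: Which model configurations and prompts maximize medical reasoning accuracy in gastroenterology?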

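Key Findings

Among proprietary models, GPT-4o reached 73.7% accuracy and Claude3.5-Sonnet reached 74.0%.
The top open-source models reached 64% (Llama3.1-405b) and 58.3% (Llama3.1-70b).
The 6-bit quantized Phi3-14b reached 48.7% accuracy, comparable to the full-precision Llama2-7b, Llama2-13b, and Gemma2-9b.
On image-containing questions, VLM performance did not improve when images were provided and worsened with LLM-generated captions; only human-crafted image descriptions raised accuracy, by about 10%.
Overall, LLMs show strong zero-shot medical reasoning, but integrating visual data remains a challenge for VLMs.
The study offers guidance for choosing between high-performing proprietary models and adaptable open-source options.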
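
To make the proprietary versus open-source comparison concrete, the short sketch below computes the gap in percentage points between the best model in each group. The accuracies are the ones reported in the abstract; the dictionary layout and the helper names `best` and `gap_points` are illustrative assumptions.

```python
# Accuracies (%) exactly as reported in the abstract.
reported = {
    "proprietary": {"GPT-4o": 73.7, "Claude3.5-Sonnet": 74.0},
    "open_source": {
        "Llama3.1-405b": 64.0,
        "Llama3.1-70b": 58.3,
        "Mixtral-8x7b": 54.3,
    },
}


def best(models):
    """Return (name, accuracy) for the highest-scoring model in a group."""
    name = max(models, key=models.get)
    return name, models[name]


def gap_points(higher, lower):
    """Difference in percentage points, rounded to one decimal place."""
    return round(higher - lower, 1)


best_proprietary = best(reported["proprietary"])
best_open = best(reported["open_source"])
gap = gap_points(best_proprietary[1], best_open[1])

print(f"{best_proprietary[0]} leads {best_open[0]} by {gap} percentage points")
```

Under these figures the best proprietary model leads the best open-source model by 10.0 percentage points.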

This review was generated by AI and reviewed by a human editor.