QUICK REVIEW

[Paper Digest] From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama | arXiv (Cornell University) | Nov 6, 2024

Biomedical and Engineering Education | Cited by 8

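One-Sentence Summary

This paper evaluates OpenAI's o1-preview on medical benchmarks, compares it with Medprompt-augmented GPT-4, and analyzes prompting strategies, reasoning tokens, and the cost-performance trade-offs of run-time strategies on medical tasks.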

ABSTRACT

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.

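Motivation and Objectives

Evaluate o1-preview on a range of medical benchmarks against GPT-4 equipped with Medprompt.
Investigate whether classic Medprompt prompting still helps when the underlying model is reasoning-native.
Analyze how prompting strategies, reasoning-token usage, and ensembling affect performance and cost.
Examine whether a cost-accuracy Pareto frontier exists across run-time strategies.
Discuss the implications for inference-time computation and for the development of future medical benchmarks.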

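Proposed Method

Systematically evaluate o1-preview on medical benchmarks including MedQA, MedMCQA, MMLU (Medical), NCLEX, and JMLE-2024.
Compare o1-preview with GPT-4 and GPT-4o, each with and without Medprompt-style strategies.
Examine prompt variants (zero-shot, few-shot, individual Medprompt components) and ensembling methods.
Analyze reasoning-token usage and its effect on performance.
Assess cost against accuracy across run-time strategies using API token pricing.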
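
The ensembling and cost-accounting steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the prices in the usage note are placeholders rather than real API rates, and the sketch assumes that any reasoning tokens are already counted inside `output_tokens`.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    best = max(counts.values())
    # Walk the original order so that ties resolve deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer


def estimate_cost_usd(input_tokens, output_tokens,
                      usd_per_m_input, usd_per_m_output, n_calls=1):
    """Estimate the cost of n_calls identical API calls.

    Prices are expressed per one million tokens. output_tokens is
    assumed to include any reasoning tokens billed as output.
    """
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return n_calls * per_call
```

For example, `majority_vote(["B", "A", "B", "C", "B"])` selects `"B"`, and a five-call ensemble is priced with `estimate_cost_usd(1000, 2000, 10.0, 30.0, n_calls=5)` using the placeholder prices 10.0 and 30.0. Because cost grows linearly with the number of calls while the accuracy gain from voting typically does not, the ensemble size is one of the knobs that determines where a strategy lands in the cost-accuracy plane.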

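Experimental Results

Research Questions

RQ1: How does o1-preview perform across medical benchmarks relative to GPT-4 with Medprompt?
RQ2: Do classic Medprompt prompting techniques provide a benefit on reasoning-native models such as o1-preview?
RQ3: How do reasoning tokens and ensembling affect accuracy and cost across run-time strategies?
RQ4: Is there a cost-accuracy Pareto frontier among run-time strategies on medical benchmarks?
RQ5: What are the implications for inference-time computation and for benchmark development in medical AI?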

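Key Findings

o1-preview often outperforms Medprompt-guided GPT-4 on several medical benchmarks, even with simple prompts.
Few-shot prompting tends to lower o1-preview's performance, whereas ensembling delivers consistent accuracy gains at higher cost.
More reasoning tokens are generally associated with higher accuracy for o1-preview, and explicit step-by-step (chain-of-thought) prompting is less advisable.
GPT-4o offers a favorable cost-accuracy balance and can surpass older Medprompt configurations on many tasks.
o1-preview shows strong non-English medical reasoning on JMLE-2024, and run-time strategies improve the results further.
o1-preview is approaching saturation on existing medical benchmarks, which highlights the need for new, more challenging tasks.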
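
To illustrate what a cost-accuracy Pareto frontier means in this setting, the sketch below keeps only the strategies that no other strategy beats on both cost and accuracy. The function name, strategy labels, and numbers are hypothetical and chosen purely for illustration; they are not results from the paper.

```python
def pareto_frontier(strategies):
    """Return the non-dominated (name, cost, accuracy) tuples, cheapest first.

    A strategy is dominated when another one is no more expensive and
    no less accurate, and strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, acc in strategies:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in strategies
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda s: s[1])


# Hypothetical (name, cost in USD, accuracy) points, for illustration only.
example = [
    ("cheap-zero-shot", 1.0, 0.80),
    ("cheap-ensemble", 5.0, 0.85),
    ("mid-wasteful", 6.0, 0.82),
    ("expensive-zero-shot", 20.0, 0.93),
]
frontier_names = [name for name, _, _ in pareto_frontier(example)]
```

Here `mid-wasteful` drops out because `cheap-ensemble` is both cheaper and more accurate, while the other three remain: each buys additional accuracy only at additional cost. This is the sense in which a cheaper model with steering and a more expensive reasoning model can both sit on the frontier.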

This digest was generated by AI and reviewed by a human editor.