QUICK REVIEW

[Paper Review] From Medprompt to o1: Exploration of Run-Time Strategies for Medical Challenge Problems and Beyond

Harsha Nori, Naoto Usuyama|arXiv (Cornell University)|2024. 11. 06.

Biomedical and Engineering Education | Citations: 8

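One-Line Summary

This paper evaluates OpenAI's o1-preview on medical benchmarks, compares it with Medprompt-enhanced GPT-4, and analyzes prompting strategies, reasoning tokens, and the cost-performance trade-off of run-time reasoning on medical tasks.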

ABSTRACT

Run-time steering strategies like Medprompt are valuable for guiding large language models (LLMs) to top performance on challenging tasks. Medprompt demonstrates that a general LLM can be focused to deliver state-of-the-art performance on specialized domains like medicine by using a prompt to elicit a run-time strategy involving chain of thought reasoning and ensembling. OpenAI's o1-preview model represents a new paradigm, where a model is designed to do run-time reasoning before generating final responses. We seek to understand the behavior of o1-preview on a diverse set of medical challenge problem benchmarks. Following on the Medprompt study with GPT-4, we systematically evaluate the o1-preview model across various medical benchmarks. Notably, even without prompting techniques, o1-preview largely outperforms the GPT-4 series with Medprompt. We further systematically study the efficacy of classic prompt engineering strategies, as represented by Medprompt, within the new paradigm of reasoning models. We found that few-shot prompting hinders o1's performance, suggesting that in-context learning may no longer be an effective steering approach for reasoning-native models. While ensembling remains viable, it is resource-intensive and requires careful cost-performance optimization. Our cost and accuracy analysis across run-time strategies reveals a Pareto frontier, with GPT-4o representing a more affordable option and o1-preview achieving state-of-the-art performance at higher cost. Although o1-preview offers top performance, GPT-4o with steering strategies like Medprompt retains value in specific contexts. Moreover, we note that the o1-preview model has reached near-saturation on many existing medical benchmarks, underscoring the need for new, challenging benchmarks. We close with reflections on general directions for inference-time computation with LLMs.

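Motivation and Objectives

Evaluate o1-preview against GPT-4 with Medprompt across a range of medical benchmarks.
Investigate whether classic Medprompt prompting still helps when the model is reasoning-native.
Analyze how prompting strategy, reasoning-token usage, and ensembling affect performance and cost.
Explore whether a cost-accuracy Pareto frontier exists across run-time strategies.
Discuss implications for inference-time computation in medicine and for future benchmark development.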

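Proposed Method

Systematically evaluate o1-preview on medical benchmarks including MedQA, MedMCQA, MMLU (Medical), NCLEX, and JMLE-2024.
Compare o1-preview with GPT-4 and GPT-4o, with and without Medprompt-style strategies.
Examine prompt variants: zero-shot, few-shot, individual Medprompt components, and ensembling approaches.
Analyze reasoning-token usage and its effect on performance.
Evaluate cost versus accuracy across run-time strategies using API token pricing.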
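
The ensembling and token-pricing steps listed above can be illustrated with a minimal sketch. It assumes a plain majority vote over several sampled answers and a simple per-million-token price model; the function names and all numeric values are hypothetical and are not taken from the paper or from any provider's price list.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first.

    Raises ValueError on an empty list.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def call_cost_usd(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Cost of one API call, given prices in USD per million tokens (assumed model)."""
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


def ensemble_cost_usd(k, input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """An ensemble of k independent calls costs roughly k times a single call."""
    return k * call_cost_usd(input_tokens, output_tokens, price_in_per_m, price_out_per_m)


if __name__ == "__main__":
    # Five hypothetical sampled answers to one multiple-choice question.
    sampled = ["A", "B", "A", "C", "A"]
    print("ensemble answer:", majority_vote(sampled))
    # Placeholder token counts and prices, for illustration only.
    print("ensemble cost (USD):", ensemble_cost_usd(5, 1000, 2000, 10.0, 30.0))
```

The sketch makes the trade-off concrete: the vote can only change the answer if the samples disagree, while the cost grows linearly with the number of runs, which is why ensembling needs cost-performance tuning.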

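Experimental Results

Research Questions

RQ1: How does o1-preview perform across medical benchmarks compared with GPT-4 using Medprompt prompting?
RQ2: Do classic Medprompt prompting techniques benefit o1-preview, a reasoning-native model?
RQ3: How do reasoning tokens and ensembling affect accuracy and cost in run-time strategies?
RQ4: Is there a cost-accuracy Pareto frontier across run-time strategies for medical benchmarks?
RQ5: What are the implications for inference-time computation and benchmark development in medical AI?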

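Key Findings

Even with simple prompting, o1-preview often outperforms Medprompt-guided GPT-4 on several medical benchmarks.
Few-shot prompting usually hurts o1-preview's performance, whereas ensembling yields consistent accuracy gains at higher cost.
More reasoning tokens generally correlate positively with o1-preview's accuracy, and explicit CoT prompting is less recommended.
GPT-4o offers a favorable cost-accuracy balance and can outperform older Medprompt configurations on many tasks.
o1-preview shows strong non-English medical reasoning on JMLE-2024, and run-time strategies improve the results further.
o1-preview is close to saturation on many existing medical benchmarks, underscoring the need for new, more challenging ones.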
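
To make the cost-accuracy Pareto frontier concrete, here is a minimal sketch of how one could pick the non-dominated configurations from a list of (name, cost, accuracy) measurements. The configuration names and numbers are invented placeholders, not results from the paper, and the function name is hypothetical.

```python
def pareto_frontier(points):
    """Return the non-dominated (name, cost, accuracy) tuples, sorted by cost.

    A point is kept only if it is strictly more accurate than every
    point that costs the same or less.
    """
    # Sort by ascending cost; at equal cost, put higher accuracy first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in ordered:
        if accuracy > best_accuracy:
            frontier.append((name, cost, accuracy))
            best_accuracy = accuracy
    return frontier


if __name__ == "__main__":
    # Placeholder measurements: (configuration, cost in USD, accuracy).
    runs = [
        ("cheap-zero-shot", 1.0, 0.80),
        ("cheap-ensemble", 5.0, 0.85),
        ("mid-dominated", 6.0, 0.82),
        ("pricey-zero-shot", 20.0, 0.95),
        ("pricey-few-shot", 25.0, 0.93),
    ]
    for name, cost, accuracy in pareto_frontier(runs):
        print(f"{name}: cost={cost}, accuracy={accuracy}")
```

In this toy data the dominated configurations drop out because a cheaper option is at least as accurate, which mirrors the shape of the finding above: a cheaper model with steering occupies the low-cost end of the frontier and a stronger model occupies the high-accuracy end.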

This review was generated by AI and reviewed by a human editor.