QUICK REVIEW

[論文レビュー] Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

Jungo Kasai, Yuhei Kasai|arXiv (Cornell University)|Mar 31, 2023

Artificial Intelligence in Healthcare and Education被引用数 50

ひとこと要約

GPT-4 は日本の国家医師免許試験に合格し、ChatGPT や GPT-3 を上回るが、依然として大半の医学生には及ばず、禁忌肢の選択などの制限や日本語トークン化によるコスト増高といった課題を示す。

ABSTRACT

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.

研究の動機と目的

大型言語モデルが2018年から2023年までの日本の National Medical Practitioners Qualifying Examination (NMPQE) を扱えるかを評価する。
非英語の国別医療試験に対する GPT-3、ChatGPT、GPT-4 の性能を比較する。
日本語の医療文脈における現在のLLM API の領域別・言語特有の制限を特定する。
多様な非英語のLLM評価・開発を促進するため、Igaku QA をネイティブ日本語ベンチマークとして公表する。

提案手法

Igaku QA に対してクローズドブックプロンプトを用いて LLM API（GPT-3、ChatGPT、GPT-4）をベンチマークする。
モデルの回答を促すため、2006年の日本の医療免許試験から3つのインコンテキストの例を使用する。
MCQ 形式と固定数値の質問のため、厳密一致で回答を評価する。
多言語 prompting の効果を検証するため、英語に問題を翻訳する ChatGPT-EN バリアントを含める。
日本語での禁忌肢、トークン化コスト、文脈ウィンドウの制限などの要因を分析する。

Figure 1: Example problem from the Japanese medical licensing exam where ChatGPT chooses a prohibited choice (禁忌肢) because euthanasia is illegal in Japan . Test takers who choose four or more prohibited choices would fail regardless of their exam total scores (§ 2.2 ). The problem and the ChatGPT ou

実験結果

リサーチクエスチョン

RQ1GPT-4 は複数年（2018–2023）にわたる日本の NMPQE を合格できるか？
RQ2GPT-3、ChatGPT、GPT-4 は互いに、そして Igaku QA における医学生の成績とどう比較されるか？
RQ3日本語の医療免許問題に適用した場合の現在のLLMの主な制限は何か（例：禁忌肢、地理的・時系列の文脈、トークン化コスト）？

主な発見

GPT-4 はテスト対象モデルの中で最良の性能を達成する。
GPT-4 は全6年間（2018–2023）の日本の医療免許試験に合格する。
GPT-4 は医学生の多数決の成績には大きく及ばない。
一部の LLM 出力は時折、日本の医療実務で避けるべき禁忌肢を選択する。
日本語のトークン化は API コストを押し上げ、英語と比較して最大文脈ウィンドウを狭くする。
ChatGPT-EN（英語プロンプト/翻訳）はしばしVanilla ChatGPTよりも優れており、明示的な翻訳なしでは多言語の限界を示す。

Figure 2: Standard timeline comparisons of the medical licensing processes in Japan and the United States. The Japanese system involves only one national licensing examination (NMPQE) towards the end of the six-year medical school education. The United States Medical Licensing Examination (USMLE) co

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。