QUICK REVIEW

[Paper Review] Translating Radiology Reports into Plain Language using ChatGPT and GPT-4 with Prompt Learning: Promising Results, Limitations, and Potential

Qing Lyu, Josh Tan|arXiv (Cornell University)|Mar 16, 2023

Artificial Intelligence in Healthcare and Education | References: 10 | Citations: 28

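One-Line Summary

This paper evaluates the use of ChatGPT and GPT-4 with prompt learning to translate radiology reports into plain language, finding that translation quality is promising and the suggestions are useful, while noting inconsistencies and remaining limitations.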

ABSTRACT

The large language model called ChatGPT has drawn extensive attention because of its human-like expression and reasoning abilities. In this study, we investigate the feasibility of using ChatGPT to translate radiology reports into plain language for patients and healthcare providers so that they are educated for improved healthcare. Radiology reports from 62 low-dose chest CT lung cancer screening scans and 76 brain MRI metastases screening scans were collected in the first half of February for this study. According to the evaluation by radiologists, ChatGPT can successfully translate radiology reports into plain language with an average score of 4.27 in the five-point system, with 0.08 places of information missing and 0.07 places of misinformation. The suggestions provided by ChatGPT are generally relevant, such as following up with doctors and closely monitoring any symptoms, and for about 37% of the 138 cases in total ChatGPT offers specific suggestions based on findings in the report. ChatGPT also shows some randomness in its responses, with occasionally over-simplified or neglected information, which can be mitigated using a more detailed prompt. Furthermore, ChatGPT results are compared with those of a newly released large model, GPT-4, showing that GPT-4 can significantly improve the quality of translated reports. Our results show that it is feasible to utilize large language models in clinical education, and further efforts are needed to address limitations and maximize their potential.

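Motivation and Objectives

Assess the feasibility of using ChatGPT and GPT-4 to translate radiology reports into plain language for patients and healthcare providers.
Evaluate the quality of the translations and the usefulness of the generated suggestions for patients and providers.
Investigate how prompt design affects translation quality, and what role prompt optimization and ensemble methods play.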

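Proposed Method

Collected 62 chest CT lung cancer screening reports and 76 brain MRI metastases screening reports from a clinical database.
Applied three prompts to ChatGPT: translation into plain language, suggestions for patients, and suggestions for healthcare providers.
Had radiologists evaluate the translations for completeness, accuracy, and overall quality.
Compared GPT-4 with ChatGPT using the same prompts and evaluation framework.
Examined prompt optimization, prompt engineering variations, and ensemble translation to assess their effect on translation quality.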
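
The three-prompt setup described above can be sketched roughly as follows. This is only an illustration: the prompt wording in `PROMPTS`, the function `run_prompts`, and the stand-in `echo_model` are all hypothetical and are not quoted from the paper, and the sketch assumes each prompt is sent independently together with the report text. A real experiment would pass a function that calls an actual chat model in place of `echo_model`.

```python
from typing import Callable, Dict

# Hypothetical prompt wording; the review does not quote the exact prompts used.
PROMPTS: Dict[str, str] = {
    "translation": (
        "Please translate the following radiology report into plain language:"
        "\n\n{report}"
    ),
    "patient_suggestions": (
        "Please provide suggestions for the patient based on this radiology report:"
        "\n\n{report}"
    ),
    "provider_suggestions": (
        "Please provide suggestions for the healthcare provider based on this "
        "radiology report:\n\n{report}"
    ),
}


def run_prompts(report: str, ask: Callable[[str], str]) -> Dict[str, str]:
    """Send each of the three prompts for one report and collect the replies."""
    return {
        task: ask(template.format(report=report))
        for task, template in PROMPTS.items()
    }


def echo_model(prompt: str) -> str:
    """Stand-in for a real chat-model call; returns the prompt's first line."""
    return prompt.splitlines()[0]


if __name__ == "__main__":
    outputs = run_prompts("No suspicious pulmonary nodules.", echo_model)
    for task, reply in outputs.items():
        print(task, "->", reply)
```

Keeping the model call behind the `ask` parameter means the same prompts can be reused unchanged for a second model, which mirrors the ChatGPT versus GPT-4 comparison in the method list.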

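Experimental Results

Research Questions

RQ1: Can ChatGPT and GPT-4 translate radiology reports into accurate, patient-friendly plain language?
RQ2: How do radiologists rate the quality of the translated reports in terms of missing information and misinformation?
RQ3: Do prompts and prompt optimization substantially improve translation quality and the usefulness of the generated suggestions?
RQ4: How do different prompting strategies, including ensemble methods, compare in translation performance?
RQ5: What are the limitations and potential safety considerations for clinical use?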

Key Findings

group	information missing	incorrect information	overall score
Chest CT	0.097	0.032	4.645
Brain MRI	0.066	0.092	3.961
Overall	0.080	0.065	4.268
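
As a quick consistency check on the table, the Overall row appears to be the case-count-weighted mean of the two modality rows. The sketch below assumes the weights are the 62 chest CT and 76 brain MRI reports stated in the abstract; the variable and function names are illustrative only.

```python
# Per-modality rows copied from the table; case counts come from the abstract.
rows = {
    "Chest CT": {"n": 62, "missing": 0.097, "incorrect": 0.032, "score": 4.645},
    "Brain MRI": {"n": 76, "missing": 0.066, "incorrect": 0.092, "score": 3.961},
}


def weighted_mean(metric: str) -> float:
    """Average a metric over both modalities, weighting each by its case count."""
    total_cases = sum(row["n"] for row in rows.values())
    return sum(row["n"] * row[metric] for row in rows.values()) / total_cases


overall = {m: round(weighted_mean(m), 3) for m in ("missing", "incorrect", "score")}
print(overall)  # {'missing': 0.08, 'incorrect': 0.065, 'score': 4.268}
```

The rounded values match the Overall row (0.080, 0.065, 4.268), which supports reading that row as an average over all 138 reports rather than a simple mean of the two modality rows.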

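ChatGPT's translations received an average radiologist rating of 4.268 out of 5.
Information was missing in an average of 0.097 places per chest CT report and 0.066 places per brain MRI report (0.080 overall); misinformation averaged 0.065 places per report overall (0.032 for chest CT, 0.092 for brain MRI).
By modality, 76% of chest CT translations and 32% of brain MRI translations received a score of 5, as reported.
GPT-4's translations substantially outperformed ChatGPT's with both the original and the optimized prompt, coming close to perfect under some conditions (for example, 96.8% rated good with the optimized prompt).
Compared with the vague prompt, the optimized prompt substantially improved completeness and reduced omissions and misinterpretations (for example, the share of good translations rose from 55.2% to 77.2%).
Specific, report-based suggestions for patients or providers were obtained in about 37% of cases; most suggestions were general but appropriate (for example, following up with a doctor or communicating the findings).
In many scenarios, prompt engineering and ensemble methods offered no meaningful improvement over the optimized prompt, and they sometimes produced oversimplification or minor omissions.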


This review was generated by AI and checked by a human editor.