QUICK REVIEW

[Paper Review] EvoClinician: A Self-Evolving Agent for Multi-Turn Medical Diagnosis via Test-Time Evolutionary Learning

Yufei He, Juncheng Liu | arXiv (Cornell University) | Jan 30, 2026

Machine Learning in Healthcare | Cited by 0

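One-Sentence Summary

EvoClinician is an agent that learns at test time: it runs a Diagnose-Grade-Evolve loop that updates its prompt and memory from one case to the next to improve multi-turn medical diagnosis, and on the Med-Inquire benchmark it reaches higher diagnostic quality than the baselines at lower cost.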

ABSTRACT

Prevailing medical AI operates on an unrealistic "one-shot" model, diagnosing from a complete patient file. However, real-world diagnosis is an iterative inquiry where clinicians sequentially ask questions and order tests to strategically gather information while managing cost and time. To address this, we first propose Med-Inquire, a new benchmark designed to evaluate an agent's ability to perform multi-turn diagnosis. Built upon a dataset of real-world clinical cases, Med-Inquire simulates the diagnostic process by hiding a complete patient file behind specialized Patient and Examination agents. They force the agent to proactively ask questions and order tests to gather information piece by piece. To tackle the challenges posed by Med-Inquire, we then introduce EvoClinician, a self-evolving agent that learns efficient diagnostic strategies at test time. Its core is a "Diagnose-Grade-Evolve" loop: an Actor agent attempts a diagnosis; a Process Grader agent performs credit assignment by evaluating each action for both clinical yield and resource efficiency; finally, an Evolver agent uses this feedback to update the Actor's strategy by evolving its prompt and memory. Our experiments show EvoClinician outperforms continual learning baselines and other self-evolving agents like memory agents. The code is available at https://github.com/yf-he/EvoClinician

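Motivation and Goals

Move toward realistic, iterative clinical diagnosis, in which information is gathered through questions and tests rather than read from a complete initial patient file.
Introduce Med-Inquire, a benchmark that hides the complete patient file from the agent and evaluates both diagnosis grade and resource cost.
Propose EvoClinician, a self-evolving agent that updates its prompt and external memory between cases using action-level feedback.
Show that action-level grading and test-time learning (TTL) updates improve diagnostic quality and efficiency across multiple backbone models.
Provide ablation studies that identify the components responsible for the performance gains.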

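Proposed Method

Present Med-Inquire as a sequential diagnosis environment with Patient and Examination gatekeepers and a cost model.
Introduce a three-role test-time learning (TTL) loop: Actor (diagnosis), Process Grader (action-level credit assignment), and Evolver (prompt and memory updates).
Provide action-level feedback through the labels HIGH_YIELD, INEFFICIENT, and CRITICAL_ERROR to guide strategy evolution.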
The Evolver updates the external memory and the Actor's prompt (rules) based on the action-level feedback.
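Evaluate on multiple LLM backbones under a fixed turn limit, comparing against static-prompt, static-memory, and prompt-optimization baselines.
Use cost-aware scoring so that accuracy gains cannot be bought through over-testing, and show that prompt evolution and memory evolution bring complementary benefits.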
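The three-role loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ActorState`, `evolve`, `run_cases`, the toy actor and grader, and the wording of the generated rules are all hypothetical. Only the three label names and the Diagnose, Grade, Evolve ordering come from the summary.

```python
from dataclasses import dataclass, field

# Label names taken from the summary; everything else here is hypothetical.
LABELS = ("HIGH_YIELD", "INEFFICIENT", "CRITICAL_ERROR")


@dataclass
class ActorState:
    """What persists between cases: prompt rules and external memory."""
    prompt_rules: list = field(default_factory=list)
    memory: list = field(default_factory=list)


def _add_once(items, entry):
    """Append entry only if it is not already stored."""
    if entry not in items:
        items.append(entry)


def evolve(state, graded_actions):
    """Toy Evolver: map each (action, label) pair to a memory or rule update."""
    for action, label in graded_actions:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if label == "HIGH_YIELD":
            _add_once(state.memory, f"Useful: {action}")
        elif label == "INEFFICIENT":
            _add_once(state.prompt_rules, f"Avoid low-value step: {action}")
        else:  # CRITICAL_ERROR
            _add_once(state.prompt_rules, f"Never repeat: {action}")
    return state


def run_cases(cases, actor, grader, state):
    """Run the Diagnose -> Grade -> Evolve loop over a sequence of cases."""
    for case in cases:
        actions = actor(case, state)    # Diagnose: Actor acts with current state
        graded = grader(case, actions)  # Grade: one label per action
        state = evolve(state, graded)   # Evolve: update rules and memory
    return state


# Toy stand-ins for the LLM-backed Actor and Process Grader (assumptions).
def toy_actor(case, state):
    return ["AskQuestion: onset of symptoms", "OrderTest: whole-body MRI"]


def toy_grader(case, actions):
    return [(a, "INEFFICIENT" if "MRI" in a else "HIGH_YIELD") for a in actions]


final_state = run_cases(["case-1", "case-2"], toy_actor, toy_grader, ActorState())
print(final_state.prompt_rules)
print(final_state.memory)
```

In a real system the actor and grader would be LLM calls, and the Evolver would rewrite the prompt rather than append fixed strings; the sketch only shows how per-action labels, rather than a single end-of-episode score, can drive separate prompt and memory updates.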

Figure 1: EvoClinician architecture and test-time learning loop. The Actor interacts with the Med-Inquire environment through AskQuestion and OrderTest, receiving responses from the Patient and Examination agents, while the Cost Estimator tracks resource use. After SubmitDiagnosis, the Judge assi
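To make the interaction pattern in the caption concrete, here is a small sketch of an environment that hides a patient file behind question and test actions and tracks cost and turns. It is an assumption-laden toy, not the Med-Inquire code: the class name `ToyInquiryEnv`, the flat per-question cost, the per-test cost table, the exact-match grading, and the example case are all invented for illustration. In the benchmark itself the Patient, Examination, and Judge roles are agents.

```python
class ToyInquiryEnv:
    """Hypothetical sequential-diagnosis environment with a hidden patient file."""

    def __init__(self, history, test_results, diagnosis, test_costs,
                 question_cost=1.0, max_turns=10):
        self._history = history            # hidden: topic -> patient answer
        self._test_results = test_results  # hidden: test name -> result
        self._diagnosis = diagnosis        # hidden: reference diagnosis
        self._test_costs = test_costs      # assumed cost table per test
        self._question_cost = question_cost
        self.max_turns = max_turns
        self.turns = 0
        self.total_cost = 0.0

    def _use_turn(self):
        # Enforce the fixed turn limit before any information is revealed.
        if self.turns >= self.max_turns:
            raise RuntimeError("turn limit reached; submit a diagnosis")
        self.turns += 1

    def ask_question(self, topic):
        """Patient gatekeeper: reveal only the answer to the topic asked."""
        self._use_turn()
        self.total_cost += self._question_cost
        return self._history.get(topic, "The patient cannot answer that.")

    def order_test(self, name):
        """Examination gatekeeper: reveal one test result and charge its cost."""
        self._use_turn()
        self.total_cost += self._test_costs.get(name, 0.0)
        return self._test_results.get(name, "Test not available.")

    def submit_diagnosis(self, diagnosis):
        """Toy judge: exact match gives 100, anything else 0 (an assumption)."""
        match = diagnosis.strip().lower() == self._diagnosis.lower()
        return {"grade": 100.0 if match else 0.0,
                "turns": self.turns,
                "cost": self.total_cost}


# Made-up case used only to exercise the interface.
env = ToyInquiryEnv(
    history={"fever": "Yes, for three days."},
    test_results={"chest x-ray": "Right lower lobe consolidation."},
    diagnosis="pneumonia",
    test_costs={"chest x-ray": 50.0},
)
env.ask_question("fever")
env.order_test("chest x-ray")
result = env.submit_diagnosis("Pneumonia")
print(result)
```

Because every action consumes a turn and adds to the running cost, an agent in this setting cannot raise its grade simply by ordering every available test.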

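Experimental Results

Research Questions

RQ1: In sequential medical diagnosis, does test-time learning with action-level feedback improve diagnostic accuracy and resource efficiency?
RQ2: Does evolving both the prompt and the memory across cases yield larger gains than evolving a single component or using static baselines?
RQ3: How do different backbone models affect the gains from EvoClinician and TTL-based adaptation?
RQ4: How does dense action-level grading affect credit assignment and long-horizon decisions?
RQ5: Is prompt/memory evolution robust to the cost model and to real-world variation (case complexity)?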

Key Findings

Method (gemini-3-pro backbone)	S (mean diagnosis grade)	T (mean turns)	C (mean cost)
Static Prompt	48.2	9.8	1380
RAG	50.7	10.3	1490
Mem0	51.2	10.1	1450
Evo-Memory	52.0	10.0	1435
Prompt Optimization Agent (EvoPrompt)	53.6	9.7	1360
GEPA	49.4	10.6	1540
Evolutionary Agent (EvoTest)	57.9	9.4	1605
EvoClinician	59.8	9.1	1275

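EvoClinician achieves a higher mean diagnosis grade than the baselines on every backbone model (for example, on gemini-3-pro, 59.8 vs. 48.2 for Static Prompt).
Self-evolving methods score higher than the non-adaptive baselines, and EvoClinician also lowers cost (1275 vs. 1380 for Static Prompt on gemini-3-pro), which illustrates the value of TTL updates.
Action-level grading is critical: replacing it with transcript-only feedback reduces both accuracy and cost efficiency.
Prompt evolution and memory evolution give complementary gains; combining them works better than using either alone.
Compared with EvoTest, EvoClinician has a lower total cost with a similar or smaller number of turns, suggesting that more targeted updates reduce the work needed in subsequent episodes.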
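The comparisons in these findings can be checked directly against the table. The short script below recomputes the differences for the gemini-3-pro rows; the helper name `delta_vs` is hypothetical, the numbers are copied from the table, and reading T as turns and C as cost follows the column letters and the surrounding text.

```python
# (S, T, C) values copied from the gemini-3-pro table above.
rows = {
    "Static Prompt": (48.2, 9.8, 1380),
    "EvoTest": (57.9, 9.4, 1605),
    "EvoClinician": (59.8, 9.1, 1275),
}


def delta_vs(baseline, method="EvoClinician"):
    """Return grade gain, turn change, and relative cost change vs. a baseline."""
    s0, t0, c0 = rows[baseline]
    s1, t1, c1 = rows[method]
    return {
        "grade_gain": round(s1 - s0, 1),
        "turn_change": round(t1 - t0, 1),
        "cost_change_pct": round(100 * (c1 - c0) / c0, 1),
    }


print(delta_vs("Static Prompt"))
# {'grade_gain': 11.6, 'turn_change': -0.7, 'cost_change_pct': -7.6}
print(delta_vs("EvoTest"))
# {'grade_gain': 1.9, 'turn_change': -0.3, 'cost_change_pct': -20.6}
```

On these numbers the grade gap to EvoTest is small (1.9 points), while the cost gap is large (about 21% lower), which matches the claim that the advantage over EvoTest is mainly one of efficiency.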

Figure 2: Running-mean learning curves on Med-Inquire over $N=915$ cases (fixed evaluation order). Left: running mean Judge grade $\bar{S}_{1:t}=\frac{1}{t}\sum_{i=1}^{t}S_{i}$, where $S_{i}\in[0,100]$ is the per-case diagnosis grade. Right: running mean cost $\bar{C}_{1:t}=\frac{1}{t}\sum_{i=1}^{
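The running mean in the caption can be computed incrementally, without re-summing all earlier cases at each step. The sketch below uses the standard update $\bar{S}_{1:t}=\bar{S}_{1:t-1}+(S_{t}-\bar{S}_{1:t-1})/t$, which is algebraically equal to the caption's definition. The function name and the sample grades are illustrative and not taken from the paper; applying the same function to per-case costs assumes the cost curve is defined analogously, which the truncated caption suggests but does not show in full.

```python
def running_means(values):
    """Return [mean(values[:1]), mean(values[:2]), ...] via incremental updates."""
    means = []
    mean = 0.0
    for t, x in enumerate(values, start=1):
        mean += (x - mean) / t  # equals (1/t) * sum of the first t values
        means.append(mean)
    return means


# Illustrative per-case grades in [0, 100] (made up for the example).
grades = [50, 70, 60]
print(running_means(grades))  # [50.0, 60.0, 60.0]
```

Because the evaluation order is fixed, a rising running-mean grade together with a falling running-mean cost is what one would expect to see if the agent improves as it processes more cases.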

This review was generated by AI and reviewed by a human editor.