QUICK REVIEW

[Paper Review] RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

Hanbum Ko, Chanhui Lee | arXiv (Cornell University) | Mar 13, 2026

Machine Learning in Materials Science | Cited by 0

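One-Sentence Summary

RetroReasoner is a reasoning LLM for retrosynthesis that follows chemist-style bond-disconnection strategies, is trained on synthetic reasoning data, and is refined with round-trip-reward reinforcement learning to improve the feasibility and diversity of its predictions.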

ABSTRACT

Retrosynthesis prediction is a core task in organic synthesis that aims to predict reactants for a given product molecule. Traditionally, chemists select a plausible bond disconnection and derive corresponding reactants, which is time-consuming and requires substantial expertise. While recent advancements in molecular large language models (LLMs) have made progress, many methods either predict reactants without strategic reasoning or conduct only a generic product analysis, rather than reason explicitly about bond-disconnection strategies that logically lead to the choice of specific reactants. To overcome these limitations, we propose RetroReasoner, a retrosynthetic reasoning model that leverages chemists' strategic thinking. RetroReasoner is trained using both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we introduce SyntheticRetro, a framework that generates structured disconnection rationales alongside reactant predictions. In the case of RL, we apply a round-trip accuracy as reward, where predicted reactants are passed through a forward synthesis model, and predictions are rewarded when the forward-predicted product matches the original input product. Experimental results show that RetroReasoner not only outperforms prior baselines but also generates a broader range of feasible reactant proposals, particularly in handling more challenging reaction instances.

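Motivation and Objectives

Drive retrosynthesis prediction with explicit strategic reasoning that mirrors chemists' bond-disconnection strategies.
Develop a data-generation framework (SyntheticRetro) that outputs structured rationales alongside reactant predictions.
Train RetroReasoner by supervised fine-tuning (SFT) on SyntheticRetro data, then refine it with reinforcement learning (RL) using a round-trip reward.
Demonstrate gains in accuracy and in the feasibility and diversity of reactant proposals, especially on difficult and rare reaction types.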

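Proposed Method

SyntheticRetro produces structured reasoning data (R1–R4) plus linking text, turning chemists' strategies into training rationales.
RetroReasoner starts from the Qwen3-8B model and is trained in two stages: SFT on SyntheticRetro targets, then RL with a round-trip accuracy reward.
RL uses GRPO (Group Relative Policy Optimization) with a forward-synthesis verifier, rewarding reactant sets that reproduce the original product.
The forward-model verifier predicts the product from the proposed reactants, and the result determines the round-trip reward used for policy updates.
Evaluation covers both greedy-decoding and sampling-based metrics, with emphasis on the feasibility and diversity of the proposed reaction routes.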
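
The round-trip reward and the group-relative step of GRPO described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `canonical`, `toy_forward_model`, and the lookup table are hypothetical stand-ins (a real pipeline would use a learned forward-synthesis model and a cheminformatics toolkit for SMILES canonicalization), and the policy update itself (clipped objective, KL term) is omitted.

```python
from statistics import mean, pstdev
from typing import Callable, List


def canonical(smiles: str) -> str:
    """Toy canonicalization: sort '.'-separated fragments.

    Assumption: stands in for proper SMILES canonicalization, so it does
    not recognize two different strings for the same molecule.
    """
    return ".".join(sorted(p.strip() for p in smiles.split(".") if p.strip()))


def round_trip_reward(product: str, reactants: str,
                      forward_model: Callable[[str], str]) -> float:
    """1.0 if the forward model maps the reactants back to the product."""
    predicted = forward_model(reactants)
    return 1.0 if canonical(predicted) == canonical(product) else 0.0


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize each reward against the group sampled for one product."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical forward model: a lookup table instead of a learned predictor.
TOY_FORWARD = {
    canonical("CC(=O)O.OCC"): "CC(=O)OCC",   # acetic acid + ethanol
    canonical("CC(=O)Cl.OCC"): "CC(=O)OCC",  # acetyl chloride + ethanol
}


def toy_forward_model(reactants: str) -> str:
    return TOY_FORWARD.get(canonical(reactants), "")


product = "CC(=O)OCC"  # ethyl acetate
group = ["OCC.CC(=O)O", "CC(=O)Cl.OCC", "CCO.CC", "CC(=O)O.OCC"]
rewards = [round_trip_reward(product, r, toy_forward_model) for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Note that two distinct reactant sets both earn the full reward here, which is how a round-trip signal can credit feasible alternatives that an exact-match signal would reject.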

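Experimental Results

Research Questions

RQ1: Does explicit, chemist-like strategic reasoning improve retrosynthesis prediction over prediction-only LLMs?
RQ2: Does training on SyntheticRetro reasoning data and applying round-trip RL yield broader, more diverse, and more feasible reactant proposals?
RQ3: How do RetroReasoner's performance and diversity hold up on rare-template and rare atom/symbol instances?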

Key Findings

Model	Exact@1	Round-trip@1	Exact@100	Round-trip@100	Feasible Ratio	Template Diversity
Prediction-Only (SFT)	0.482	0.784	0.678	0.950	0.774	2.562
Prediction-Only (RL)	0.486	0.802	0.662	0.936	0.785	2.324
RetroReasoner (SFT)	0.512	0.812	0.734	0.944	0.765	3.898
RetroReasoner (RL)	0.526	0.826	0.724	0.952	0.786	3.186
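
As a rough guide to reading the @k columns, the sketch below computes Exact@k and Round-trip@k under the commonly used definitions: a hit if any of the first k predictions equals the recorded reactants, or maps back to the product under a forward model. These definitions are an assumption, since this summary does not spell them out, and the data, molecule labels, and `toy_forward` table are hypothetical placeholders rather than results from the paper.

```python
from typing import Dict, List

# Hypothetical forward model: maps a reactant string to its predicted product.
toy_forward: Dict[str, str] = {"A.B": "P1", "A.C": "P1", "D.E": "P2"}


def forward(reactants: str) -> str:
    return toy_forward.get(reactants, "")


def exact_at_k(preds: List[str], truth: str, k: int) -> bool:
    """Hit if the recorded reactants appear among the first k predictions."""
    return truth in preds[:k]


def round_trip_at_k(preds: List[str], product: str, k: int) -> bool:
    """Hit if any of the first k predictions maps back to the product."""
    return any(forward(p) == product for p in preds[:k])


def rate(flags: List[bool]) -> float:
    """Fraction of examples that scored a hit."""
    return sum(flags) / len(flags)


# Placeholder examples; strings are assumed to be canonicalized already.
examples = [
    {"product": "P1", "truth": "A.B", "preds": ["A.C", "A.B", "X.Y"]},
    {"product": "P2", "truth": "D.E", "preds": ["X.Y", "F.G", "D.E"]},
]

exact1 = rate([exact_at_k(e["preds"], e["truth"], 1) for e in examples])
rt1 = rate([round_trip_at_k(e["preds"], e["product"], 1) for e in examples])
exact3 = rate([exact_at_k(e["preds"], e["truth"], 3) for e in examples])
rt3 = rate([round_trip_at_k(e["preds"], e["product"], 3) for e in examples])
print(exact1, rt1, exact3, rt3)
```

In the first example the top prediction differs from the recorded reactants yet still yields the product, so it counts for Round-trip@1 but not Exact@1. This is one reason round-trip scores in the table sit above exact-match scores.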

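RetroReasoner outperforms the prediction-only baselines on exact match and on most round-trip metrics (the exception is Round-trip@100 for the SFT variant, 0.944 vs. 0.950), with the largest gains in Exact@100 (0.734 vs. 0.678 under SFT) and Template Diversity (3.898 vs. 2.562 under SFT).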
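When RL follows SFT, accuracy improves (Exact@1 rises from 0.512 to 0.526) and the feasible reactant space broadens, while RL reduces reasoning diversity as the policy concentrates on feasible regions (Template Diversity falls from 3.898 to 3.186).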
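RetroReasoner remains robust on hard subsets, including rare-template and rare atom/symbol instances.
Adding linking text between the structured reasoning steps markedly improves both exact match and diversity.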
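The round-trip reward expands the feasible reactant space; keeping exact-match metrics high at the same time depends on the round-trip framework.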

This review was generated by AI and reviewed by a human editor.