QUICK REVIEW

[Paper Digest] When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

Bogdan Zagribelnyy, Ivan Ilin|arXiv (Cornell University)|Feb 3, 2026
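Machine Learning in Materials Science | Cited by 0
One-Sentence Summary

The paper proposes ChemCensor, a plausibility-based metric for single-step retrosynthesis, builds the CREED dataset (about 6.4 million reactions) and the URSA-expert-2026 benchmark, and shows that training the Chemistry Constraint–Consistent Language Model (C3LM) on CREED improves retrosynthesis performance over standard baselines.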

ABSTRACT

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

Motivation and Objectives

  • Highlight the limitations of exact-match Top-K metrics for evaluating single-step retrosynthesis (SSRS).
  • Propose ChemCensor to evaluate chemical plausibility using reaction precedents and functional-group context.
  • Create CREED as a large, validated reaction dataset for model training and benchmarking.
  • Introduce URSA-expert-2026 as an expert-annotated, out-of-domain benchmark for SSRS.
  • Demonstrate that training C3LM on CREED outperforms baselines on both URSA-expert-2026 and USPTO-50K-test.
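
The gap between exact-match and plausibility-based evaluation can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function names, the toy score table, and the threshold of 3 are hypothetical and are not taken from the paper, and reactant strings are assumed to be already canonicalized SMILES.

```python
from __future__ import annotations
from typing import Callable, Sequence


def normalize(reactants: str) -> frozenset[str]:
    """Treat a dot-separated reactant string as an unordered set of molecules."""
    return frozenset(part for part in reactants.split(".") if part)


def top_k_exact_match(predictions: Sequence[str], ground_truth: str, k: int) -> bool:
    """True if any of the first k predictions equals the single recorded answer."""
    target = normalize(ground_truth)
    return any(normalize(p) == target for p in predictions[:k])


def top_k_plausible(
    predictions: Sequence[str],
    score_fn: Callable[[str], int],
    k: int,
    threshold: int,
) -> bool:
    """True if any of the first k predictions reaches the plausibility threshold."""
    return any(score_fn(p) >= threshold for p in predictions[:k])


# Hypothetical example: the recorded route uses a carboxylic acid, while the
# model proposes the acid chloride. Both are reasonable amide disconnections.
ground_truth = "CC(=O)O.NCc1ccccc1"
predictions = ["CC(=O)Cl.NCc1ccccc1"]

# Toy stand-in for a plausibility scorer on a 0-5 scale (values are made up).
TOY_SCORES = {normalize("CC(=O)Cl.NCc1ccccc1"): 5}


def toy_score(reactants: str) -> int:
    return TOY_SCORES.get(normalize(reactants), 0)


exact_hit = top_k_exact_match(predictions, ground_truth, k=1)
plausible_hit = top_k_plausible(predictions, toy_score, k=1, threshold=3)
```

In this toy case the exact-match metric counts the prediction as a miss, while the plausibility check accepts it, which is the kind of disagreement the first bullet describes.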

Proposed Method

  • Define ChemCensor (CC) as a precedent-based score from 0 to 5, computed from reaction centers (RC) and functional-group (FG) context and aligned with a knowledge base derived from USPTO-full.
  • Decompose reactions into RC and FG signatures and compare them against a curated precedent library to assign CC scores.
  • Construct CREED (~6.4M reactions) using forward and retro generation pipelines, CC-based verification, and decontamination against USPTO-50K-test.
  • Assemble URSA-expert-2026 as a benchmark of 100 novel, expert-annotated targets whose synthetic feasibility was validated by chemists.
  • Train C3LM on CREED (with or without USPTO-full data) using supervised fine-tuning, optionally with reasoning traces; then apply reinforcement-learning fine-tuning with a ChemCensor reward and Molecular Transformer (MT)-based signals.
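
The precedent-lookup idea behind the first two bullets can be sketched as follows. This is a toy illustration only: the `ReactionSignature` class, the string labels, and the mapping from functional-group overlap to a 0-5 score are hypothetical assumptions, not the scoring rules used by ChemCensor, which this digest does not specify.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReactionSignature:
    """Hypothetical signature: one reaction-center label plus surrounding FG labels."""
    reaction_center: str
    functional_groups: frozenset[str]


def toy_plausibility_score(
    query: ReactionSignature, precedents: Iterable[ReactionSignature]
) -> int:
    """Return a toy 0-5 score (assumed rule, not the paper's).

    0   : no precedent shares the query's reaction center.
    1-5 : a precedent shares the reaction center; the score grows with the
          best fraction of the query's functional groups seen alongside it.
    """
    same_rc = [p for p in precedents if p.reaction_center == query.reaction_center]
    if not same_rc:
        return 0
    if not query.functional_groups:
        return 5  # nothing in the context that could be incompatible
    best_overlap = max(
        len(query.functional_groups & p.functional_groups) / len(query.functional_groups)
        for p in same_rc
    )
    return 1 + int(4 * best_overlap)


# Hypothetical precedent library with a single entry.
library = [
    ReactionSignature("amide_coupling", frozenset({"ester", "aryl_halide"})),
]

fully_supported = ReactionSignature("amide_coupling", frozenset({"ester"}))
partly_supported = ReactionSignature("amide_coupling", frozenset({"ester", "nitro"}))
unsupported = ReactionSignature("suzuki_coupling", frozenset({"ester"}))

scores = [
    toy_plausibility_score(sig, library)
    for sig in (fully_supported, partly_supported, unsupported)
]
```

The design choice worth noting is the two-stage check: the reaction center must have a precedent at all, and only then does the functional-group context raise or lower the score.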

Experimental Results

Research Questions

  • RQ1: Can chemical plausibility-based metrics better capture retrosynthetic quality than Top-K exact-match metrics?
  • RQ2: How well do LLMs perform on SSRS when evaluated with ChemCensor and URSA-expert-2026 compared to traditional benchmarks?
  • RQ3: Does training on a large, plausibility-verified dataset (CREED) improve generalization and RC/FG compatibility in SSRS?
  • RQ4: What is the impact of reasoning traces and reinforcement learning rewards on plausibility-oriented retrosynthesis outputs?
  • RQ5: How transferable are improvements from CREED-trained models to standard benchmarks like USPTO-50K?

Key Findings

  • ChemCensor provides a 0-5 plausibility score reflecting precedent support and RC/FG compatibility for retrosynthetic steps.
  • URSA-expert-2026 presents a harder out-of-domain benchmark where model performance drops compared with USPTO-50K; many baselines struggle with plausibility.
  • CREED contains ~6.4 million reactions and is validated for chemical plausibility, enabling diverse yet plausible SSRS outputs.
  • C3LM trained on CREED (CREED-only or CREED+USPTO-full) achieves the best or near-best ChemCensor scores on both URSA-expert-2026 and USPTO-50K-test.
  • Reasoning-enabled fine-tuning improves Avg. PT-Max CC, indicating better coverage of plausible reaction contexts; RL with ChemCensor reward boosts plausibility but may affect diversity.
  • Molecular-Transformer rewards can degrade URSA-expert-2026 performance due to domain generalization limits.


This digest was generated by AI and reviewed by human editors.